AI Technology Radar

Model Benchmarks & Leaderboards

benchmarking, knowledge
Adopt

Model benchmarks and leaderboards exist for a variety of evaluation dimensions relevant to model selection and discovery.

In this article we provide an overview of the most relevant benchmarks and leaderboards, as well as a summary of model limitations to keep in mind.

Performance & Benchmarks (MMLU etc.):

As large language models (LLMs) and generative AI (GenAI) advance rapidly, robust and targeted benchmarks are essential to assess their real-world capabilities, domain-specific expertise, reasoning power, trustworthiness and environmental impact. The following represent a selection of the most relevant benchmarks used to evaluate the next generation of AI models.

Overview of Important Benchmarks:

Category | Benchmark(s) | Purpose & Focus Area
General Knowledge & Reasoning | MMLU-Pro, GPQA, BIG-Bench Hard, AGIEval, Humanity’s Last Exam | Measures general and domain-specific expertise (law, physics, medicine) at varying difficulty levels
Mathematical Reasoning | MATH (OpenAI), MathVista, FrontierMath | Symbolic and numerical reasoning, including chain-of-thought and visual math; mathematical task solving
Instruction Following & Multi-Hop Logic | IFEval, MUSR, LongBench | Evaluates complex instruction following, task planning, and stepwise logic
Data Analytics & Querying | BIRD, DataSciBench, Spider 2.0 | SQL generation, structured data understanding, table-to-text generation
Coding, Software & Data Science | LiveCodeBench, DS-1000, CodeContests, MultiPL-E, DSBench | Functional correctness of code generation, complex algorithmic tasks, data science workflows (e.g. pandas, sklearn), multiple programming languages, data science expert tasks
Multimodal Reasoning | MMMU, MathVista | Text + image reasoning in STEM and professional domains
Tool Use & Agent Behavior | ToolBench, WebArena | Real-world agent performance using tools, APIs, and browsers
Specific Tool Benchmarks | SpreadSheetBench, BrowseComp | Benchmarks for specific common tools such as browsers, Excel, …
Safety & Trustworthiness | TruthfulQA, ToxiGen, HHEM | Hallucination resistance, ethical alignment, toxicity filtering
Efficiency & Environmental Impact | CO2-Cost | Energy usage and carbon emissions of training/inference pipelines
Vision Models (LVLMs) & Multimodality | VLMEvalKit, MMMU, VizWiz | Evaluation of models that handle both images and text
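
Most of the benchmarks above are ultimately reported as exact-match or multiple-choice accuracy. As a rough, harness-agnostic illustration, the sketch below scores an MMLU-style multiple-choice task; the Item record and the ask_model function are hypothetical placeholders for whatever data format and model client you actually use.

```python
from dataclasses import dataclass

# Hypothetical record format for an MMLU-style item: a question,
# a list of answer options, and the index of the correct option.
@dataclass
class Item:
    question: str
    options: list[str]   # e.g. four answer choices
    answer_idx: int      # index of the correct option

def ask_model(prompt: str) -> str:
    """Placeholder: call your model here and return its raw text reply."""
    raise NotImplementedError

def score(items: list[Item]) -> float:
    """Exact-match accuracy over multiple-choice items."""
    letters = "ABCD"
    correct = 0
    for item in items:
        prompt = (
            item.question
            + "\n"
            + "\n".join(f"{letters[i]}. {opt}" for i, opt in enumerate(item.options))
            + "\nAnswer with a single letter."
        )
        reply = ask_model(prompt).strip().upper()
        # Take the first A-D letter in the reply as the model's choice.
        predicted = next((c for c in reply if c in letters), None)
        if predicted == letters[item.answer_idx]:
            correct += 1
    return correct / len(items)
```

For results that stay comparable to published leaderboard numbers, an established harness such as EleutherAI's lm-evaluation-harness, which implements many common benchmarks including several listed above, is preferable to hand-rolled scoring.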

Leaderboards

1. Popular Leaderboards:

2. Open Model Benchmarks:

3. Multimodal & Vision Models:

4. Function-Calling:

5. Code:

6. Embedding Models:

7. Safety & Trustworthiness / Censored Models:

8. Cost (Model Provider Cost): (a cost-per-request sketch follows after this list)

9. Hardware Performance:

10. Hallucination:
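
To make the cost dimension (item 8) concrete: provider cost comparisons typically boil down to per-token prices multiplied by the expected input and output token volumes. The sketch below shows that calculation; the prices and token counts are made-up placeholders, not actual provider pricing.

```python
# Hypothetical per-million-token prices in USD; placeholder values only,
# not actual provider pricing.
PRICES = {
    "provider_a": {"input": 3.00, "output": 15.00},
    "provider_b": {"input": 0.50, "output": 1.50},
}

def cost_per_request(provider: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request: tokens times price per token (prices are per 1M tokens)."""
    p = PRICES[provider]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example: a request with 2,000 input tokens and 500 output tokens.
for name in PRICES:
    print(name, round(cost_per_request(name, 2_000, 500), 5))
```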

Links