Model benchmarks and leaderboards are available for the evaluation dimensions most relevant to model selection and discovery.
This article provides an overview of the most relevant benchmarks and leaderboards, together with a summary of model limitations to keep in mind.
Performance & Benchmarks (MMLU etc.):
As large language models (LLMs) and generative AI (GenAI) advance rapidly, robust and targeted benchmarks are essential to assess their real-world capabilities, domain-specific expertise, reasoning power, trustworthiness and environmental impact. The following represents a selection of the most relevant benchmarks used to evaluate the next generation of AI models.
Overview of Important Benchmarks:
Category | Benchmark(s) | Purpose & Focus Area |
---|---|---|
General Knowledge & Reasoning | MMLU-Pro, GPQA, BIG-Bench Hard, AGIEval, Humanity’s Last Exam | Measures general and domain-specific expertise (law, physics, medicine) at varying difficulty levels |
Mathematical Reasoning | MATH (OpenAI), MathVista, FrontierMath | Symbolic and numerical reasoning, including chain-of-thought and visual math task solving |
Instruction Following & Multi-Hop Logic | IFEval, MUSR, LongBench | Evaluates complex instruction following, task planning, and stepwise logic |
Data Analytics & Querying | BIRD, DataSciBench, Spider 2.0 | SQL generation, structured data understanding, table-to-text generation |
Coding, Software & Data Science | LiveCodeBench, DS-1000, CodeContests, MultiPL-E, DSBench | Functional correctness of generated code, complex algorithmic tasks, data science workflows (e.g. pandas, sklearn), multi-language programming, expert-level data science tasks |
Multimodal Reasoning | MMMU, MathVista | Text + image reasoning in STEM and professional domains |
Tool Use & Agent Behavior | ToolBench, WebArena | Real-world agent performance using tools, APIs, and browsers |
Specific Tool Benchmarks | SpreadSheetBench, BrowseComp | Benchmarks for common tools such as browsers and spreadsheets (Excel) |
Safety & Trustworthiness | TruthfulQA, ToxiGen, HHEM | Hallucination resistance, ethical alignment, toxicity filtering |
Efficiency & Environmental Impact | CO2-Cost | Energy usage and carbon emissions of training/inference pipelines |
Vision Models (LVLMs) & Multimodality | VLMEvalKit, MMMU, VizWiz | Evaluation of models that handle both image and text inputs |
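
Many of the benchmarks above can also be reproduced locally with open evaluation harnesses. The following is a minimal sketch, assuming EleutherAI's lm-evaluation-harness (`pip install lm-eval`) as the tooling; the harness is not part of the benchmark definitions themselves, and the task name, the example model id and the result keys are illustrative assumptions that depend on the installed version.

```python
# Sketch: running a knowledge benchmark (e.g. MMLU) locally with lm-evaluation-harness.
# Task names and result structure vary by harness version; the model id below is only an example.
import json

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                           # Hugging Face transformers backend
    model_args="pretrained=Qwen/Qwen2.5-0.5B-Instruct",   # any Hub model id (illustrative choice)
    tasks=["mmlu"],                                       # e.g. "mmlu_pro" or "ifeval" if available
    num_fewshot=5,                                        # MMLU is conventionally reported 5-shot
    batch_size=8,
    limit=50,                                             # subset per task, for a quick smoke test only
)

# Per-task metrics (accuracy etc.) live under results["results"]
print(json.dumps(results["results"], indent=2, default=str))
```

Note that a subsampled run (`limit=50`) is only a smoke test; published leaderboard numbers are computed on the full benchmark splits.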
Leaderboards
1. Popular Leaderboards:
- Chatbot Arena LLM Leaderboard: community-driven evaluation and ranking of LLMs and AI chatbots
- Artificial Analysis Leaderboards
- Hugging Face List of Leaderboards on the Hub & Model Catalogs on Hugging Face (see the Hub query sketch after this list)
- Multilingual MMLU
2. Open Model Benchmarks:
3. Multimodal & Vision Models:
- OpenVLM Leaderboard of VLMEvalKit Evaluations
- MMMU Leaderboard - Understanding and Reasoning Benchmark for Expert AGI
- TTS (Text-to-Speech) Arena for natural-sounding speech
4. Function-Calling:
5. Code:
6. Embedding Models:
7. Safety & Trustworthiness / Censored Models:
8. Cost (Model Provider Cost):
9. Hardware Performance:
10. Hallucination:
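
Beyond browsing the leaderboards above, the Hugging Face model catalogs mentioned under item 1 can also be queried programmatically. The sketch below uses the huggingface_hub client; the chosen task filter, sort key and limit are illustrative assumptions, and download counts are only a rough popularity proxy, not a quality ranking like the leaderboards measure.

```python
# Sketch: programmatic model discovery on the Hugging Face Hub via huggingface_hub.
# Filters shown (task, sort key, limit) are example choices, not a fixed recipe.
from huggingface_hub import list_models

# Top text-generation models by download count (popularity proxy, not a benchmark score).
for model in list_models(
    task="text-generation",   # pipeline tag to filter on
    sort="downloads",         # other options include "likes" or "last_modified"
    direction=-1,             # descending order
    limit=10,
):
    print(model.id, getattr(model, "downloads", None))
```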
Links
- For a broader overview of model discovery, see Model Discovery.