Model benchmarks and leaderboards are available for the evaluation dimensions most relevant to model selection and discovery.
This article provides an overview of the most relevant benchmarks and leaderboards, together with a summary of model limitations to keep in mind.
Performance & Benchmarks (MMLU etc.):
As large language models (LLMs) and generative AI (GenAI) advance rapidly, robust and targeted benchmarks are essential to assess their real-world capabilities, domain-specific expertise, reasoning power, trustworthiness and environmental impact. The following represents a selection of the most relevant benchmarks used to evaluate the next generation of AI models.
Overview of Important Benchmarks:
Category | Benchmark(s) | Purpose & Focus Area |
---|---|---|
General Knowledge & Reasoning | MMLU-Pro, GPQA, BIG-Bench Hard, AGIEval, Humanity’s Last Exam | Measures general and domain-specific expertise (law, physics, medicine) at varying difficulty levels |
Mathematical Reasoning | MATH (OpenAI), MathVista, FrontierMath | Symbolic and numerical reasoning, including chain-of-thought and visual math task solving |
Instruction Following & Multi-Hop Logic | IFEval, MUSR, LongBench | Evaluates complex instruction following, task planning, and stepwise logic |
Data Analytics & Querying | BIRD, DataSciBench, Spider 2.0 | SQL generation, structured data understanding, table-to-text generation |
Coding, Software & Data Science | LiveCodeBench, DS-1000, CodeContests, MultiPL-E, DSBench | Functional correctness of generated code, complex algorithmic tasks, data science workflows (e.g. pandas, sklearn), multi-language programming, expert-level data science tasks |
Multimodal Reasoning | MMMU, MathVista | Text + image reasoning in STEM and professional domains |
Tool Use & Agent Behavior | ToolBench, WebArena | Real-world agent performance using tools, APIs, and browsers |
Specific Tool Benchmarks | SpreadSheetBench, BrowseComp | Benchmarks for common tools such as browsers and spreadsheets (Excel) |
Safety & Trustworthiness | TruthfulQA, ToxiGen, HHEM | Hallucination resistance, ethical alignment, toxicity filtering |
Efficiency & Environmental Impact | CO2-Cost | Energy usage and carbon emissions of training/inference pipelines |
Vision Models (LVLMs) & Multimodality | VLMEvalKit, MMMU, VizWiz | Evaluation of models that handle both image and text inputs |
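
Many of the benchmarks above can also be reproduced locally with open evaluation harnesses. The following is a minimal sketch, assuming EleutherAI's lm-evaluation-harness (`pip install lm-eval`) as the tooling; the harness is not part of the benchmark definitions themselves, and the task name, the example model id and the result keys are illustrative assumptions that depend on the installed version.

```python
# Sketch: running a knowledge benchmark (e.g. MMLU) locally with lm-evaluation-harness.
# Task names and result structure vary by harness version; the model id below is only an example.
import json

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                           # Hugging Face transformers backend
    model_args="pretrained=Qwen/Qwen2.5-0.5B-Instruct",   # any Hub model id (illustrative choice)
    tasks=["mmlu"],                                       # e.g. "mmlu_pro" or "ifeval" if available
    num_fewshot=5,                                        # MMLU is conventionally reported 5-shot
    batch_size=8,
    limit=50,                                             # subset per task, for a quick smoke test only
)

# Per-task metrics (accuracy etc.) live under results["results"]
print(json.dumps(results["results"], indent=2, default=str))
```

Note that a subsampled run (`limit=50`) is only a smoke test; published leaderboard numbers are computed on the full benchmark splits.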
Leaderboards
1. Popular Leaderboards:
- Chatbot Arena LLM Leaderboard: community-driven evaluation and ranking of LLMs and AI chatbots
- Artificial Analysis Leaderboards
- Hugging Face List of Leaderboards on the Hub & Model Catalogs on Hugging Face (see the Hub query sketch after this list)
- Multilingual MMLU
2. Open Model Benchmarks:
3. Multimodal & Vision Models:
- OpenVLM Leaderboard of VLMEvalKit Evaluations
- MMMU Leaderboard - Understanding and Reasoning Benchmark for Expert AGI
- TTS (Text-to-Speech) Arena for natural-sounding speech
4. Function-Calling:
5. Code:
6. Embedding Models:
7. Safety & Trustworthiness / Censored Models:
8. Cost (Model Provider Cost):
9. Hardware Performance:
10. Hallucination:
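
Beyond browsing the leaderboards above, the Hugging Face model catalogs mentioned under item 1 can also be queried programmatically. The sketch below uses the huggingface_hub client; the chosen task filter, sort key and limit are illustrative assumptions, and download counts are only a rough popularity proxy, not a quality ranking like the leaderboards measure.

```python
# Sketch: programmatic model discovery on the Hugging Face Hub via huggingface_hub.
# Filters shown (task, sort key, limit) are example choices, not a fixed recipe.
from huggingface_hub import list_models

# Top text-generation models by download count (popularity proxy, not a benchmark score).
for model in list_models(
    task="text-generation",   # pipeline tag to filter on
    sort="downloads",         # other options include "likes" or "last_modified"
    direction=-1,             # descending order
    limit=10,
):
    print(model.id, getattr(model, "downloads", None))
```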
Links
- For a broader overview of model discovery, see Model Discovery.