LLM Benchmarks

Major Benchmarks and What They Test

  • Terminal-Bench 2.0: Tests whether a model can actually operate in a real terminal environment: writing shell scripts, manipulating files, running CLI tools, chaining commands. It's not just "write code that looks right" but "does this actually run and work in a shell".
  • SWE-bench Verified: Drops the model into real GitHub repositories to fix actual bugs. The model has to navigate unfamiliar codebases, understand the issue context, and produce a working patch. "Verified" means human engineers have curated the task set for quality. Widely considered the most realistic coding benchmark right now.
  • OSWorld-Verified: Tests computer use as an agent - the model has to interact with a real desktop GUI: clicking buttons, opening apps, filling forms, dragging files. Measures whether AI can literally operate a computer like a human would.
  • t2-bench: Tests agentic tool use in real business scenarios, specifically the Retail and Telecom domains. The model has to call the right tools in the right sequence to complete multi-step workflows like checking inventory, processing orders, and handling customer requests. Telecom scores tend to be higher because telecom queries are more structured.
  • MCP-Atlas: Tests "scaled tool use," meaning the model has access to a large number of tools (think hundreds of MCP tools) and must pick the right ones and combine them correctly. Very relevant to the current MCP ecosystem where models are being connected to dozens of services at once.
  • BrowseComp: Tests agentic web search. The model has to browse the web autonomously across multiple steps to answer complex research questions — not a single Google search, but a full multi-hop investigation.
  • HLE (Humanity's Last Exam): Probably the hardest general reasoning benchmark right now. Questions are written by PhD-level domain experts and designed so that even Googling can't easily find the answer. Tested both without tools (pure reasoning) and with tools (model can search). Even Opus 4.6 only hits 53% with tools, meaning this benchmark is far from saturated — there's still a huge gap to close.
  • Finance Agent v1.1: Evaluates financial analysis as an agentic task: reading earnings reports, calculating metrics like EBITDA and P/E ratios, interpreting SEC filings, making investment judgments. Combines tool use with financial domain knowledge.
  • GDPval-AA Elo: Rather than a percentage score, this uses an Elo rating system to measure real-world work output quality across 44 different professions — legal briefs, engineering specs, slide decks, etc. The idea is to estimate how much economic value the model produces.
  • ARC-AGI-2: Tests novel problem-solving and inductive reasoning. The model is shown a few visual grid pattern examples and must figure out the underlying rule, then apply it to a new case it's never seen. There's no way to memorize answers — you have to genuinely generalize. This is specifically designed to resist benchmark contamination and measure something closer to actual intelligence.
  • GPQA Diamond: Graduate-level Google-Proof Q&A. PhD-level science questions in physics, chemistry, and biology where even domain experts without access to their notes score around 65%. "Diamond" is the hardest tier.
  • MMMU-Pro: Multimodal multi-discipline university-level exam questions that require understanding both images and text together — charts, diagrams, lab results. Tested with and without tools.
  • MMMLU: The multilingual version of MMLU, where the same knowledge questions are asked across many languages. Tests whether a model's knowledge and reasoning hold up outside of English.
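The pairwise Elo scheme behind GDPval-AA can be sketched with the standard Elo update rule: each head-to-head comparison of two models' work outputs shifts their ratings based on how surprising the outcome was. This is a generic Elo sketch, not GDPval-AA's actual scoring code; the K-factor of 32 and the ratings below are illustrative placeholders.

```python
def expected_score(r_a: float, r_b: float) -> float:
    # Probability that A beats B under the Elo model (400-point logistic scale)
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, outcome: float, k: float = 32.0):
    # outcome: 1.0 if A's output wins the comparison, 0.5 for a tie, 0.0 for a loss.
    # Each rating moves by k times (actual result minus expected result).
    e_a = expected_score(r_a, r_b)
    return r_a + k * (outcome - e_a), r_b + k * ((1.0 - outcome) - (1.0 - e_a))

# Two equally rated models; A's legal brief is judged better:
new_a, new_b = elo_update(1200.0, 1200.0, 1.0)  # ratings diverge to 1216 / 1184
```

The appeal of Elo here is that it only needs relative judgments ("which output is better?"), which are easier to collect reliably for open-ended professional work than absolute percentage scores.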
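The metrics named in the Finance Agent bullet reduce to simple arithmetic once the inputs have been extracted from a filing. A minimal sketch, with made-up numbers purely for illustration:

```python
def pe_ratio(share_price: float, eps: float) -> float:
    # Price-to-earnings: what the market pays per dollar of earnings
    return share_price / eps

def ebitda(operating_income: float, depreciation: float, amortization: float) -> float:
    # Operating income with non-cash D&A charges added back
    return operating_income + depreciation + amortization

pe = pe_ratio(150.0, 6.0)            # 25.0
e = ebitda(100.0, 20.0, 5.0)         # 125.0 (in the same units as the inputs)
```

What the benchmark actually stresses is everything around this arithmetic: finding the right line items in an earnings report or SEC filing via tool calls before any formula can be applied.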
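The ARC-AGI-2 setup described above, inferring a rule from a few grid examples and applying it to an unseen case, can be illustrated with a toy solver. Real ARC tasks involve far richer transformations; this sketch handles only one hypothetical rule family (cell-wise value substitution) to show the induce-then-apply shape of the problem.

```python
def infer_substitution(examples):
    # Try to learn a single cell-wise value mapping consistent with
    # every (input_grid, output_grid) training pair.
    mapping = {}
    for inp, out in examples:
        for row_in, row_out in zip(inp, out):
            for a, b in zip(row_in, row_out):
                if a in mapping and mapping[a] != b:
                    return None  # pairs contradict: rule is not a substitution
                mapping[a] = b
    return mapping

def apply_substitution(mapping, grid):
    # Generalize: apply the learned rule to a grid never seen in training
    return [[mapping.get(v, v) for v in row] for row in grid]

examples = [([[1, 0], [0, 1]], [[2, 0], [0, 2]])]  # rule: 1 -> 2, 0 stays
rule = infer_substitution(examples)
result = apply_substitution(rule, [[1, 1], [0, 0]])  # [[2, 2], [0, 0]]
```

The point of the benchmark is exactly what this toy cannot do: the rule family itself is unknown and changes per task, so there is nothing to memorize and the model must search for the generalization.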