What is your preferred site to see AI scores on different AI tests?

I’m looking for a website that ranks AI models on different tasks like OCR, researching, etc. I’m not looking for a one-time article that will never be updated. I’m looking for a website that is updated frequently enough.

What is your favorite website for this?

AICodeKing on YouTube seems to use a website where a person can sort on any column on the web page, but I could not read the URL in the video.


I think many people check leaderboards for that purpose.

Also, since leaderboards essentially rank models based on benchmarks, they aren’t particularly well-suited to models specialized in narrow tasks, so it’s safer to use other channels as well. (On HF, that includes Posts, Blog, Hub Models, Spaces, etc.)


The public favorites are not one site. They cluster into two general-purpose sites and several specialist leaderboards.

The two names that come up most often are Artificial Analysis and LLM Stats. In a recent discussion asking “What LLM benchmarking sites do you use?”, one person said they use Artificial Analysis because it combines results from multiple benchmarks, while another listed LiveBench, Artificial Analysis, SWE-bench, ContextArena, and others as their regular set. An AI news roundup likewise grouped artificialanalysis.ai and llm-stats.com together as recommended benchmarking resources. (Reddit)

The shortest answer

If you want one site to browse first, I would point you to LLM Stats.
If you want one site to trust more for methodology, I would point you to Artificial Analysis.
If you want human preference rankings, use LM Arena.
If you want freshness and anti-contamination, use LiveBench.
If you care about OCR or research agents, jump to OCRBench v2 and DeepResearch Bench / GAIA instead of stopping at a general leaderboard. (LLM Stats)

What people seem to like most, and why

1. Artificial Analysis

This is the closest thing to a serious all-purpose benchmark dashboard. Its homepage says it provides independent analysis of AI and compares models on intelligence, speed, and price, plus provider performance. Its methodology page says its Artificial Analysis Intelligence Index v4.0.2 combines 10 evaluations across reasoning, knowledge, math, and programming, and it explicitly says the suite is text-only and English-only, with image, speech, and multilingual performance benchmarked separately. That methodological clarity is a big reason people trust it more than random scoreboards. (Artificial Analysis)
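For intuition, a composite index like this is essentially a weighted average of per-benchmark scores. A toy sketch (the evaluation names, scores, and equal weights below are placeholders of mine, not Artificial Analysis’s actual suite or weights):

```python
# Hypothetical per-benchmark scores for one model, each on a 0-100 scale.
# Names and equal weighting are illustrative only, not Artificial
# Analysis's actual evaluations or weights.
sub_scores = {
    "reasoning_eval": 71.0,
    "knowledge_eval": 64.5,
    "math_eval": 82.0,
    "coding_eval": 58.5,
}

weights = {name: 1 / len(sub_scores) for name in sub_scores}  # equal weights

composite = sum(weights[name] * score for name, score in sub_scores.items())
print(f"composite index: {composite:.1f}")  # 69.0
```

The point of publishing the recipe is that anyone can recompute or critique the composite, which is exactly what makes it easier to trust than an opaque score.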

Why people like it:

  • It is broad.
  • It explains how its composite score is built.
  • It includes practical dimensions like speed and price, not just benchmark scores. (Artificial Analysis)

Why it is not enough alone:

  • Its flagship Intelligence Index is not your OCR leaderboard.
  • It is not your deep-research-agent leaderboard.
  • It is a good overview, but it still has to be paired with specialist benchmarks for your exact task. (Artificial Analysis)

2. LLM Stats

This is the strongest public match to your description of “a sortable site where a person can sort any column on the web page.” Its homepage shows a last-updated date (Mar 27 at the time of writing), says it tracks 275 models, and exposes a big leaderboard with columns like Code, Arena, GPQA, SWE-bench, Context, and input/output pricing. It also has separate areas for leaderboards, benchmarks, compare pages, arenas, and news. That makes it very good as a fast scanning and comparison site. (LLM Stats)
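If you want to replicate that “sort on any column” behavior over numbers you copy out of such a table yourself, it is a one-liner. A minimal sketch with made-up rows (the column names mirror the kinds of fields LLM Stats displays; the values are placeholders, not real scores):

```python
# Made-up leaderboard rows; values are placeholders, not real scores.
rows = [
    {"model": "model-a", "gpqa": 59.1, "swe_bench": 33.2, "input_price": 3.00},
    {"model": "model-b", "gpqa": 51.4, "swe_bench": 41.7, "input_price": 0.25},
    {"model": "model-c", "gpqa": 64.8, "swe_bench": 28.9, "input_price": 15.00},
]

def sort_by(rows, column, descending=True):
    """Sort leaderboard rows on any column, like clicking a column header."""
    return sorted(rows, key=lambda r: r[column], reverse=descending)

for row in sort_by(rows, "swe_bench"):  # best SWE-bench score first
    print(row["model"], row["swe_bench"])

for row in sort_by(rows, "input_price", descending=False):  # cheapest first
    print(row["model"], row["input_price"])
```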

Why people like it:

  • It is dense and practical.
  • It is easy to scan.
  • It mixes benchmark scores with cost and context window info. (LLM Stats)

Why it is not enough alone:

  • It is more of an aggregator and comparison hub than a single deep evaluation methodology.
  • It is excellent for “what should I look at next?” and weaker for “which specialist benchmark is the right ground truth for OCR or research?” (LLM Stats)

The likely site used in those AICodeKing videos

The strongest public guess is LLM Stats. I cannot verify the specific video from here, but among current public sites, LLM Stats matches your description best because it has the sortable many-column leaderboard layout and broad comparison surface. Artificial Analysis is also possible, but its interface is more analysis-centric than “big sortable comparison grid.” (LLM Stats)

The other sites people keep using

3. LM Arena

LM Arena is different from the others. It is not mainly a static benchmark site. Its FAQ says user votes shape rankings through the Bradley-Terry rating system, and models stay anonymous until after the vote. Business Insider’s 2025 interview with LM Arena’s CTO says the site had grown to over 3 million monthly users and supports rankings across text, coding, vision, and image generation. (Aurea)
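To make the Bradley-Terry idea concrete: each model gets a latent strength s, the predicted chance that A beats B is s_A / (s_A + s_B), and the strengths are fit to the observed votes. A minimal sketch of that fit using the standard iterative (MM) update on hypothetical vote counts; this is the textbook algorithm, not LM Arena’s actual pipeline:

```python
# Hypothetical head-to-head vote counts: wins[(a, b)] = times a beat b.
wins = {
    ("model-x", "model-y"): 7, ("model-y", "model-x"): 3,
    ("model-x", "model-z"): 6, ("model-z", "model-x"): 4,
    ("model-y", "model-z"): 5, ("model-z", "model-y"): 5,
}

models = {m for pair in wins for m in pair}
strength = {m: 1.0 for m in models}  # initial Bradley-Terry strengths

# MM updates for the Bradley-Terry maximum-likelihood fit, where
# p(a beats b) = s_a / (s_a + s_b).
for _ in range(200):
    new = {}
    for m in models:
        total_wins = sum(c for (a, _), c in wins.items() if a == m)
        denom = sum(
            c / (strength[m] + strength[b if a == m else a])
            for (a, b), c in wins.items() if m in (a, b)
        )
        new[m] = total_wins / denom
    scale = len(models) / sum(new.values())  # fix the scale (BT is scale-free)
    strength = {m: s * scale for m, s in new.items()}

for m, s in sorted(strength.items(), key=lambda kv: -kv[1]):
    print(f"{m}: {s:.3f}")  # model-x ranks first on this toy data
```

Leaderboards like this typically display the result on an Elo-like scale, but the underlying ranking logic is this pairwise-preference fit.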

Why people like it:

  • It reflects human preference, not just benchmark math.
  • It is good for “which answer feels better in practice?” (Aurea)

Why people distrust it:

  • Community discussions say it can reward verbosity or style rather than the exact capability you care about.
  • Even supporters tend to treat it as one signal, not the only one. (Reddit)

4. LiveBench

LiveBench is one of the main answers from people who care about stale benchmarks and contamination. In that recent community thread, one commenter included LiveBench among their go-to sites. The official site says it updates questions regularly so the benchmark completely refreshes every 6 months, while the project materials say questions are added and updated monthly and scored against objective ground truth rather than an LLM judge. (Reddit)
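The ground-truth point is easy to make concrete: each question ships with a known correct answer, and scoring is a direct comparison, with no judge model in the loop. A minimal sketch (the normalization rules here are my assumption, not LiveBench’s actual grader):

```python
def normalize(answer: str) -> str:
    """Crude normalization so trivial formatting differences don't count."""
    return " ".join(answer.strip().lower().split())

def score(prediction: str, ground_truth: str) -> int:
    """1 if the prediction matches the known answer, else 0.
    No LLM judge: the stored ground truth itself decides."""
    return int(normalize(prediction) == normalize(ground_truth))

# Hypothetical question with a fixed, objective answer.
print(score("  42 ", "42"))      # 1
print(score("forty-two", "42"))  # 0
```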

Why people like it:

  • It is fresher than old static leaderboards.
  • It is explicitly designed around contamination limits.
  • It covers broad categories like math, coding, reasoning, language, instruction following, and data analysis. (GitHub)

Why it is not enough alone:

  • It is a benchmark, not a giant comparison dashboard.
  • It is strong for “what still has signal,” weaker for “show me everything in one sortable interface.” (LiveBench)

For OCR and research, people should not stop at general leaderboards

Your examples matter. You mentioned OCR and researching. Those are specialized enough that a general site can mislead you.

OCR: OCRBench v2

OCRBench v2’s official site says it aims to update every quarter. Its paper says it is a large-scale bilingual text-centric benchmark with 4× more tasks than the earlier OCRBench, 31 scenarios, 10,000 human-verified QA pairs, and a private test set with 1,500 manually annotated images. That is far closer to real OCR evaluation than a generic “best model overall” chart. (99Franklin)
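To see why OCR needs its own leaderboard, note what an OCR score actually measures: string-level agreement between the model’s reading and a human-verified transcription, not open-ended answer quality. A minimal sketch using a character-level similarity ratio (my simplification; OCRBench v2 defines its own task-specific metrics):

```python
from difflib import SequenceMatcher

def ocr_similarity(prediction: str, reference: str) -> float:
    """Character-level similarity in [0, 1] between a model's reading
    and a human-verified transcription."""
    return SequenceMatcher(None, prediction, reference).ratio()

# Hypothetical QA pair; the reference is the human-verified text.
reference = "Invoice No. 10482, Total: $1,250.00"
prediction = "Invoice No. 10482, Total: $1,250,00"  # one character misread

print(f"{ocr_similarity(prediction, reference):.3f}")  # ~0.971
```

A general “best model overall” chart tells you nothing about this kind of character-level fidelity, which is the whole game in OCR.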

Researching: DeepResearch Bench and GAIA

DeepResearch Bench’s official site says it consists of 100 PhD-level research tasks spanning 22 distinct fields, built with 100+ domain experts. GAIA’s official leaderboard says it is a benchmark for general AI assistants that require reasoning, multimodality, web browsing, and tool use, with 450 questions and three difficulty levels. If what you mean by “researching” is “can it browse, synthesize, and deliver a useful report,” these are more on-point than a generic chat leaderboard. (DeepResearch Bench)

What the internet consensus looks like

The recurring pattern in public discussions is this:

  • No single leaderboard captures everything.
  • People mix multiple sites.
  • They often use one broad site, one freshness-oriented benchmark, and one specialist benchmark.

In the recent “most reliable benchmarking site” discussion, one commenter explicitly said no single benchmark captures everything and that different leaderboards measure different aspects of capability. In the “what benchmarking sites do you use?” thread, people named a mix of UGI, LiveBench, Artificial Analysis, SWE-bench, EQBench, and others rather than rallying around only one site. (Reddit)

That is the background. The web does not seem to have one universally accepted favorite. It has a few recurring favorites for different jobs. (Reddit)

What I would recommend for your exact use case

If your real goal is:

“I want one page that stays updated and lets me sort models quickly.”

Use LLM Stats first. It best matches the UI you described and it is actively updated. (LLM Stats)

“I want one page that is more rigorous and easier to trust.”

Use Artificial Analysis first. It has clearer methodology and a stronger “independent analysis” framing. (Artificial Analysis)

“I want to know what real people prefer in side-by-side use.”

Use LM Arena. (Aurea)

“I want benchmark signal that is fresher and harder to game.”

Use LiveBench. (LiveBench)

“I care specifically about OCR.”

Use OCRBench v2 before trusting any overall ranking. (99Franklin)

“I care specifically about research agents.”

Use DeepResearch Bench and GAIA before trusting any overall ranking. (DeepResearch Bench)

My practical ranking

For a normal person trying to keep up without drowning:

  1. LLM Stats for daily browsing and sortable comparisons. (LLM Stats)
  2. Artificial Analysis for more serious comparison and methodology. (Artificial Analysis)
  3. LM Arena for human-preference sanity checks. (Aurea)
  4. LiveBench when you care about benchmark freshness. (LiveBench)
  5. OCRBench v2 or DeepResearch Bench / GAIA when the task is specialized. (99Franklin)

Bottom line

If you want the cleanest answer to “what are people’s favorite websites for this,” the recurring favorites are:

  • Artificial Analysis
  • LLM Stats
  • LM Arena
  • LiveBench

And for your specific examples:

  • OCRBench v2 for OCR
  • DeepResearch Bench and GAIA for research-style agent work (Reddit)

If I had to guess the site from the video, I would bet on LLM Stats first. (LLM Stats)
