Kamil Kwapisz

Tech founder, developer, AI enthusiast

6 min read

Which AI model is best? Why it’s the wrong question to ask

A few years ago, developers were asking about tech stacks and arguing which framework was the best. Now we’ve swapped those questions for debates about AI models.

But does this even make sense? Is “which model is the best” actually a good question?

As always, the best answer is “it depends.” Of course, it’s pretty obvious by now that some models perform better at different tasks. GPT-X.X-Codex models are optimized for agentic coding and will outperform GPTs without the “Codex” suffix as coding assistants (at least by design).

However, when reading X, LinkedIn, or Reddit discussions on differences between models, I’ve noticed that some people say Opus 4.6 is the best for coding, while others argue that GPT-5.3-Codex outperforms it.

Why is it hard to compare LLMs?

And to be honest, it’s incredibly hard to compare LLMs. Why is that?

Non-deterministic nature

LLMs are inherently non-deterministic and operate on statistics. It’s in their nature that, out of every X questions asked, some answers will be hallucinations or something that doesn’t make sense. So to test a model properly, you need to run it at a larger scale so you can filter out those outliers (nonsense answers), assuming you’re able to do so.

The output is generated text, so it’s hard to verify, compare, or rate.

Even reproducing tests won’t work well enough here. If the question isn’t closed (True/False questions, ABCD answers, etc.), it’s really difficult to compare and rate the results fully objectively, especially if the task is complex.
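
To make that concrete, here’s a minimal sketch of what “testing at a bigger scale” can look like for a closed question: ask the same thing many times and count how often the answers agree. It assumes an OpenAI-compatible Python client with an API key configured; the model name and the prompt are just placeholders.

```python
# A minimal sketch: ask the same closed question many times and count how often
# the answers agree. Assumes an OpenAI-compatible Python client and API key;
# the model name and prompt are placeholders.
from collections import Counter
from openai import OpenAI

client = OpenAI()
N_RUNS = 20
PROMPT = "Is Python dynamically typed? Answer with exactly one word: True or False."

answers = []
for _ in range(N_RUNS):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT}],
        temperature=1.0,
    )
    answers.append(response.choices[0].message.content.strip())

print(Counter(answers))  # e.g. Counter({'True': 19, 'False': 1}) - outliers only show up at scale
```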

Of course, you can create a task like “Implement algorithm XYZ, focus on performance” and measure the performance of each implementation. However, it’s never only about performance. Code should also be easy to read, easy to maintain, and so on. Even though you specified that you want to focus on performance, keep in mind that LLMs have extensive knowledge built in, and during their thinking process they’ll sometimes decide to take care of not only performance but also readability.
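
Performance is the one part you can measure mechanically, though. As a rough illustration, the sketch below times two implementations of the same task with Python’s timeit; sort_v1 and sort_v2 are hypothetical stand-ins for whatever code two different models produced.

```python
# A rough sketch of timing two model-generated implementations of the same task.
# sort_v1 and sort_v2 are hypothetical stand-ins for code produced by two models.
import random
import timeit

def sort_v1(data):  # e.g. what model A produced
    return sorted(data)

def sort_v2(data):  # e.g. what model B produced
    result = list(data)
    result.sort()
    return result

data = [random.random() for _ in range(10_000)]

for name, fn in [("model A", sort_v1), ("model B", sort_v2)]:
    seconds = timeit.timeit(lambda: fn(data), number=100)
    print(f"{name}: {seconds:.3f}s for 100 runs")

# This captures runtime only - readability and maintainability still need a human (or a rubric).
```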

“External” influence

Most AI power users are using LLMs through some kind of application:

  • online chat interface
  • coding agent app
  • other wrappers

And all of those apps can have sets of custom or predefined instructions, system prompts, or different ways of combining messages together.

You can’t compare how Claude 4.6 Opus will work with your freshly installed Claude Code on a new project to Claude 4.6 Opus running on a project with a well-defined AGENTS.md/CLAUDE.md file, system prompts, and connected MCPs.
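
Here’s a small sketch of what that “external influence” looks like in practice: the same user request sent once bare and once behind an app-style system prompt. It assumes an OpenAI-compatible client; the model name and the instructions are illustrative, not any real app’s actual system prompt.

```python
# A minimal sketch of why "the same model" isn't the same across apps: each wrapper
# prepends its own system prompt or project instructions. Assumes an OpenAI-compatible
# client; the model name and the instructions below are illustrative, not any real
# app's actual system prompt.
from openai import OpenAI

client = OpenAI()
USER_PROMPT = "Add input validation to this function: def divide(a, b): return a / b"

bare = [{"role": "user", "content": USER_PROMPT}]

wrapped = [
    {"role": "system", "content": (
        "You are a coding agent. Follow the project's CLAUDE.md: prefer explicit "
        "exceptions, add type hints, keep functions small."
    )},
    {"role": "user", "content": USER_PROMPT},
]

for label, messages in [("bare model", bare), ("model behind an app", wrapped)]:
    response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    print(f"--- {label} ---")
    print(response.choices[0].message.content)
```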

You should always cooperate with the model, not fight with it

I’m coding now with Claude Code using Sonnet 4.5 / Opus 4.6. If the AI doesn’t deliver satisfactory results, I’ll try:

  • changing the prompt
  • giving more details
  • shifting perspective

I won’t instantly jump to another model believing that I did everything perfectly and it’s only the LLM’s fault. With that mindset, I’d quickly run out of models.

And if you’ve been working with a specific setup for a long time, you can’t measure the difference; you can only feel it. You’ll never be sure whether the model got worse or you simply gave it tasks that were too hard.

How to compare LLMs?

So if subjective experience is unreliable and casual testing doesn’t cut it, how do we actually compare models?

There are a few approaches that work. Let me break them down.

Benchmarks - the “official” way

The AI industry has standardized tests for LLMs. Think of them as exam scores.

The most important ones to know:

  • MMLU-Pro for general knowledge (a harder variant of MMLU with 10-option questions)
  • SWE-bench Verified for coding (models solve real GitHub issues from actual repos)
  • MATH/AIME for mathematical reasoning (high school to olympiad level)
  • ARC-AGI for abstract reasoning (still brutally hard for every model out there)

Some models have likely “seen” benchmark data during training, which inflates scores. A model can ace standardized tests while struggling with your specific real-world tasks. And once every model hits 95%+ on a benchmark, the test becomes meaningless for comparison.

Chatbot Arena - the crowd-sourced approach

This one deserves its own section because it’s arguably the most trustworthy comparison method we have.

Chatbot Arena (arena.ai) works like this: you get responses from two anonymous models side by side, and you pick the better one. No brand bias. No preconceptions. Just raw output quality judged by thousands of real users.

The results are ranked using an Elo rating system (similar to chess ratings), built on hundreds of thousands of blind comparisons. It’s not perfect, but it’s the closest thing we have to an objective, human-validated ranking.
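
For the curious, the Elo update itself is simple. The sketch below applies the standard formula to a single blind comparison; the K-factor and starting rating are conventional chess-style values, not Chatbot Arena’s exact parameters.

```python
# A compact sketch of the Elo update behind this kind of pairwise ranking.
# The K-factor and starting rating are conventional chess-style values,
# not Chatbot Arena's exact parameters.
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32) -> tuple[float, float]:
    """Return both ratings after one blind comparison."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - exp_a))
    return new_a, new_b

# One vote: anonymous "model A" beats "model B", both starting at 1000.
print(update(1000, 1000, a_won=True))  # (1016.0, 984.0)
```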

Testing on your own use cases - the practical way

This is what I’d actually recommend, especially if you’re choosing a model for production work or daily use.

Create a set of 20-30 prompts that represent your real tasks. For me, that would be Python scripting, content drafting, and code review. For you, it might be something completely different. Run them through a few models and compare the outputs.

Yes, it takes time. But it will tell you more about which model fits YOUR workflow than any leaderboard ever could.
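
If you want something more systematic than eyeballing, a tiny harness is enough to start with. The sketch below runs the same prompt set through a couple of models via an OpenAI-compatible client and saves the outputs for side-by-side review; the model names and prompts are placeholders for your own.

```python
# A minimal sketch of a personal eval harness: run the same prompt set through a few
# models and save the outputs for side-by-side review. Assumes an OpenAI-compatible
# client; model names and prompts are placeholders for your own.
import json
from openai import OpenAI

client = OpenAI()

MODELS = ["model-a", "model-b"]  # whichever models you're shortlisting
PROMPTS = [
    "Write a Python script that deduplicates lines in a large CSV file.",
    "Review this function for bugs: def mean(xs): return sum(xs) / len(xs)",
    # ... 20-30 prompts that reflect your real work
]

results = []
for model in MODELS:
    for prompt in PROMPTS:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        results.append({
            "model": model,
            "prompt": prompt,
            "output": response.choices[0].message.content,
        })

with open("eval_results.json", "w") as f:
    json.dump(results, f, indent=2)
```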

Where to track the results

If you want to stay up to date with model comparisons:

  • arena.ai for the live Chatbot Arena rankings
  • artificialanalysis.ai for side-by-side comparisons of pricing, latency, and quality
  • The Open LLM Leaderboard on Hugging Face for open-source model rankings
  • openrouter.ai for comparing model pricing across different providers

The bottom line

Use benchmarks and leaderboards as a filter to narrow down your options. Then test the shortlisted models on your actual work. That combination of data-driven filtering and hands-on testing is the most reliable way to find the right model for your needs.

Because at the end of the day, the “best” model is the one that works best for what you’re building.
