By Giorgio Sbriglia

The Great AI Race: Which LLM Model Reigns Supreme?


What is the best AI model in 2024?

Intro: The AI Leaderboard Before the Storm: Best Models by Category


The race to develop the most powerful and efficient large language models (LLMs) is a dynamic and ever-evolving competition. This analysis provides a snapshot of the current leaders across key categories – quality, speed, and price – offering valuable insights for developers, researchers, and businesses seeking to harness the power of AI. It's crucial to remember that this leaderboard represents a "pre-Blackwell" world, as Nvidia's upcoming Blackwell architecture promises to revolutionize AI capabilities.

This analysis focuses on publicly available benchmarks of leading LLMs from major players like Google, OpenAI, and Anthropic, as reported on artificialanalysis.ai, one of the many LLM leaderboards available online. API providers such as Groq, SambaNova, Cerebras, and DeepSeek, who offer specialized inference services, will be covered in a dedicated post. We also acknowledge the absence of xAI's Grok from this analysis, as its capabilities remain to be fully evaluated and benchmarked.


The Most Accurate AI LLM Model in 2024

The field of AI is exploding, with new models and advancements emerging constantly. It can be tough to keep track, let alone determine which model is truly the "best." This graphic, based on the Artificial Analysis Quality Index, gives us a snapshot of the current leaders.


The AI landscape is constantly evolving, and the latest rankings hint at a possible shake-up at the top. While Anthropic's Claude 3.5 previously held a strong position, even outperforming GPT-4o, OpenAI has surged back into the lead with its new o1-preview and o1-mini models, which now top the quality rankings. This advantage is largely due to o1’s enhanced “reasoning capabilities.” Indeed, other top models lacking this advanced reasoning—such as Claude 3.5 Sonnet and Gemini 1.5—are rated similarly at 80.


However, it’s worth noting that this edge in quality might come at a cost, as achieving superior reasoning capabilities often requires greater resources. This may explain why open-source models like Llama 3.1 and Mistral Large 2, which prioritize accessibility and flexibility, are falling behind in quality scores. In this fast-paced race, the balance between quality and cost-efficiency continues to shape the landscape.



What's the Artificial Analysis Quality Index (AAQI)?

The Quality Index is a streamlined metric designed to help users quickly assess and compare the relative performance of various AI models. By integrating key evaluation scores—specifically the Chatbot Arena Elo Score, MMLU (Massive Multitask Language Understanding), and MT Bench—the Quality Index provides a single, normalized score that reflects a model's overall effectiveness. Each of these components measures a distinct aspect of model performance:

  • The Elo Score reflects user interaction quality,

  • MMLU evaluates multitask learning capabilities across diverse fields,

  • and MT Bench assesses model aptitude in complex language tasks.


These values are standardized and combined to create a unified Quality Index score, making it easier to compare models at a glance. The Quality Index should be used for relative comparisons rather than as an absolute measure; it offers a high-level view rather than detailed, model-specific insight, and it’s best to avoid quoting the scores as standalone indicators of model quality.
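As an illustration of how such a composite index can be built, here is a minimal sketch in Python. The normalization ranges, equal weighting, and example scores are assumptions made for illustration only; artificialanalysis.ai does not publish its exact formula in this post.

```python
# Minimal sketch of a composite "quality index": min-max normalize each
# benchmark, then average. Ranges, weights, and example scores below are
# illustrative assumptions, not the actual AAQI methodology.

def normalize(value: float, lo: float, hi: float) -> float:
    """Scale a raw benchmark score into the 0-1 range."""
    return (value - lo) / (hi - lo)

def quality_index(elo: float, mmlu: float, mt_bench: float) -> float:
    """Combine Chatbot Arena Elo, MMLU, and MT-Bench into a single 0-100 score."""
    components = [
        normalize(elo, 1000, 1400),      # assumed Elo spread across current models
        normalize(mmlu, 0.0, 1.0),       # MMLU accuracy as a fraction
        normalize(mt_bench, 0.0, 10.0),  # MT-Bench is scored out of 10
    ]
    return 100 * sum(components) / len(components)

# Example with made-up inputs: compare two hypothetical models at a glance.
print(round(quality_index(elo=1350, mmlu=0.88, mt_bench=9.3), 1))
print(round(quality_index(elo=1250, mmlu=0.80, mt_bench=8.5), 1))
```

The point is only that heterogeneous scores are rescaled and blended, which is why the resulting number is meaningful for relative comparisons but not as an absolute measure.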



The Fastest AI LLM Model in 2024

Need for Speed: Crowning the Fastest AI Model

In the ever-evolving world of large language models (LLMs), speed is a critical factor. Faster response times mean smoother conversations, quicker task completion, and ultimately, a more satisfying user experience. But which model reigns supreme in this race for speed?   


This latest benchmark, measuring output tokens per second, provides some illuminating answers:

AI Model Speed Comparison – Google's Gemini 1.5 Flash leads at 260 output tokens per second, followed by GPT-4o mini at 102 and GPT-4o at 83.

Gemini 1.5 Flash Takes the Lead

With a blazing speed of 260 tokens per second, Google's Gemini 1.5 Flash claims the top spot. This specialized version of Gemini prioritizes rapid response times, making it ideal for applications that demand real-time interaction, such as chatbots and interactive storytelling.
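As a side note, "output tokens per second" is simply the number of generated tokens divided by the time spent generating them. The sketch below shows one rough way to measure it against any streaming chat API; `stream_completion` is a hypothetical stand-in for whichever client library you use, and a serious benchmark would also separate time-to-first-token from steady-state throughput.

```python
# Rough throughput measurement sketch. `stream_completion` is a hypothetical
# placeholder for your provider's streaming API; it should yield text chunks.
import time

def measure_tokens_per_second(stream_completion, prompt: str) -> float:
    """Return approximate output tokens per second for one streamed response."""
    start = time.perf_counter()
    text = ""
    for chunk in stream_completion(prompt):  # iterate over streamed text chunks
        text += chunk
    elapsed = time.perf_counter() - start

    # Crude token estimate: ~4 characters per token for English text.
    # Use the provider's usage metadata or a real tokenizer for accuracy.
    approx_tokens = len(text) / 4
    return approx_tokens / elapsed

# Example with a fake "model" that just emits words slowly:
def fake_stream(prompt):
    for word in ("streaming " * 50).split():
        time.sleep(0.01)
        yield word + " "

print(f"{measure_tokens_per_second(fake_stream, 'hello'):.1f} tokens/s (approx)")
```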

o1-preview and o1-mini are indeed paying the price for that extra quality. This is typically experienced when using the o1 models: we are served with messages like the one below. (I asked for a cooking recipe but was instead encouraged to wait in the quest of advancing AI!) Interestingly, GPT-4o, which ranks fourth in quality, ensures that OpenAI secures a podium finish in the fastest-LLM competition as well.

Advancing AI – loading text shown by ChatGPT while o1-preview is "Reasoning"

One final remark: how a model is deployed significantly affects its speed. There is a whole chapter to open on this topic, but it is better covered in a dedicated post on API service providers, where different providers compete on serving the same LLMs.

The Most Affordable AI LLM Model in 2024

The Price of Language: Finding the Most Affordable AI

Large Language Models (LLMs) are revolutionizing how we interact with technology, but their advanced capabilities often come at a significant cost.  For developers and businesses seeking to integrate these powerful tools, finding the most cost-effective option is crucial.   


This chart breaks down the price per 1 million tokens for some of the leading LLMs, shedding light on the most affordable contenders:

AI Model Cost Comparison – Google's Gemini 1.5 Flash is the most affordable at $0.1 per 1 million tokens, followed by GPT-4o mini at $0.3, while OpenAI's o1-preview is the most expensive at $26.3.

Gemini 1.5 Flash Shines Again:

Not only is Gemini 1.5 Flash the fastest LLM (as seen in the previous speed benchmark), but it also boasts the lowest cost at just $0.1 per 1 million tokens. This makes it an incredibly attractive option for applications where both speed and affordability are paramount.


Open-Source Models Offer Value:

Open-source models like Mistral Large 2 and Llama 3.1 70B also demonstrate strong affordability, priced at $0.8 and $3 per 1 million tokens, respectively. This accessibility empowers developers and researchers with limited budgets to experiment with and leverage the power of LLMs. Note that these prices depend on the infrastructure provider rather than on the model itself, so they will be covered in more detail in the API providers post.


The Cost of Quality:

While OpenAI's o1 models lead the pack in terms of quality, they come with a premium price tag. o1-preview, in particular, stands out as the most expensive model at $26.3 per 1 million tokens. This highlights the trade-off between performance and affordability, where cutting-edge capabilities often come at a higher cost.
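To put these per-token prices into perspective, here is a back-of-the-envelope sketch that estimates a monthly bill for a hypothetical workload at the blended prices from the chart. The traffic figures and the single blended price per model are assumptions; real pricing usually bills input and output tokens at different rates.

```python
# Back-of-the-envelope cost comparison, assuming one blended price per model
# (taken from the chart above) and a made-up traffic profile.

PRICE_PER_MILLION_TOKENS = {
    "Gemini 1.5 Flash": 0.1,
    "GPT-4o mini": 0.3,
    "Mistral Large 2": 0.8,
    "Llama 3.1 70B": 3.0,
    "o1-preview": 26.3,
}

def monthly_cost(requests_per_day: int, tokens_per_request: int, price_per_m: float) -> float:
    """Estimate a 30-day bill for a given traffic profile."""
    tokens_per_month = requests_per_day * tokens_per_request * 30
    return tokens_per_month / 1_000_000 * price_per_m

# Hypothetical workload: 10,000 requests per day at ~1,500 tokens each.
for model, price in PRICE_PER_MILLION_TOKENS.items():
    print(f"{model:18s} ~${monthly_cost(10_000, 1_500, price):,.0f}/month")
```

With these assumed numbers the spread is dramatic: the same workload costs roughly $45 per month on Gemini 1.5 Flash versus roughly $11,800 on o1-preview.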


A Biased Note

AI Model Coding Performance on the HumanEval Benchmark – OpenAI's o1-preview leads with 95% accuracy, followed closely by o1-mini and Claude 3.5 Sonnet at 93% and GPT-4o at 90%, while Llama 3.7B sits lower at 64%.

Anthropic's Claude: The Developer's Darling (Despite Not Winning Any Races)

Okay, let's be honest, Anthropic's Claude might not be snatching gold medals in any of our specific categories just yet. But let's give credit where credit is due!

Claude consistently ranks well across the board, showcasing a solid balance of quality, speed, and cost. However, what truly catapulted Claude into the hearts of developers is its coding ability, better reflected in Aider's code editing benchmark, together with its innovative "artifact" functionality.


The Aider code editing benchmark measures an LLM's ability to write and fix Python code. It includes 133 exercises in which models must read instructions, edit code, and pass unit tests. If the first attempt fails, the model receives the test error output for a second try. Success is based on the percentage of tasks completed correctly, highlighting each model's coding and debugging skills.
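To make the two-attempt protocol concrete, here is a simplified sketch of that evaluation loop. The `ask_model` and `run_unit_tests` functions are hypothetical placeholders; Aider's real harness additionally handles file editing, edit formats such as diff, and the Exercism exercise layout, all of which are omitted here.

```python
# Simplified sketch of a two-attempt code-editing evaluation, in the spirit of
# the Aider benchmark described above. `ask_model` and `run_unit_tests` are
# hypothetical placeholders, not Aider's actual API.

def evaluate(exercises, ask_model, run_unit_tests) -> float:
    """Return the fraction of exercises solved within two attempts."""
    solved = 0
    for exercise in exercises:
        # First attempt: the model reads the instructions and proposes an edit.
        code = ask_model(exercise.instructions, exercise.starting_code)
        passed, test_output = run_unit_tests(exercise, code)

        if not passed:
            # Second attempt: the model sees the failing test output and retries.
            code = ask_model(exercise.instructions, code, feedback=test_output)
            passed, _ = run_unit_tests(exercise, code)

        solved += passed
    return solved / len(exercises)
```

The headline number in the table below corresponds to this "percent completed correctly" figure.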


| Model | Percent completed correctly | Percent using correct edit format | Command | Edit format |
| --- | --- | --- | --- | --- |
| claude-3-5-sonnet-20241022 | 84.2% | 99.2% | aider --model anthropic/claude-3-5-sonnet-20241022 | diff |
| o1-preview | 79.7% | 93.2% | aider --model o1-preview | diff |
| claude-3.5-sonnet-20240620 | 77.4% | 99.2% | aider --model claude-3.5-sonnet-20240620 | diff |
| claude-3-5-haiku-20241022 | 75.2% | 95.5% | aider --model anthropic/claude-3-5-haiku-20241022 | diff |


The "artifact" feature allows Claude to generate and manage complex outputs like code snippets, diagrams, and interactive elements in a dedicated space, separate from the main conversation. This has proven invaluable for developers, enabling seamless iteration, collaboration, and integration with existing workflows. It seems that sometimes, winning users' affection relies on more than just raw performance metrics. Who knew user experience could be such a game-changer? (We did, but let's let Anthropic have this moment).


A Battle of Titans

AI-generated image. Prompt: a battle of tech giants: colossal robots fighting one another in between New York skyscrapers.

The race for AI dominance is shaping up as a battle among industry titans, each fueled by the resources and ambitions of Nasdaq leaders. Companies like Google, OpenAI (with significant backing from Microsoft), and Anthropic (with substantial investment from Amazon) are leading the way, harnessing the vast computational power of robust cloud infrastructure to develop and refine their models. Microsoft, Amazon, and Google dominate the global cloud infrastructure on which these cutting-edge AI models run, while Meta operates a massive data-center footprint of its own. With Google's formidable TPU-powered cloud, Amazon's AWS being a go-to for scalable AI applications, Microsoft's Azure backing OpenAI, and Meta's data powerhouse capabilities, each player is leveraging its infrastructure expertise to push the boundaries of AI.


Even seemingly "independent" players like Meta, which has developed its own Llama models, are still tech giants. This concentration of cloud and AI power within a few major corporations raises crucial questions about the future of AI, its accessibility, and the potential for monopolization. Could the dominance of such a small group of players lead to barriers for new entrants or restrict accessibility to these powerful tools?


Amidst these giants, a smaller contender, Mistral AI, is making waves with an approach that could disrupt the landscape. As a French startup, Mistral focuses on efficient design and open-weight models, aiming to deliver powerful AI capabilities with reduced computational costs. Their strategy offers a glimpse of how AI could be made more accessible to a wider range of users and applications. Mistral’s European roots further add diversity to this global competition, providing a refreshing alternative to the US-based giants dominating the field.


The Looming Shadow of Blackwell: A Snapshot of AI's Pre-Revolution Landscape


Nvidia's Blackwell is poised to redefine the AI landscape, potentially reshuffling the leaderboard with its unparalleled capabilities. It's important to remember that the models analyzed here were all trained using the Hopper infrastructure (H100, H200). However, Nvidia's new Blackwell platform promises a significant leap forward in AI performance.

With its groundbreaking architecture and advancements like the GB200 NVL72 system, Blackwell offers:

  • Massive Performance Boost:  Up to a 30x performance increase compared to the same number of H100 GPUs for LLM inference workloads.

  • Enhanced Efficiency:  Reduced cost and energy consumption by up to 25x.

  • Unprecedented Scale:  Support for real-time trillion-parameter models, opening doors to new possibilities in AI capabilities.


This means that future iterations of LLMs, running on Blackwell, will achieve significantly higher speeds, lower costs, and surpass the quality benchmarks set by current leaders. Therefore, while this analysis provides a valuable snapshot of the current AI landscape, it's crucial to recognize that it represents a "pre-Blackwell" era. The arrival of Blackwell could usher in a new wave of innovation and competition, potentially reshaping the hierarchy and redefining the boundaries of AI capabilities.

In essence, consider this the 2024 world ranking for AI, pre-Blackwell phase. The game is about to change, and the true potential of these models will be unlocked once they harness the power of Blackwell.

In the long term, AI dominance will be driven by infrastructure and capital, as in any industrial revolution. Back in the day, roads, bridges, railroads, factories, and steam generation were the backbone of an industrial power; today, long-term supremacy will be driven by fiber networks, AI data centers and supercomputers, and power generation.


