
Meta adopts DeepSeek’s efficiency playbook for Llama 4

  • Staff Writer
  • Apr 7
  • 3 min read

Meta AI

Facebook parent Meta has adopted the same technique that Chinese AI startup DeepSeek used to make its new family of AI models, Llama 4, more efficient. Meta announced Saturday that Llama 4 is its first model family to use a mixture-of-experts (MoE) architecture.


Faced with US trade sanctions that cut off access to the most advanced AI GPUs, DeepSeek used MoE to build its V3 and R1 models at a fraction of the cost of some prominent large language models (LLMs). Until then, spending billions on the most powerful GPUs was widely seen as the quickest path to AI breakthroughs.


Inference, the stage at which users interact with a trained model, consumes a large share of an AI system's computational power.

Models built on an MoE architecture divide their total parameters across a group of specialized "expert" subnetworks. Instead of activating all of the parameters for every input, the model uses a gating network to determine which experts are best suited to handle a specific query and activates only that small subset of parameters, as the sketch below illustrates.
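Here is a minimal sketch of that gating mechanism in PyTorch. The `TinyMoE` class, the layer sizes, and the top-2 routing are illustrative assumptions for this article, not Meta's or DeepSeek's actual implementation, which involves far larger experts and more sophisticated routing and load balancing.

```python
# A toy mixture-of-experts layer: a gate scores all experts per token,
# but only the top-k experts actually run. Sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        # Each "expert" is a small feed-forward network.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])
        # The gate assigns a score to every expert for every token.
        self.gate = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):                        # x: (n_tokens, d_model)
        scores = self.gate(x)                    # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalize over chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Find the tokens that routed to expert e; others skip it entirely.
            rows, slots = (idx == e).nonzero(as_tuple=True)
            if rows.numel() == 0:
                continue
            out[rows] += weights[rows, slots].unsqueeze(-1) * expert(x[rows])
        return out

moe = TinyMoE()
tokens = torch.randn(10, 64)
print(moe(tokens).shape)  # torch.Size([10, 64])
```

Each token passes through only two of the eight experts here; the other experts' parameters sit idle for that token, which is exactly where the inference savings described below come from.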


This sharply reduces the compute needed per query and improves efficiency during inference, and it is one of the reasons DeepSeek's cost of building and running its models has been so low.


According to Meta, Llama 4's new general assistant and chat model, Maverick, has 400 billion total parameters, of which only 17 billion are active at any time, spread across 128 experts. Similarly, Llama 4's new document summarization model, Scout, has 109 billion total parameters, with only 17 billion active across 16 experts. The third and largest model, Behemoth, which Meta says is still in training, has 288 billion active parameters across 16 experts out of roughly 2 trillion in total.
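As a back-of-the-envelope check on those figures (taken straight from the paragraph above, not an official Meta calculation), the fraction of parameters active per token works out to roughly 4 percent for Maverick and around 15 percent for Scout and Behemoth:

```python
# Active-parameter fractions implied by Meta's published figures.
models = {
    "Maverick": (17e9, 400e9),   # (active, total) parameters
    "Scout":    (17e9, 109e9),
    "Behemoth": (288e9, 2e12),
}
for name, (active, total) in models.items():
    print(f"{name}: {active / total:.1%} of parameters active per token")
# Maverick: 4.2%, Scout: 15.6%, Behemoth: 14.4%
```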


Meta claims that Scout can run on a single Nvidia H100 GPU, while Maverick requires an Nvidia H100 DGX system.
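A rough arithmetic sketch hints at why the single-GPU claim is plausible, assuming Scout's weights are quantized to 4 bits (Meta's announcement mentions Int4 quantization for this configuration) and ignoring runtime overheads such as activations and the KV cache:

```python
# Approximate weight-memory footprint of Scout's 109B parameters
# at different precisions, versus an H100's 80 GB of memory.
scout_params = 109e9
bytes_per_param = {"fp16": 2, "int8": 1, "int4": 0.5}
for fmt, b in bytes_per_param.items():
    print(f"{fmt}: {scout_params * b / 1e9:.0f} GB of weights")
# fp16: 218 GB and int8: 109 GB exceed 80 GB; int4: ~55 GB fits.
```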


DeepSeek's R1 was reportedly built on top of its V3 base model, which was trained for about $6 million using roughly 2,000 Nvidia H800 GPUs, a fraction of the money and hardware that went into OpenAI's GPT-4. Yet R1's benchmark performance in several areas, including coding and math, surpassed some of the most advanced models, including Llama 3.1, Anthropic's Claude 3.5, and GPT-4o.


Within days of its mid-January release, the DeepSeek app climbed to the top of the US app charts, with millions of downloads on both the Google Play Store and Apple's App Store. Its success also sent Nvidia's stock plummeting, erasing over $500 billion of the chipmaker's market capitalization, as investors worried that the industry's substantial investments in costly AI hardware might be unnecessary.


Most popular AI models have been built using vast amounts of computing power, powerful GPUs, and specialized infrastructure. In 2023, OpenAI CEO Sam Altman said at an MIT event that training GPT-4 cost more than $100 million; the company's latest models are estimated to have cost far more.


Though Nvidia hasn't made public the price of the H800, the China-market variant of its flagship H100, the chip is estimated to cost around $30,000. The new Blackwell chips are expected to cost even more.


According to the Arc Prize Foundation, which maintains and administers ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence), a benchmark that assesses an AI model's abstract reasoning and problem-solving ability, OpenAI's o3 model costs around $30,000 to solve a single ARC-AGI problem, a tenfold jump from the foundation's earlier estimate of $3,000.


To make up for its heavy spending on AI models, OpenAI is reportedly planning to charge up to $20,000 per month for specialized AI agents that can perform tasks such as browsing the web, booking tickets, and ordering food on behalf of users.


In January, OpenAI announced a deal with SoftBank and Oracle to invest up to $500 billion in AI infrastructure in the US.


Scout and Maverick are available to download on Llama.com and Hugging Face, while Meta AI, the AI chatbot built into WhatsApp, Instagram, and Messenger, has been updated to incorporate the new models.



Image credit: WhatsApp screenshot
