
Learn The Fundamentals Of Becoming An AI Engineer On Scrimba: https://scrimba.com/the-ai-engineer-path-c02v?via=developersdigest

Meta just released LLAMA 4, and the specs are truly groundbreaking! This video breaks down everything you need to know about these revolutionary new models:
• LLAMA 4 Scout: 17B active parameters with an unprecedented 10M token context window (equivalent to 7,500+ pages of text!)
• LLAMA 4 Maverick: 400B parameters with native multimodal capabilities and 1M token context
• LLAMA 4 Behemoth: A staggering 2T parameters (still training) that outperforms GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Pro on key benchmarks

Learn about the game-changing Mixture of Experts architecture that makes these models more efficient and affordable. I'll show you how to access these models, their impressive benchmark results, and why LLAMA 4 Scout's 10M context window opens up incredible new possibilities for document processing, code analysis, and personalization. Plus, get the latest on Groq integration and competitive pricing that makes these frontier models accessible to developers and businesses.
TIMESTAMPS:
00:00 Introduction to LLAMA 4
00:26 Overview of LLAMA 4 Models
00:38 LLAMA 4 Scout: Industry-Leading Context Window
01:31 LLAMA 4 Maverick: Multimodal Capabilities
01:58 LLAMA 4 Behemoth: The 2 Trillion Parameter Giant
02:23 Downloading and Accessing LLAMA 4 Models
02:33 Mixture of Experts Architecture
03:50 Performance and Benchmarks
05:07 LLAMA 4 Scout: Practical Applications
07:14 Future of LLAMA 4 Models
07:30 Access and Pricing
08:38 Conclusion and Call to Action

LINKS:
Official LLAMA Website: https://www.llama.com/
Try LLAMA 4 Scout on Groq: https://console.groq.com/playground?model=meta-llama/llama-4-scout-17b-16e-instruct
Download LLAMA 4: https://www.llama.com/llama-downloads/?utm_source=llama-home-hero&utm_medium=llama-referral&utm_campaign=llama-utm&utm_offering=llama-downloads&utm_product=llama
LM Arena Benchmarks: https://lmarena.ai/
Meta AI Announcement: https://x.com/AIatMeta/status/1908598456144531660
---
type: transcript
date: 2025-04-05
youtube_id: TIZmjmsBh20
---

# Transcript: LLAMA 4 in 9 Minutes

Llama 4 is now here and the numbers are absolutely staggering. We have 10 million tokens of context for Llama 4 Scout and nearly 2 trillion parameters for Llama 4 Behemoth. First up, just to give you a sense, Llama 4 Maverick, which is their 400 billion parameter model, is ranked second, just shy of Gemini 2.5 Pro Experimental and above GPT-4o, Grok 3 Preview, as well as GPT-4.5. There are three new models: Llama 4 Behemoth, Llama 4 Maverick, as well as Llama 4 Scout. We have a two trillion, a 400 billion, as well as a 109 billion parameter model. Now, in terms of the context window, Llama 4 Scout has a 10 million token context window, whereas Llama 4 Maverick has native multimodal capabilities with a million token context length. Before this announcement, the only frontier models able to crack a million tokens of context were the Gemini series of models: Gemini 2.5 Pro Experimental as well as the Gemini Flash models. In terms of Llama 4 Scout, their smallest model, it has 17 billion active parameters with 16 experts. It's the best multimodal model in the world in its class, more powerful than all previous generations of models while fitting on a single H100 GPU. They mention that Llama 4 Scout offers an industry-leading context window of 10 million tokens and delivers better results than Gemma 3 and Gemini 2.0 Flash-Lite across a broad range of widely reported benchmarks. Now, in terms of Llama 4 Maverick, it's a 17 billion active parameter model with 128 experts, and it's also the best multimodal model in its class. It beats GPT-4o as well as Gemini 2.0 Flash across a number of benchmarks. In terms of reasoning as well as coding, they mention that Llama 4 Maverick is on par with DeepSeek V3.
They mention that Scout and Maverick are their best models yet thanks to distillation from Llama 4 Behemoth, which is the model with two trillion total parameters and a staggering 288 billion active parameters with 16 experts. Now, one thing to note: they mention that Llama 4 Behemoth is still training, so they're excited to share more details later. This isn't even the final model, so there will likely be subsequent updates, similar to how they released the Llama 3 series of models, where there was 3.1, 3.2, and then 3.3. To download these models, you can head to llama.com/llama-downloads, fill out your information, and download the Scout or Maverick model directly from there. Now, let's take a moment to understand why the mixture-of-experts architecture is particularly game-changing. In traditional LLMs, every token activates all parameters in the model. But with Llama 4's mixture-of-experts approach, each token activates only the parameters it needs. For example, take Llama 4 Maverick, which has 17 billion active parameters but 400 billion total parameters. When processing a token, it goes to the shared expert and just one of the 128 specialized experts. In other words, you're getting the intelligence of a much larger model at a fraction of the computational cost. It's like having 128 specialized consultants but only calling on the exact expert you need for each specific question, rather than asking for everyone's opinion every single time. Overall, this approach dramatically lowers cost as well as latency, and Llama 4 Maverick can even run on a single H100. That makes it much more accessible for developers and businesses to leverage this model. Another notable piece: they mention that this model was trained with overall 10 times more multilingual tokens than Llama 3, so this series of models should perform quite well even in languages outside of English.
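The routing idea described above can be sketched in a few lines of Python. This is a toy illustration only: the real Llama 4 router is a learned network and the experts are full feed-forward layers, whereas the hash-based `route` function here is a hypothetical stand-in just to show the "shared expert plus one of 128 routed experts per token" shape.

```python
NUM_EXPERTS = 128  # Llama 4 Maverick: 128 routed experts plus one shared expert

def route(token_id: int, num_experts: int = NUM_EXPERTS) -> int:
    """Pick exactly one routed expert for this token.
    (Hypothetical hash router; the real model uses a learned gating network.)"""
    return hash(token_id) % num_experts

def process_token(token_id: int) -> list:
    """Each token touches only the shared expert and ONE routed expert,
    so a fraction of the 400B total parameters is active per token."""
    expert = route(token_id)
    return ["shared_expert", f"expert_{expert}"]

# A token activates 2 experts, not all 129 of them
active = process_token(42)
```

This is why active parameters (17B) can be so much smaller than total parameters (400B): the unused experts sit idle for any given token.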
Now, to touch on Llama 4 Maverick in particular, across a number of benchmarks as well as price performance. In terms of pricing, this is an estimate with a 3:1 blended ratio of input to output tokens. Most notably, this model is going to be considerably cheaper than GPT-4o and slightly more expensive than Gemini 2.0 Flash. We do see better performance basically across the board on a number of different benchmarks. It's best-in-class for image reasoning as well as image understanding. In terms of coding as well as reasoning, it's slightly below the DeepSeek V3.1 model. Then for MMLU Pro, we see it's just shy of DeepSeek V3.1 as well. What is really impressive is GPQA Diamond, where it's best-in-class. You can really see that when you compare to GPT-4o: GPT-4o scores 53.6 on GPQA, while Maverick scores 69.8. We also see this model outperforming GPT-4o on multilingual capability. Now, in terms of long context, again, Maverick is the model with a million tokens of context, so it's comparable to Gemini 2.0 Flash, which also has a million tokens of context, whereas GPT-4o as well as DeepSeek V3.1 each only have 128,000 tokens of context. Now, in terms of Llama 4 Scout, this is the one that definitely caught a lot of people's attention, with the jump from 128,000 tokens of context for Llama 3 all the way up to an industry-leading 10 million tokens of context. They mention that this opens up a world of possibilities, including multi-document summarization, parsing extensive user activity, personalization tasks, and reasoning over vast code bases. Just to give you an idea, that's over 7,500 pages of text in a single prompt. You can imagine putting the works of Shakespeare, Lord of the Rings, as well as the entire Harry Potter series all within just one prompt.
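The 3:1 blended pricing mentioned above is just a weighted average of input and output prices. Here's a minimal sketch of that arithmetic, using the per-million-token Groq prices quoted later in the video (Scout: $0.11 in / $0.34 out; Maverick: $0.50 in / $0.77 out) as example inputs:

```python
def blended_price(input_price: float, output_price: float, ratio: int = 3) -> float:
    """Blended cost per million tokens, assuming `ratio` input tokens
    for every 1 output token (the 3:1 mix used in the pricing estimate)."""
    return (ratio * input_price + output_price) / (ratio + 1)

scout = blended_price(0.11, 0.34)     # (3*0.11 + 0.34) / 4 = $0.1675 per million tokens
maverick = blended_price(0.50, 0.77)  # (3*0.50 + 0.77) / 4 = $0.5675 per million tokens
```

Because most real workloads send far more input (documents, code, chat history) than they get back as output, weighting the input price 3:1 gives a more realistic per-token cost than either number alone.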
Also, this would be the equivalent of millions of lines of code within a single request to the Scout model. They mention that this highlights their long-term goal of supporting infinite context length. In terms of benchmarks, here is the needle-in-a-haystack benchmark, where we can see the respective results for Llama 4 Maverick, Scout, as well as Scout with video. That's one of the interesting things to remember with these models: they are natively multimodal. They mention that they trained these models on a wide variety of images as well as video frames in order to give them broad visual understanding, including of temporal activities and related images. Now, in terms of the results for Llama 4 Scout compared to the previous generation, as well as models like Gemma 3, Mistral 3.1 24B, and Gemini 2.0 Flash-Lite, we can see that almost across the board, for all of the benchmarks they mention, this model outperforms, with the exception of LiveCodeBench. Now, in terms of the two trillion parameter Behemoth model: it isn't available yet, but they are giving a glimpse of what it looks like. Here are the results when compared to Claude 3.7 Sonnet, Gemini 2.0 Pro, as well as GPT-4.5. And mind you, this model is still training, yet we can see that it outperforms across the board on all of the benchmarks we have results for. By the looks of it, Llama 4 Behemoth will be, like I mentioned earlier, among the best models in the world, especially in areas like coding as well as reasoning. Now, one thing to note with these models is that these are just the first open-source models of the Llama 4 collection. So I anticipate that over the year we're going to have a lot more Llama 4 releases, with all of these models improving, as well as that Llama 4 Behemoth release.
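The "7,500+ pages" figure for Scout's 10M-token window can be sanity-checked with some back-of-envelope arithmetic. The conversion factors here (~0.75 words per token, ~1,000 words per printed page) are rough heuristics, not official numbers:

```python
# Rough estimate of what a 10M-token context window holds in printed pages.
CONTEXT_TOKENS = 10_000_000
WORDS_PER_TOKEN = 0.75   # common rule of thumb for English text
WORDS_PER_PAGE = 1_000   # dense printed page, rough heuristic

pages = CONTEXT_TOKENS * WORDS_PER_TOKEN / WORDS_PER_PAGE
# 10,000,000 * 0.75 / 1,000 = 7,500 pages, matching the figure in the video
```

The same arithmetic with ~10 tokens per line of code lands around a million lines of code per request, which is where the "millions of lines" framing comes from.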
Now, in terms of being able to access the model, I think a lot of people are going to be excited that you can access both Scout and Maverick on Groq. And we can see how competitive the pricing is for both of those models: 11 cents per million tokens of input for Scout and only 34 cents per million tokens of output. And for Llama 4 Maverick, the model that is comparable to GPT-4o, the price is 50 cents per million tokens of input and 77 cents per million tokens of output. You're going to be able to access Scout at 460 tokens per second, and they mention that Llama 4 Maverick is coming today. Now, just to quickly demonstrate how fast Scout is on Groq, here is me asking for 10 paragraphs, and it's streaming out at over 500 tokens per second, even faster than the currently stated speed. Now, in terms of some other metrics, this is the number one open-source model, surpassing DeepSeek. It's tied at number one for hard prompts, coding, math, as well as creative writing on LM Arena, which is basically the platform that lets you send in a prompt and vote on your preferred response between two different options. Otherwise, I'll put the links to everything that I showed in the description of the video. But if you found this video useful, please comment, share, and subscribe. Otherwise, until the next
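Trying Scout on Groq yourself looks roughly like the sketch below. The model id comes from the Groq playground link in the description; the endpoint URL and request shape assume Groq's OpenAI-compatible chat completions API, so treat the details as assumptions and check Groq's docs before relying on them. The actual `requests.post` call is left commented out since it needs a real API key.

```python
# Assumed OpenAI-compatible Groq endpoint; verify against Groq's documentation.
GROQ_URL = "https://api.groq.com/openai/v1/chat/completions"
MODEL = "meta-llama/llama-4-scout-17b-16e-instruct"  # id from the Groq playground link

def build_request(prompt: str) -> dict:
    """Assemble the JSON body for a single-turn chat completion."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }

body = build_request("Write 10 paragraphs about context windows.")

# To actually send it (requires GROQ_API_KEY):
# import os, requests
# resp = requests.post(
#     GROQ_URL,
#     headers={"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"},
#     json=body,
# )
# print(resp.json()["choices"][0]["message"]["content"])
```

At the quoted 460+ tokens per second, a response like the 10-paragraph demo in the video streams back in just a few seconds.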