
Qwen 3 is here! 🎉 In this video, I dive into Alibaba's latest series of models, featuring six dense models ranging from 600 million to 32 billion parameters and two mixture-of-experts models. The flagship model boasts 235 billion parameters with 22 billion active ones. 🚀 I cover how to access and use these models on platforms like Hugging Face and Kaggle. These models excel at coding tasks, outperforming most competitors, with some exceptions like Gemini 2.5 Pro. 💻 I'll also touch on unique features like hybrid thinking mode and agentic capabilities. You can pull these models from Hugging Face or run them locally. Don't miss out on trying them at chat.qwen.ai. Enjoy!

00:00 Introduction to Qwen 3
00:11 Overview of New Models
00:44 Performance Highlights
02:18 Accessing the Models
02:40 Hybrid Thinking Model
04:15 Training and Data Details
05:12 Comparison with Other Models
06:54 User Impressions and Reactions
08:46 Conclusion and Final Thoughts
---
type: transcript
date: 2025-04-29
youtube_id: K3MlgNf__bc
---

# Transcript: Qwen 3 in 8 Minutes

Qwen 3 is now here. In this video, I'm going to go over the latest series of models from the team over at Alibaba. We'll go through the blog post, and then we'll touch on how you can access the models and try all of this out. In terms of the models they're announcing today, there are six dense models ranging from 600 million parameters all the way up to 32 billion parameters, as well as two mixture-of-experts models. The flagship is a 235 billion parameter model with 22 billion active parameters, and they also have a smaller version, a 30 billion parameter model with 3 billion active parameters. There are a ton of really impressive metrics across these models. One thing to know is that they all come with an Apache 2.0 license. Where these models really shine is in coding tasks: we see scores that outperform almost across the board, with some exceptions like Gemini 2.5 Pro on a number of coding tasks. Now, one thing to note with this chart is that they exclude the Claude series of models — for instance, Sonnet 3.5 as well as Sonnet 3.7. Those models in particular are very strong at coding, so it would have been nice to see them within this chart as well. Now, of the models they released, one of the really impressive ones is their other mixture-of-experts model, the considerably smaller one at 30 billion parameters. When we rank this against even GPT-4o, we can see, basically across the board and by some very large margins, how well it performs. For Codeforces, the score is more than doubled. For LiveCodeBench, it's almost doubled. And for the AIME scores, it's multiples of GPT-4o, DeepSeek, as well as Gemma.
And basically, across these math and coding benchmarks, to get scores that outpace even GPT-4o is really quite amazing. The other nice thing with a mixture-of-experts model is that you get a better model but can run it at much lower inference cost, because only the active parameters are engaged when you actually run inference. Now, in terms of context length, the models range between 32,000 and 128,000 tokens. Basically, for any model of 8 billion parameters or larger, you're going to be able to leverage 128,000 tokens of context. In terms of accessing the models, you can get them through platforms like Hugging Face, ModelScope, as well as Kaggle. If you want to run this on your own hardware — and since there is quite a range, you'll definitely be able to run at least one of these models — you can pull them down with Ollama, LM Studio, MLX, llama.cpp, as well as KTransformers. Now, in terms of some features of the models: this is a hybrid thinking model, Alibaba's first. Thinking mode lets the model reason step by step before delivering an answer, for complex problems that require deeper thought. Alternatively, it can provide quick, near-instant responses, suitable for a lot of questions that don't require that thinking mode. Here we can see an example across the different benchmarks — AIME, as well as LiveCodeBench and GPQA. What we have here is the eval score plotted against the thinking budget, all the way up to 32,000 tokens. The more tokens you allocate toward that thinking time, the better the result.
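The inference-cost argument for mixture-of-experts above can be sketched with some back-of-envelope arithmetic. These helper functions and numbers are my own illustration, not from the blog post: an MoE model still has to hold all of its weights in memory, but each token's forward pass only touches the active subset.

```python
# Back-of-envelope MoE sketch (my own numbers, not from the blog post).

def moe_footprint_gb(total_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GB (fp16/bf16 = 2 bytes per parameter).
    All parameters must be resident, active or not."""
    return total_params * bytes_per_param / 1e9

def active_fraction(total_params: float, active_params: float) -> float:
    """Fraction of parameters engaged per token - a rough proxy for
    per-token compute relative to a dense model of the same total size."""
    return active_params / total_params

# Qwen3-30B-A3B: 30 billion total parameters, 3 billion active
print(moe_footprint_gb(30e9))      # 60.0 -> ~60 GB of weights to hold
print(active_fraction(30e9, 3e9))  # 0.1  -> ~10% of per-token compute
```

This is why a 30B-A3B model can feel like running a 3B model at inference time while still needing the memory of a 30B one (quantization reduces the footprint further).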
But obviously, the thinking process comes at the trade-off of having to wait for the answer, and it also incurs additional inference cost depending on the thinking budget that was set. Now, the Qwen 3 models support 119 languages and dialects, so if you speak other languages, this could potentially be a very good option to explore. The other thing to know is that the models are optimized for agentic capabilities. If you want to leverage a model for things like MCP or just tool calling, this series of models has been trained with those capabilities in mind. There's a demonstration within the blog post of an MCP interface going through a couple of different examples, highlighting that if you equip the model with a number of different tools, it can determine at which point to call each of them to ultimately finish whatever the operation might be. Now, in terms of some of the training details: they basically doubled the number of tokens Qwen 3 was trained on — approximately 36 trillion tokens, whereas Qwen 2.5 was trained on 18 trillion. For this dataset, they collected not only from the web but also from PDF-like documents. They leveraged Qwen2.5-VL to extract the text from those documents, and Qwen2.5 to improve the quality of the extracted content. And for math and code data, they also mention that they used synthetic data generated by both Qwen2.5-Coder and Qwen2.5-Math. The other thing to mention is that these are just text-in, text-out models. They don't have multimodality: you're not going to be able to generate images or pass in audio, or those types of things.
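In thinking mode, Qwen 3 emits its reasoning wrapped in `<think>...</think>` tags before the final answer. A minimal sketch of separating the two — the helper name and example strings are my own, not from Qwen's tooling:

```python
import re

def split_thinking(text: str) -> tuple[str, str]:
    """Split a Qwen 3 style response into (reasoning, answer).
    If no <think> block is present (non-thinking mode), the
    reasoning part comes back empty."""
    m = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    if not m:
        return "", text.strip()
    reasoning = m.group(1).strip()
    answer = text[m.end():].strip()
    return reasoning, answer

# A thinking-mode style response:
print(split_thinking("<think>2 + 2 = 4</think>\n\nThe answer is 4."))
# A quick, non-thinking response passes through unchanged:
print(split_thinking("The answer is 4."))
```

In practice, chat UIs use exactly this kind of split to collapse the reasoning behind a "thoughts" toggle while showing only the final answer.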
These are text-only models right now. Of all the charts within the blog post, this one probably stood out most to me. The reason is that we had Llama 4 come out at the beginning of April, and now, at the end of April, we have Qwen 3. So let's compare Llama 4 Maverick against this latest Qwen 3 flagship model. First off, right off the bat, this model is considerably smaller than Maverick, so you're going to be able to run it cheaper as well as faster. But if we look across the board — general tasks, mathematical tasks, multilingual tasks, as well as code tasks — for pretty much all of the stated evals, Qwen 3 outperforms, with just the exception of the INCLUDE multilingual benchmark, where it falls just shy by one basis point. The reaction I saw to this on Reddit was basically "rest in peace, Llama 4, April 2025 to April 2025." And that's just the nature of the pace of things right now within AI: it's only a matter of weeks before these different companies leapfrog each other on the benchmarks across the board. We've seen this from OpenAI, we've seen this from Anthropic, we've seen this from Gemini. I wouldn't be surprised if, in a number of weeks or days, these benchmarks from Qwen 3 are surpassed by another open-source model, whether it's DeepSeek, the latest version of Llama, or the much-anticipated open-source model expected from OpenAI over the coming months. In terms of the blog post, if you're interested in more information on things like post-training, or on how to actually develop with the model, there's all sorts of information within it. If you want to see some examples of running inference or some agentic use cases, I'll link all of that in the description. In terms of accessing the model, there are a ton of different options out there to choose from.
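The agentic tool-calling pattern mentioned earlier boils down to a dispatch loop: the model emits a structured tool call, the runtime executes it, and the result is fed back in as a new message. The tool registry and call format below are hypothetical — the real Qwen-Agent and MCP wiring differs — but the shape of the loop is the same:

```python
import json

# Hypothetical tool registry - stand-ins for real MCP servers or
# tool-calling functions the model would be equipped with.
TOOLS = {
    "get_weather": lambda city: f"Sunny in {city}",
    "add": lambda a, b: a + b,
}

def dispatch(tool_call_json: str):
    """Execute one model-emitted tool call of the (assumed) form
    {"name": ..., "arguments": {...}} and return the result, which
    would then be appended to the conversation for the next turn."""
    call = json.loads(tool_call_json)
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

print(dispatch('{"name": "add", "arguments": {"a": 2, "b": 3}}'))      # 5
print(dispatch('{"name": "get_weather", "arguments": {"city": "Oslo"}}'))
```

A model trained with these capabilities in mind is better at deciding *when* to emit such a call versus answering directly, which is the point the blog post's MCP demo is making.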
Now, arguably one of the easiest ways of trying out the model is to go to chat.qwen.ai. At the time of recording, they have both of the mixture-of-experts models — the 235 billion parameter as well as the 30 billion parameter model — as well as the 32 billion dense model. I was trying this out on an example of a web development task: over a number of prompts, I asked it for a simple SaaS interface, asked it to make it a little more complicated with each generation, and ultimately to make it look a bit nicer. My initial impression is that I was quite impressed, especially for a model that's completely open source. For this type of query, which I often test on other models, it gives a similar, if not better, response than things like Claude 3.7 as well as Gemini 2.5 Pro. So I was very impressed with my first look at the model. Additionally, if you want to dive into the specifics of each model, you can head over to their collection on Hugging Face — I'll link it in the description of the video as well. From Hugging Face, you can pull down the model, or you can deploy it to some of the options they have listed there. Personally, the way I run local models is with Ollama. It's really easy to get started: depending on your hardware, you can select one of the models, and you can also see how much space each will take up. For instance, it defaults to the 8 billion parameter model, but you can specify one of the smaller ones if you want. Install Ollama, copy the command, paste it into your terminal, and within a number of minutes, depending on the size of the model, you'll be able to run it locally. Now, just quickly, in terms of some of the reactions, I did see some pretty good ones on Reddit. That's it.
A 4 GB file programming better than me. And this one in particular, like I mentioned earlier: rest in peace, Llama 4, April 2025 to April 2025. Otherwise, that's pretty much it for this video. Kudos to the team over at Alibaba for this release. If you found this video useful, please comment, share, and subscribe.