TL;DR
Alibaba released Qwen 3 with eight models under an Apache 2 license, including a 235B mixture-of-experts flagship that beats Llama 4 Maverick on nearly every benchmark while being smaller and cheaper to run.
Alibaba's Qwen team released Qwen 3 at the end of April 2025, and the timing could not have been better. Llama 4 had launched at the beginning of the month to significant fanfare. Four weeks later, Qwen 3 arrived and outperformed it across nearly every benchmark.
The release includes eight models. Six are dense architectures ranging from 600 million parameters up to 32 billion parameters. Two are mixture-of-experts (MoE) models: the flagship at 235 billion total parameters with 22 billion active, and a smaller variant at 30 billion total with 3 billion active parameters.
Every model ships under an Apache 2 license. No restrictions on commercial use. No special agreements needed. Download, deploy, fine-tune, and ship to production without legal overhead.
The headline model is the 235 billion parameter MoE. With only 22 billion parameters active per forward pass, it delivers the knowledge capacity of a massive model with the inference cost of a much smaller one. This is the same architectural advantage that makes DeepSeek V3 so efficient, applied by a different team with different training data and techniques.
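To make the MoE advantage concrete, here is a minimal, illustrative sketch of top-k expert routing in pure Python. This is not Qwen 3's actual implementation; the router weights, dimensions, and k value are all made up for the example. The point is that a router scores every expert for each token, but only the top-k experts actually execute, so compute per token stays small.

```python
# Illustrative top-k mixture-of-experts routing (not Qwen 3's real code).
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(token_vec, router_weights, k=2):
    """Pick the top-k experts for one token and their normalized weights."""
    # Router logits: one score per expert (dot product with that expert's
    # router weight vector).
    logits = [sum(t * w for t, w in zip(token_vec, col)) for col in router_weights]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

random.seed(0)
num_experts, dim = 8, 4
router = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(num_experts)]
token = [0.5, -0.2, 0.1, 0.9]
chosen = route(token, router, k=2)
# Only 2 of the 8 experts execute for this token; the other 6 are skipped.
```

The same idea scales up: in the flagship, each token activates only 22B of the 235B total parameters.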
Where Qwen 3's flagship really shines is in coding benchmarks. The scores outperform nearly every model in the comparison, with some exceptions like Gemini 2.5 Pro. One notable omission from Alibaba's benchmark charts: the Claude Sonnet series. Claude 3.5 and 3.7 Sonnet were strong coding models that would have been useful reference points.
The direct comparison with Llama 4 Maverick is where the numbers become striking. Qwen 3's flagship beats Maverick on general tasks, mathematical reasoning, multilingual benchmarks, and coding. The only exception is a single multilingual benchmark where Maverick leads by a fraction of a point. Qwen 3 achieves this while being considerably smaller than Maverick, so it runs cheaper and faster.
The Reddit reaction at the time summarized it well: "Rest in peace Llama 4, April 2025 to April 2025."
The smaller MoE model might be the more impressive release. At 30 billion total parameters with 3 billion active, it is small enough to run on consumer hardware. Despite that, its benchmark scores rival and often exceed GPT-4o.
On CodeForces, the 30B model more than doubles GPT-4o's score. On LiveCodeBench, it nearly doubles it. On the AIME math benchmark, the scores are multiples of what GPT-4o, DeepSeek, and Gemma achieve.
Getting GPT-4o-level performance from a model that runs locally on a laptop is a meaningful shift. The MoE architecture makes this possible: compute per token only involves the active parameters, so inference is fast, and while the full parameter set still has to fit in memory, quantization brings a 30B model within reach of consumer machines.
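The back-of-envelope arithmetic, assuming 4-bit quantized weights (0.5 bytes per parameter), looks like this:

```python
def moe_footprint(total_params_b, active_params_b, bytes_per_param=0.5):
    """Rough MoE sizing. bytes_per_param=0.5 assumes 4-bit quantization."""
    weights_gb = total_params_b * bytes_per_param         # all experts stay resident
    active_ratio = active_params_b / total_params_b       # compute per token vs. dense
    return weights_gb, active_ratio

# Qwen 3 30B-A3B: roughly 15 GB of weights at 4-bit, but each token
# only touches about 10% of the parameters.
weights_gb, ratio = moe_footprint(30, 3)
```

The exact footprint varies with the quantization scheme and runtime overhead, but the shape of the tradeoff holds: dense-model quality per token at a fraction of the compute.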
Qwen 3 introduces Alibaba's first hybrid thinking mode. The concept is straightforward: for complex problems that benefit from step-by-step reasoning, the model enters a thinking mode where it works through the problem before delivering an answer. For simpler questions, it responds immediately without the reasoning overhead.
This mirrors the approach taken by OpenAI with o1/o3 and by Anthropic with extended thinking, but applied to an open-source model. You control the tradeoff through a thinking budget, measured in tokens.
The benchmark data shows a clear correlation between thinking budget and performance. On AIME, LiveCodeBench, and GPQA, allocating more tokens to the thinking process produces better results, up to the 32,000 token ceiling. The relationship is roughly linear: double the thinking budget, get a measurable quality improvement.
The tradeoff is cost and latency. Thinking tokens are generated tokens. More thinking means longer wait times and higher inference bills. For production applications, you would tune the thinking budget based on the task. A code completion suggestion does not need 32,000 tokens of reasoning. A complex architectural question might.
Qwen 3 supports 119 languages and dialects. For developers building products for non-English markets, this is a significant differentiator. Most open-source models are English-first with varying levels of support for other languages. Qwen 3 was explicitly trained for broad multilingual capability.
The 36 trillion token training dataset, double what Qwen 2.5 used, drew from web data and PDF documents. The team used Qwen 2.5 VL to extract text from documents and Qwen 2.5 to improve the quality of the extracted content. For math and code data, they used synthetic data generation from Qwen 2.5 Coder and Qwen 2.5 Math.
One limitation: Qwen 3 models are text-in, text-out only. No multimodal inputs. No image generation. No audio processing. These are pure language models.
The Qwen 3 models were specifically trained for agentic workflows. Tool calling, MCP integration, and multi-step task execution are first-class capabilities.
The blog post included demonstrations of the model working with an MCP interface, selecting appropriate tools based on the task, and chaining tool calls to complete operations. For developers building agent systems, having a model that handles tool selection reliably is essential. Poor tool-calling accuracy breaks the entire agent loop.
This makes Qwen 3 particularly interesting for the growing ecosystem of MCP-based applications. If you are building an agent that needs to interact with databases, file systems, APIs, or other tools through MCP servers, Qwen 3 was designed with that workflow in mind.
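The agent-side plumbing is the same regardless of model. Here is a hedged sketch of the dispatch step: the model emits a tool call as structured JSON (the `{"name": ..., "arguments": ...}` shape shown is the common OpenAI-style convention; your serving stack may differ), and the agent loop routes it to a local registry. The tools and their names here are hypothetical stand-ins.

```python
import json

# Hypothetical local tool registry; in a real agent these would wrap
# MCP server calls or APIs.
TOOLS = {
    "get_weather": lambda city: f"Sunny in {city}",
    "list_files": lambda path: ["a.txt", "b.txt"],
}

def dispatch(tool_call: dict):
    """Execute one model-emitted tool call of the shape
    {"name": ..., "arguments": "{...json...}"}."""
    fn = TOOLS.get(tool_call["name"])
    if fn is None:
        return {"error": f"unknown tool {tool_call['name']!r}"}
    args = json.loads(tool_call["arguments"])
    return {"result": fn(**args)}

out = dispatch({"name": "get_weather", "arguments": '{"city": "Berlin"}'})
```

This is exactly the loop that poor tool-calling accuracy breaks: if the model picks the wrong name or emits malformed arguments, the dispatch fails and the agent stalls.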
The quickest way to try the models is at chat.qwen.ai. At launch, the interface offered both MoE models (235B and 30B) as well as the 32B dense model.
For running models locally, the options include:

```shell
# Ollama (simplest option)
ollama run qwen3:8b

# For the larger models, specify the variant
ollama run qwen3:32b
```
The models are also available through LM Studio, MLX, llama.cpp, and KTransformers. Model files range from a few gigabytes for the smallest variants to significantly more for the larger ones. Every model of 8 billion parameters or larger supports a 128,000 token context window.
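Beyond the CLI, Ollama serves a local HTTP API at `http://localhost:11434`, which is how you would wire a local Qwen 3 into an application. A sketch of the request body for its `/api/generate` endpoint (the option values here are examples, not recommendations):

```python
def ollama_payload(model: str, prompt: str, num_ctx: int = 8192) -> dict:
    """Build a request body for Ollama's local /api/generate endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,                  # return one complete response
        "options": {"num_ctx": num_ctx},  # context window for this request
    }

body = ollama_payload("qwen3:8b", "Explain MoE routing in one sentence.")
# POST this as JSON to http://localhost:11434/api/generate with any HTTP client.
```

No API keys involved: the request never leaves your machine.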
The models are available on Hugging Face, ModelScope, and Kaggle. You can pull them down directly or deploy through the hosting options those platforms provide.
For production inference, the MoE models offer the best value: higher quality per compute dollar than the dense models, thanks to the efficient routing architecture.
The context length scales with model size:
| Model Size | Context Length |
|---|---|
| 0.6B - 4B | 32,000 tokens |
| 8B+ | 128,000 tokens |
128,000 tokens is sufficient for most application workloads. It covers full codebases, long documents, and extended conversation histories without truncation.
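If you are selecting a variant programmatically, the table above reduces to a one-line lookup (a convenience helper of my own, not part of any Qwen tooling):

```python
def context_window(params_b: float) -> int:
    """Qwen 3 context length by model size, per the table above."""
    return 128_000 if params_b >= 8 else 32_000

# The 0.6B-4B dense models truncate past 32k tokens; 8B and up handle 128k.
```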
The scale of the training pipeline is notable. Qwen 3 was trained on approximately 36 trillion tokens, double the 18 trillion used for Qwen 2.5. The data came from multiple sources: web content, text extracted from PDF documents, and synthetic math and code examples generated by the previous model generation.
Using the previous generation of models to improve training data for the next generation is a pattern we see across the industry. It creates a compounding effect where each model release produces better data for the next one.
The decision to use Qwen 2.5 VL for document extraction is practical. PDF documents contain charts, tables, and formatted text that simple text extraction misses. A vision-language model can read the visual layout and produce more accurate text representations. This gives Qwen 3 better understanding of structured information like technical documentation, research papers, and financial reports.
First impressions from hands-on testing were positive. When given web development tasks starting from simple prompts and progressively adding complexity, the model produced output comparable to Claude 3.7 Sonnet and Gemini 2.5 Pro.
For a fully open-source model, matching proprietary models on practical coding tasks is the real benchmark that matters. Academic benchmarks measure specific capabilities in controlled conditions. Real-world coding involves understanding vague requirements, making reasonable design choices, and producing clean, working code. Qwen 3 performed well on all three.
The community reception was enthusiastic. One Reddit comment captured the reaction to the smallest models: "A 4GB file programming better than me." The combination of small file size and strong coding performance made the model immediately accessible to developers who had never run a local LLM before.
The open-source model landscape in April 2025 was moving fast. Llama 4 launched early in the month. Qwen 3 arrived at the end. Within weeks, the benchmarks showed Qwen 3 ahead on nearly every metric.
For developers choosing an open-source model for production, Qwen 3 offered:

- Apache 2.0 licensing with no commercial restrictions
- Benchmark wins over Llama 4 Maverick at a considerably smaller size
- MoE efficiency, with only 22B (flagship) or 3B (30B variant) parameters active per token
- Support for 119 languages and dialects
- A tunable hybrid thinking mode with a token-based thinking budget
- First-class tool calling and MCP support for agentic workflows
The pace of open-source model releases in 2025 meant that any model's lead was temporary. DeepSeek, Llama, and others were all working on their next releases. But at the time of its launch, Qwen 3 was the strongest open-source model available, particularly for coding and reasoning tasks.
The smaller 30B MoE model deserves special attention. Being able to run a model locally that competes with GPT-4o on coding benchmarks, using hardware you already own, is the kind of shift that changes how developers think about AI integration. No API keys. No usage limits. No data leaving your machine. That is the promise of open-source AI models, and Qwen 3 delivered on it more convincingly than any release before it.