TL;DR
Alibaba released Qwen 3 with eight models under an Apache 2 license, including a 235B mixture-of-experts flagship that beats Llama 4 Maverick on nearly every benchmark while being smaller and cheaper to run.
Alibaba's Qwen team released Qwen 3 at the end of April 2025, and the timing could not have been better. Llama 4 had launched at the beginning of the month to significant fanfare. Four weeks later, Qwen 3 arrived and outperformed it across nearly every benchmark.
The release includes eight models. Six are dense architectures ranging from 600 million parameters up to 32 billion parameters. Two are mixture-of-experts (MoE) models: the flagship at 235 billion total parameters with 22 billion active, and a smaller variant at 30 billion total with 3 billion active parameters.
Every model ships under an Apache 2 license. No restrictions on commercial use. No special agreements needed. Download, deploy, fine-tune, and ship to production without legal overhead.
The headline model is the 235 billion parameter MoE. With only 22 billion parameters active per forward pass, it delivers the knowledge capacity of a massive model with the inference cost of a much smaller one. This is the same architectural advantage that makes DeepSeek V3 so efficient, applied by a different team with different training data and techniques.
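To make the MoE advantage concrete, here is a minimal, illustrative sketch of top-k expert routing in pure Python. This is not Qwen 3's actual implementation; the router weights, dimensions, and k value are all made up for the example. The point is that a router scores every expert for each token, but only the top-k experts actually execute, so compute per token stays small.

```python
# Illustrative top-k mixture-of-experts routing (not Qwen 3's real code).
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(token_vec, router_weights, k=2):
    """Pick the top-k experts for one token and their normalized weights."""
    # Router logits: one score per expert (dot product with that expert's
    # router weight vector).
    logits = [sum(t * w for t, w in zip(token_vec, col)) for col in router_weights]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

random.seed(0)
num_experts, dim = 8, 4
router = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(num_experts)]
token = [0.5, -0.2, 0.1, 0.9]
chosen = route(token, router, k=2)
# Only 2 of the 8 experts execute for this token; the other 6 are skipped.
```

The same idea scales up: in the flagship, each token activates only 22B of the 235B total parameters.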
Where Qwen 3's flagship really shines is in coding benchmarks. The scores outperform nearly every model in the comparison, with some exceptions like Gemini 2.5 Pro. One notable omission from Alibaba's benchmark charts: the Claude Sonnet series. Claude 3.5 and 3.7 Sonnet were strong coding models that would have been useful reference points.
The direct comparison with Llama 4 Maverick is where the numbers become striking. Qwen 3's flagship beats Maverick on general tasks, mathematical reasoning, multilingual benchmarks, and coding. The only exception is a single multilingual benchmark where Maverick leads by a fraction of a point. Qwen 3 achieves this while being considerably smaller than Maverick, so it runs cheaper and faster.
The Reddit reaction at the time summarized it well: "Rest in peace Llama 4, April 2025 to April 2025."
The smaller MoE model might be the more impressive release. At 30 billion total parameters with 3 billion active, it is small enough to run on consumer hardware. Despite that, its benchmark scores rival and often exceed GPT-4o.
On CodeForces, the 30B model more than doubles GPT-4o's score. On LiveCodeBench, it nearly doubles it. On the AIME math benchmark, the scores are multiples of what GPT-4o, DeepSeek, and Gemma achieve.
Getting GPT-4o-level performance from a model that runs locally on a laptop is a meaningful shift. The MoE architecture makes this possible: compute per token only involves the active parameters, so inference is fast, and while the full parameter set still has to fit in memory, quantization brings a 30B model within reach of consumer machines.
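The back-of-envelope arithmetic, assuming 4-bit quantized weights (0.5 bytes per parameter), looks like this:

```python
def moe_footprint(total_params_b, active_params_b, bytes_per_param=0.5):
    """Rough MoE sizing. bytes_per_param=0.5 assumes 4-bit quantization."""
    weights_gb = total_params_b * bytes_per_param         # all experts stay resident
    active_ratio = active_params_b / total_params_b       # compute per token vs. dense
    return weights_gb, active_ratio

# Qwen 3 30B-A3B: roughly 15 GB of weights at 4-bit, but each token
# only touches about 10% of the parameters.
weights_gb, ratio = moe_footprint(30, 3)
```

The exact footprint varies with the quantization scheme and runtime overhead, but the shape of the tradeoff holds: dense-model quality per token at a fraction of the compute.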
Qwen 3 introduces Alibaba's first hybrid thinking mode. The concept is straightforward: for complex problems that benefit from step-by-step reasoning, the model enters a thinking mode where it works through the problem before delivering an answer. For simpler questions, it responds immediately without the reasoning overhead.
This mirrors the approach taken by OpenAI with o1/o3 and by Anthropic with extended thinking, but applied to an open-source model. You control the tradeoff through a thinking budget, measured in tokens.
The benchmark data shows a clear correlation between thinking budget and performance. On AIME, LiveCodeBench, and GPQA, allocating more tokens to the thinking process produces better results, up to the 32,000 token ceiling. The relationship is roughly linear: double the thinking budget, get a measurable quality improvement.
The tradeoff is cost and latency. Thinking tokens are generated tokens. More thinking means longer wait times and higher inference bills. For production applications, you would tune the thinking budget based on the task. A code completion suggestion does not need 32,000 tokens of reasoning. A complex architectural question might.
Qwen 3 supports 119 languages and dialects. For developers building products for non-English markets, this is a significant differentiator. Most open-source models are English-first with varying levels of support for other languages. Qwen 3 was explicitly trained for broad multilingual capability.
The 36 trillion token training dataset, double what Qwen 2.5 used, drew from web data and PDF documents. The team used Qwen 2.5 VL to extract text from documents and Qwen 2.5 to improve the quality of the extracted content. For math and code data, they used synthetic data generation from Qwen 2.5 Coder and Qwen 2.5 Math.
One limitation: Qwen 3 models are text-in, text-out only. No multimodal inputs. No image generation. No audio processing. These are pure language models.
The Qwen 3 models were specifically trained for agentic workflows. Tool calling, MCP integration, and multi-step task execution are first-class capabilities.
The blog post included demonstrations of the model working with an MCP interface, selecting appropriate tools based on the task, and chaining tool calls to complete operations. For developers building agent systems, having a model that handles tool selection reliably is essential. Poor tool-calling accuracy breaks the entire agent loop.
This makes Qwen 3 particularly interesting for the growing ecosystem of MCP-based applications. If you are building an agent that needs to interact with databases, file systems, APIs, or other tools through MCP servers, Qwen 3 was designed with that workflow in mind.
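The agent-side plumbing is the same regardless of model. Here is a hedged sketch of the dispatch step: the model emits a tool call as structured JSON (the `{"name": ..., "arguments": ...}` shape shown is the common OpenAI-style convention; your serving stack may differ), and the agent loop routes it to a local registry. The tools and their names here are hypothetical stand-ins.

```python
import json

# Hypothetical local tool registry; in a real agent these would wrap
# MCP server calls or APIs.
TOOLS = {
    "get_weather": lambda city: f"Sunny in {city}",
    "list_files": lambda path: ["a.txt", "b.txt"],
}

def dispatch(tool_call: dict):
    """Execute one model-emitted tool call of the shape
    {"name": ..., "arguments": "{...json...}"}."""
    fn = TOOLS.get(tool_call["name"])
    if fn is None:
        return {"error": f"unknown tool {tool_call['name']!r}"}
    args = json.loads(tool_call["arguments"])
    return {"result": fn(**args)}

out = dispatch({"name": "get_weather", "arguments": '{"city": "Berlin"}'})
```

This is exactly the loop that poor tool-calling accuracy breaks: if the model picks the wrong name or emits malformed arguments, the dispatch fails and the agent stalls.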
The quickest way to try the models is at chat.qwen.ai. At launch, the interface offered both MoE models (235B and 30B) as well as the 32B dense model.
For running models locally, the options include:

```shell
# Ollama (simplest option)
ollama run qwen3:8b

# For the larger models, specify the variant
ollama run qwen3:32b
```
The models are also available through LM Studio, MLX, llama.cpp, and KTransformers. Model files range from a few gigabytes for the smallest variants to significantly more for the larger ones. Every model of 8 billion parameters or larger supports a 128,000 token context window.
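Beyond the CLI, Ollama serves a local HTTP API at `http://localhost:11434`, which is how you would wire a local Qwen 3 into an application. A sketch of the request body for its `/api/generate` endpoint (the option values here are examples, not recommendations):

```python
def ollama_payload(model: str, prompt: str, num_ctx: int = 8192) -> dict:
    """Build a request body for Ollama's local /api/generate endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,                  # return one complete response
        "options": {"num_ctx": num_ctx},  # context window for this request
    }

body = ollama_payload("qwen3:8b", "Explain MoE routing in one sentence.")
# POST this as JSON to http://localhost:11434/api/generate with any HTTP client.
```

No API keys involved: the request never leaves your machine.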
The models are available on Hugging Face, ModelScope, and Kaggle. You can pull them down directly or deploy through the hosting options those platforms provide.
For production inference, the MoE models offer the best value: higher quality per compute dollar than the dense models, thanks to the efficient routing architecture.
The context length scales with model size:
| Model Size | Context Length |
|---|---|
| 0.6B - 4B | 32,000 tokens |
| 8B+ | 128,000 tokens |
128,000 tokens is sufficient for most application workloads. It covers full codebases, long documents, and extended conversation histories without truncation.
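If you are selecting a variant programmatically, the table above reduces to a one-line lookup (a convenience helper of my own, not part of any Qwen tooling):

```python
def context_window(params_b: float) -> int:
    """Qwen 3 context length by model size, per the table above."""
    return 128_000 if params_b >= 8 else 32_000

# The 0.6B-4B dense models truncate past 32k tokens; 8B and up handle 128k.
```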
The scale of the training pipeline is notable. Qwen 3 was trained on approximately 36 trillion tokens, double the 18 trillion used for Qwen 2.5. The data came from multiple sources: web content, text extracted from PDF documents, and synthetic math and code examples generated by the previous model generation.
Using the previous generation of models to improve training data for the next generation is a pattern we see across the industry. It creates a compounding effect where each model release produces better data for the next one.
The decision to use Qwen 2.5 VL for document extraction is practical. PDF documents contain charts, tables, and formatted text that simple text extraction misses. A vision-language model can read the visual layout and produce more accurate text representations. This gives Qwen 3 better understanding of structured information like technical documentation, research papers, and financial reports.
First impressions from hands-on testing were positive. When given web development tasks starting from simple prompts and progressively adding complexity, the model produced output comparable to Claude 3.7 Sonnet and Gemini 2.5 Pro.
For a fully open-source model, matching proprietary models on practical coding tasks is the real benchmark that matters. Academic benchmarks measure specific capabilities in controlled conditions. Real-world coding involves understanding vague requirements, making reasonable design choices, and producing clean, working code. Qwen 3 performed well on all three.
The community reception was enthusiastic. One Reddit comment captured the reaction to the smallest models: "A 4GB file programming better than me." The combination of small file size and strong coding performance made the model immediately accessible to developers who had never run a local LLM before.
The open-source model landscape in April 2025 was moving fast. Llama 4 launched early in the month. Qwen 3 arrived at the end. Within weeks, the benchmarks showed Qwen 3 ahead on nearly every metric.
For developers choosing an open-source model for production, Qwen 3 offered:

- Apache 2.0 licensing with no commercial restrictions
- Benchmark wins over Llama 4 Maverick at a considerably smaller size
- MoE efficiency, with only 22B (flagship) or 3B (30B variant) parameters active per token
- Support for 119 languages and dialects
- A tunable hybrid thinking mode with a token-based thinking budget
- First-class tool calling and MCP support for agentic workflows
The pace of open-source model releases in 2025 meant that any model's lead was temporary. DeepSeek, Llama, and others were all working on their next releases. But at the time of its launch, Qwen 3 was the strongest open-source model available, particularly for coding and reasoning tasks.
The smaller 30B MoE model deserves special attention. Being able to run a model locally that competes with GPT-4o on coding benchmarks, using hardware you already own, is the kind of shift that changes how developers think about AI integration. No API keys. No usage limits. No data leaving your machine. That is the promise of open-source AI models, and Qwen 3 delivered on it more convincingly than any release before it.