
GLM-5.2
6 partsTL;DR
Unsloth's dynamic quantization makes GLM-5.2 runnable on a 256GB Mac or a 24GB GPU with CPU offloading. Here is the hardware math, the quantization tradeoffs, and what the HN community learned from actually running it.
GLM-5.2 is Z.ai's flagship open-weights model - 744 billion parameters total, 40 billion active (it is a mixture-of-experts architecture), and a 1 million token context window. Running it unquantized requires around 800GB of memory. Running it at all seemed out of reach for anyone without datacenter hardware.
Then Unsloth shipped dynamic quantization support for GLM-5.2, and the math changed.
The official Unsloth docs lay out the hardware requirements for each quantization level:
| Quantization | Memory Needed | Disk Space |
|---|---|---|
| 1-bit (UD-IQ1_S) | 223 GB | ~90 GB |
| 2-bit (UD-IQ2_M) | 245 GB | ~239 GB |
| 3-bit | 290-360 GB | ~300 GB |
| 4-bit | 372-475 GB | ~400 GB |
| 8-bit | 810 GB | ~800 GB |
The 2-bit quantization can fit on a 256GB unified memory Mac. It can also run on a single 24GB GPU (like an RTX 3090) with 256GB of system RAM using MoE offloading - the active parameters stay in VRAM while inactive experts get paged from system memory.
Unsloth's "dynamic" quantization approach preserves critical layers at higher precision while compressing less important ones. According to their testing, 4-bit dynamic is "essentially lossless" on standard benchmarks.
The discussion thread has 188 comments and the conversation centers on whether these quantization claims hold up in practice.
The skeptics: One commenter warns that "lossless" claims are "often made based on KL-divergence over some arbitrary corpus, not performance in the real world or benchmarks." Their experience: "I need to go a couple steps past whatever quantizations are good enough in the KL-divergence testing to get good performance in real tasks with long context. So when Q4 is claimed to be lossless I end up with Q5 or Q6."
The practical coders: Another commenter reports that for coding work specifically, "ideal range is at least Q8." The 2-bit and 4-bit variants work for general use but degrade on tasks requiring precise reasoning over long contexts.
The hardware math crowd: Several comments work through generation speed calculations. The formula is straightforward: token generation requires reading all active weights per token. With 40B active parameters at 4-bit quantization, that is 20GB of weight reads per token. Divide by your memory bandwidth to get tokens per second.
One commenter breaks it down: "With 100GB/s [memory bandwidth] you get 5 tokens per second." The RTX 3090 has roughly 936GB/s bandwidth, so you would expect around 40-50 tok/s if the weights fit entirely in VRAM - which they do not, hence the offloading penalty.
The cost reality check: An earlier thread claimed running GLM-5.2 locally would cost "$500k in hardware." Commenters here pushed back hard. The actual math: 6x RTX 6000 PRO Blackwell cards (576GB VRAM total) plus supporting hardware runs around $80-90k for 120 tok/s at NVFP4 precision. You could get 40 tok/s decode for under $50k.
A single GB300 workstation at the official $85k price point can also handle it, likely exceeding 120 tok/s.
The Mac crowd: M-series Macs with 256GB unified memory can run the 2-bit variant directly. One commenter estimates "M5 Ultra will ship before end of year" with 256GB max, though RAM shortages may limit availability.
Get the weekly deep dive
Tutorials on Claude Code, AI agents, and dev tools - delivered free every week.
From the archive
Jun 23, 2026 • 8 min read
Jun 22, 2026 • 6 min read
Jun 22, 2026 • 5 min read
Jun 22, 2026 • 8 min read
Unsloth provides a one-liner install:
curl -fsSL https://unsloth.ai/install.sh | sh
Then start the local inference server:
unsloth studio -H 0.0.0.0 -p 8888
The studio interface handles model download, GPU detection, and automatic offloading configuration. Models are available on Hugging Face.
For direct llama.cpp usage, you can reduce memory further with KV cache quantization. Using q4_0 cache quantization extends context capacity by roughly 3.5x at minimal quality cost for most tasks.
From the Unsloth docs, GLM-5.2's benchmark numbers:
| Benchmark | Score |
|---|---|
| AIME 2026 | 99.2% |
| SWE-bench Pro | 62.1% |
| MCP-Atlas | 76.8% |
The model includes three thinking modes: non-thinking, High, and Max. You control these via enable_thinking and reasoning_effort parameters. For coding tasks, Unsloth recommends temperature 1.0 and top-p 0.95 (or 1.0 for SWE-Bench Pro specifically).
The recommended settings for local deployment:
The HN discussion surfaced a practical question: when does running GLM-5.2 locally beat using the API?
Local wins when:
API wins when:
For the cost math on API access, see our GLM-5.2 free and cheap access guide. For comparing GLM-5.2 against other coding models, see the coding model showdown.
The fact that a 744B-parameter frontier model can run on consumer hardware - even with quality tradeoffs - marks a shift. A year ago, "local LLM" meant 7B or 13B models that could not compete with API offerings. Now the gap is narrowing.
Several commenters noted they are running Qwen3.6 27B (the non-MoE version) locally on 24GB cards for daily coding work. One described it as "smart enough to do debugging, refactoring, and implementing 'clean' specs" - not flagship-level, but genuinely useful.
GLM-5.2 at 2-bit is slower and potentially lower quality than the API version, but it is the same model. That is new territory for local inference.
Yes, if you have 256GB unified memory. The 2-bit quantization (UD-IQ2_M) fits in 245GB. M3 Ultra with 256GB works. M4 Max with 128GB does not - you would need to wait for M5 Ultra or go the Linux/Windows route with a GPU + system RAM setup.
A 24GB GPU (RTX 3090, RTX 4090, A6000) works with MoE offloading if you also have 256GB of system RAM. The active 40B parameters fit in VRAM; inactive experts page from system memory. Expect 5-15 tok/s depending on your memory bandwidth.
For general use and simple coding tasks, yes. For precise reasoning over long contexts, commenters report needing Q5 or Q6 to match full-precision behavior. The "lossless" claims are based on benchmark metrics that may not reflect your specific workload.
Smaller models (Qwen3 27B, Llama 3 70B) run faster and require less hardware, but have capability ceilings. GLM-5.2 at 2-bit is slower but has access to the same 744B parameter knowledge - the quantization compresses the weights, not the capability surface. Whether that tradeoff makes sense depends on your tasks.
Read next
Z.ai shipped GLM-5.2 in mid-June with a usable 1M-token context window, two thinking-effort levels, and MIT open weights now released. Here is the setup guide for Claude Code, pricing breakdown, and what to test before the benchmarks arrive.
8 min readGLM-5.2 ships under an MIT license, so it is hosted everywhere - and a few places run it for free or nearly free right now. Here is every way to access Z.ai's open-weights coding model, from OpenCode Go referral credits and Devin to the cheapest per-token routes on OpenRouter, Fireworks, and DeepInfra, plus local Ollama.
10 min readChoosing a local coding LLM in 2026 means balancing benchmark performance, hardware cost, and the compliance pressure to keep code off third-party servers. Here is what to run and on what hardware.
8 min readTechnical content at the intersection of AI and development. Building with AI agents, Claude Code, and modern dev tools - then showing you exactly how it works.
Open-source OpenAI API replacement. Runs LLMs, vision, voice, image, and video models on any hardware - no GPU require...
View ToolThe easiest way to run LLMs locally. One command to pull and run any model. OpenAI-compatible API. 52M+ monthly download...
View ToolDesktop app for discovering, downloading, and running local LLMs. Clean chat UI, OpenAI-compatible API server, and autom...
View ToolC++ inference engine for LLMs. GGUF format, quantization, CPU and Metal/CUDA support. The foundation most local tools bu...
View ToolInstall Ollama and LM Studio, pull your first model, and run AI locally for coding, chat, and automation - with zero cloud dependency.
Getting StartedBackground monitoring of logs, files, and long-running processes.
Claude CodeUse opus, sonnet, haiku, and best to switch models easily.
Claude Code
Z.ai shipped GLM-5.2 in mid-June with a usable 1M-token context window, two thinking-effort levels, and MIT open weights...

GLM-5.2 ships under an MIT license, so it is hosted everywhere - and a few places run it for free or nearly free right n...

Choosing a local coding LLM in 2026 means balancing benchmark performance, hardware cost, and the compliance pressure to...

New benchmark data shows GPT-5.5 hallucinates 86% of the time when it does not know the answer - versus 28% for the open...

A new paper shows a 3B parameter model hitting 94.3 on AIME26 and 96.1% on LeetCode contests - matching or exceeding mod...

Switzerland's fully open foundation model promises transparent training data and EU compliance. The HN crowd has questio...

New tutorials, open-source projects, and deep dives on coding agents - delivered weekly.