
TL;DR
Modern LLMs now use MoE routing, mixed attention variants, and fused vision encoders. The simple transformer stack is gone - here's what replaced it and why it matters for developers.
The original transformer architecture from "Attention Is All You Need" (2017) was elegant. Feed-forward layers, multi-head attention, residual connections. You could sketch it on a napkin. That era is over.
A recent post by Ian Barber argues that LLM architectures have crossed a complexity threshold similar to what happened with recommendation systems a decade ago. The simple stack that powered GPT-2 and early GPT-3 has been replaced by a maze of optimization techniques that are now load-bearing.
Modern frontier models deploy a grab-bag of architectural innovations:
Attention variants everywhere. Models now use "query grouping, compressed, sparse, linear, sliding-window" attention, sometimes several variants in the same model. Grouped Query Attention (GQA) alone has become standard because letting several query heads share one key/value head shrinks the KV cache during inference, roughly in proportion to the number of heads sharing.
Mixture-of-Experts routing. In recent architectures, MoE extends from feedforward blocks to the residual stream itself. DeepSeek-V3 routes across 256 experts per layer with only 8 active per token. The router is itself trained, which introduces new failure modes such as uneven load across experts.
Vision and audio encoders are no longer bolted on. Early multimodal approaches treated vision as a separate tower feeding into a frozen language model. Current architectures weave modality encoders throughout the stack, with cross-attention at multiple layers.
Multi-GPU comms become architectural. Once inference spans multiple GPUs, communication operations (all-reduce, all-gather) become part of the computation graph. Where you place a layer split matters for latency.
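To make the Mixture-of-Experts item above concrete, here is a minimal sketch of top-k expert routing for a single token, in plain Python. The function name `route_token`, the softmax-then-top-k order, and the toy sizes (8 experts, 2 active) are assumptions for illustration; production routers differ in detail and operate on whole batches of tokens.

```python
import math

def route_token(router_logits, k):
    """Pick the top-k experts for one token; return (expert_index, weight) pairs.

    router_logits: one score per expert, as produced by a learned linear layer.
    The weights of the chosen experts are renormalized to sum to 1.
    """
    # Softmax over all experts (subtract the max for numerical stability).
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the k most probable experts.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

    # Renormalize so the selected experts' weights sum to 1.
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# Toy router scores for 8 hypothetical experts.
logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.2]
chosen = route_token(logits, k=2)
print(chosen)
```

The token's output would then be the weighted sum of only the chosen experts' outputs, which is why per-token compute stays small while total parameter count grows. Because the scores come from a learned layer, a router that keeps favoring a few experts leaves the others undertrained.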
Barber draws a useful parallel to recommendation systems. For most of the 2010s, recommendation models were relatively straightforward two-tower architectures with sparse embeddings. They got complicated when:
"The gap between performance being an optimization and performance being a necessity became very, very small."
At scale, every percentage point of efficiency matters. The original transformer had known inefficiencies - quadratic attention scaling, dense activations, redundant computation. One by one, researchers found optimizations. Each optimization hardened into the baseline.
The problem: once an optimization is load-bearing, you cannot easily experiment without it. Removing GQA from a modern model means you cannot fit the same context length. Removing MoE means you cannot match the same capabilities at the same inference cost. Barber puts it directly:
"You can't hand-fuse your way back without investing significant time that might not be worth it."
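The context-length claim about GQA can be checked with back-of-the-envelope arithmetic. The sketch below estimates KV cache size for one sequence. The model shape (32 layers, 32 query heads, head dimension 128, 16-bit values, 32,768 tokens) is a hypothetical configuration chosen for round numbers, not any specific model's.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Approximate KV cache size for a single sequence.

    Each layer stores one key and one value vector (hence the factor 2)
    per KV head per token.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical model shape, chosen only for illustration.
N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 32_768

# Standard multi-head attention: one KV head per query head.
mha = kv_cache_bytes(N_LAYERS, n_kv_heads=N_QUERY_HEADS, head_dim=HEAD_DIM, seq_len=SEQ_LEN)
# GQA with 8 KV heads: each KV head is shared by 4 query heads.
gqa = kv_cache_bytes(N_LAYERS, n_kv_heads=8, head_dim=HEAD_DIM, seq_len=SEQ_LEN)

GIB = 1024 ** 3
print(f"MHA (32 KV heads): {mha / GIB:.1f} GiB")  # 16.0 GiB
print(f"GQA (8 KV heads):  {gqa / GIB:.1f} GiB")  # 4.0 GiB
print(f"Reduction: {mha // gqa}x")                # 4x
```

Under these assumptions, the same memory budget that holds one 32K-token sequence with full multi-head attention holds four with GQA, which is the sense in which the optimization becomes load-bearing.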
The HN discussion surfaced several practical observations from people working with these models.
One commenter tracking llama.cpp development noted the implementation gap:
"The earlier models were always fully implemented. Yet with more contributors, as of today tons of latest models only have partial implementation. DeepSeekv3.2 isn't fully implemented, same with KimiK2.6, GLM5.2+, DeepSeekv4 has no implementation, MiniMaxM3 not supported yet."
The architectural diversity means inference libraries cannot keep up. Features that a model relies on may simply not exist in your preferred runtime.
Another commenter framed it as the "bitter lesson lifecycle":
"When a technique or technology is new people are making massive gains by just applying it to some use case, or gathering more data for training. As time goes on those 'bitter lesson' gains start to hit the shallow part of the logistic curve and companies have to start investing more and more effort into engineering for each small, incremental gain."
The easy scaling gains are behind us. What remains is feature engineering at the architecture level.
Barber highlights FlexAttention in PyTorch as a potential solution. FlexAttention lets you define custom attention patterns that compile down to efficient Triton kernels:
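```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, Q_LEN, KV_LEN, D = 1, 4, 1024, 1024, 64  # illustrative sizes

def causal_mask(b, h, q_idx, kv_idx):
    # A query may attend only to itself and earlier positions.
    return q_idx >= kv_idx

# Assumes a CUDA device is available.
query, key, value = (torch.randn(B, H, Q_LEN, D, device="cuda") for _ in range(3))
block_mask = create_block_mask(causal_mask, B, H, Q_LEN, KV_LEN, device="cuda")
output = flex_attention(query, key, value, block_mask=block_mask)
```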
The abstraction covers sliding-window, causal, document-masking, and custom patterns without hand-written CUDA. The performance cost is reportedly mild (single-digit percentage overhead vs. hand-tuned kernels).
This matters because it restores some ability to experiment. If attention is composable and fast, researchers can try new patterns without reimplementing everything from scratch.
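Each of the mask patterns mentioned above is a small predicate over query and key positions. The sketch below writes three of them in the `(b, h, q_idx, kv_idx)` form that `create_block_mask` accepts, and renders them densely in plain Python so the patterns are visible without a GPU. The helper `render`, the window size, and the `DOC_ID` layout are assumptions for illustration. The predicates use `&` rather than `and` so the same logic carries over when FlexAttention passes index tensors; in that setting `DOC_ID` would need to be a tensor rather than a list.

```python
WINDOW = 3                    # hypothetical sliding-window size
DOC_ID = [0, 0, 0, 1, 1, 1]   # hypothetical: two packed documents of 3 tokens each

def causal(b, h, q_idx, kv_idx):
    # A query may attend only to itself and earlier positions.
    return q_idx >= kv_idx

def sliding_window(b, h, q_idx, kv_idx):
    # Causal, and the key is at most WINDOW - 1 positions behind the query.
    return (q_idx >= kv_idx) & (q_idx - kv_idx < WINDOW)

def document_causal(b, h, q_idx, kv_idx):
    # Causal, and query and key belong to the same packed document.
    return (q_idx >= kv_idx) & (DOC_ID[q_idx] == DOC_ID[kv_idx])

def render(mask_mod, n):
    """Materialize an n x n mask as text rows: '#' = may attend, '.' = masked."""
    return ["".join("#" if mask_mod(0, 0, q, k) else "." for k in range(n))
            for q in range(n)]

for name, fn in [("causal", causal),
                 ("sliding_window", sliding_window),
                 ("document_causal", document_causal)]:
    print(name)
    print("\n".join(render(fn, 6)))
```

Swapping one predicate for another changes the attention pattern without touching kernel code, which is the kind of composability the section describes.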
If you are building on top of LLMs rather than training them, three implications stand out:
Model selection is more than benchmarks. Two models with similar scores on a benchmark may have very different architectural requirements. One may need 2x the VRAM for the same context length because it uses a different attention variant. Check the architecture, not just the numbers.
Inference libraries are fragmented. llama.cpp, vLLM, TensorRT-LLM, and SGLang each support different model architectures to different degrees. A model's claimed features may not work in your runtime. Test with your actual stack.
Local deployment complexity is rising. Running a 7B model locally in 2023 was straightforward. Running a current MoE model with the same parameter count requires understanding expert routing, activation memory, and potentially multi-GPU splits even at small scale.
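Much of the local-deployment difficulty comes from the gap between the parameters a MoE model must keep in memory and the parameters it uses per token. A rough sketch, using a made-up MoE shape (24 layers, 16 experts, 2 active, 20M parameters per expert) and ignoring attention, embeddings, and any shared experts:

```python
def moe_total_params(n_layers, n_experts, params_per_expert):
    """Expert parameters that must be resident in memory."""
    return n_layers * n_experts * params_per_expert

def moe_active_params(n_layers, k_active, params_per_expert):
    """Expert parameters actually used to process one token."""
    return n_layers * k_active * params_per_expert

# Hypothetical model shape, chosen only for illustration.
N_LAYERS, N_EXPERTS, K_ACTIVE, PARAMS_PER_EXPERT = 24, 16, 2, 20_000_000
BYTES_PER_PARAM = 2  # 16-bit weights

total = moe_total_params(N_LAYERS, N_EXPERTS, PARAMS_PER_EXPERT)
active = moe_active_params(N_LAYERS, K_ACTIVE, PARAMS_PER_EXPERT)

print(f"Resident expert params: {total / 1e9:.2f}B")                      # 7.68B
print(f"Active per token:       {active / 1e9:.2f}B")                     # 0.96B
print(f"Weight memory at 16-bit: {total * BYTES_PER_PARAM / 1e9:.2f} GB")  # 15.36 GB
```

In this toy configuration the model computes like a roughly 1B-parameter network per token but still needs memory for all 7.68B expert parameters, since any expert may be selected for the next token.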
The upside: if architectural complexity is where the gains are, well-optimized inference is a genuine competitive advantage. Companies investing in inference engineering are not just saving compute costs - they are enabling capabilities that would otherwise be impractical.
There is an analogy in frontend development. React started simple - a render function, virtual DOM diffing, done. Then came hooks, concurrent rendering, server components, streaming, suspense boundaries. Each addition solved real problems at scale. Each addition made the mental model more complex.
LLMs are following the same arc, just faster. The transformer paper is 9 years old. The complexity explosion happened in roughly 3 years.
The question Barber poses is whether composable abstractions like FlexAttention can prevent this from becoming unmanageable. The alternative is that only large labs with custom CUDA kernels can push the frontier - a consolidation that would slow down the field.