
TL;DR
Baidu releases Unlimited OCR, an open-source vision-language model that parses 100+ page documents in a single pass without memory blowup. Here's what developers need to know.
Baidu has open-sourced Unlimited OCR, a vision-language model designed to transcribe long documents in a single pass. The project hit the Hacker News front page with 310 points and sparked a surprisingly heated debate about whether OCR is actually a solved problem.
Spoiler: it is not.
The core innovation is architectural. When you feed a 100-page PDF to a typical vision-language model, the KV cache (the model's short-term memory of what it has already transcribed) grows linearly. This means memory consumption explodes and generation slows to a crawl. Developers have historically worked around this by chunking documents page-by-page, processing them separately, and stitching the text back together - a janky approach that loses cross-page context.
Unlimited OCR introduces Reference Sliding Window Attention (R-SWA) to split the model's attention into two paths:
The result: O(1) memory consumption for output generation, regardless of document length.
The model supports two processing modes:
The system implements n-gram repetition blocking (window sizes of 128 for single images, 1024 for multi-page documents) to prevent the model from getting stuck in repetitive output loops.
Requirements:
You can run inference via HuggingFace Transformers directly or deploy an OpenAI-compatible API server via SGLang. The model is available on both HuggingFace (baidu/Unlimited-OCR) and ModelScope.
Get the weekly deep dive
Tutorials on Claude Code, AI agents, and dev tools - delivered free every week.
From the archive
Jun 23, 2026 • 6 min read
Jun 22, 2026 • 5 min read
Jun 22, 2026 • 8 min read
Jun 22, 2026 • 5 min read
The Hacker News discussion immediately split into two camps.
The "OCR is solved" camp pointed to existing vision models:
"OCR has been solved long time ago with vision models. Solutions are consistent, reliable, and stable."
They were quickly corrected by practitioners who work with OCR daily:
"I've been working on Parseur for the last 10 years, and OCR has not been solved yet, let me tell you. OCR still sucks in 2026."
The most nuanced take came from commenters who distinguished between different OCR tasks:
"If you try to OCR hand-filled forms with a fixed structure, traditional OCR models are great. But if you are trying to ingest diverse documents with headings, multi-column layouts, headers and footers, ad space in the middle of your text, etc, vision-llms are a giant step forward."
The "OCR is too expensive" camp raised valid concerns about cost and speed:
"Traditional OCR is faster, cheaper, and much more reliable than LLMs"
But others countered with specific examples where LLM-based OCR works:
"I found that feeding it to Claude Sonnet 4.x via API gave me results that were perfect. No corrections required. So perfect, that Claude was reading along with the story, and actually pointed out a continuity error in the story."
The hallucination concern was raised multiple times. LLM-based OCR can "improve" text by filling in what it thinks should be there rather than what the document actually says:
"A simple example is words that are supposed to be in other languages being automatically translated to English, which ruins the effect"
One commenter reported Unlimited OCR working well for Japanese grammar PDFs: "It has converted about 200 pages in an hour" on a 4090.
The music notation thread was unexpected. A jazz musician derailed the conversation into a discussion of optical music recognition (OMR), which remains far behind text OCR. LLMs understand music theory well enough when described in text, but reading sheet music is still "basically a greenfield for AI wherever you look."
Based on the discussion, here is a rough hierarchy:
For simple, structured forms with clean scans:
For complex documents with mixed layouts:
For multi-page documents where context matters:
For production pipelines:
The paper (arXiv:2606.23050) acknowledges that using an LLM decoder for OCR is a double-edged sword. The language prior helps correct errors and handle degraded inputs, but the accumulated memory load makes long documents impractical. R-SWA is an elegant architectural fix that keeps the benefits while eliminating the scaling problem.
For developers building document processing pipelines, this is another tool in the toolkit. The fact that it is open-source (Apache 2.0) and runnable locally is significant - you can actually iterate on edge cases rather than filing support tickets.
The HN discussion also surfaced a useful tool comparison: Marker, Mistral OCR, Docling, Azure Document Intelligence, Textract, and now Unlimited OCR. If you are in the market for document parsing, run your actual documents through a few of these before committing.
Read next
A YC W25 startup open-sources CADAM, a browser-based tool that converts natural language to parametric OpenSCAD models. HN debate: is text-to-CAD genuinely useful or just another demo?
6 min readEpic Games open-sourced Lore, a centralized version control system designed for binary-heavy game projects. It uses Merkle trees, on-demand file hydration, and native chunked storage to handle terabyte-scale repos that Git struggles with.
7 min readModern LLMs now use MoE routing, mixed attention variants, and fused vision encoders. The simple transformer stack is gone - here's what replaced it and why it matters for developers.
6 min readTechnical content at the intersection of AI and development. Building with AI agents, Claude Code, and modern dev tools - then showing you exactly how it works.
DeepSeek's reasoning-first model built for agents. First model to integrate thinking directly into tool use. Ships along...
View ToolOpen-source autonomous coding agent inside VS Code. Creates files, runs commands, and can use a browser for UI testing a...
View ToolGoogle's open-source coding CLI. Free tier with Gemini 2.5 Pro. Supports tool use, file editing, shell commands. 1M toke...
View ToolOpen-source terminal agent runtime with approval modes, rollback snapshots, MCP servers, LSP diagnostics, and a headless...
View ToolTrack open-source maintenance signals, release tasks, and repo follow-ups in one dashboard.
View AppShare agent traces with a link. Keep history long enough to find the bug.
View AppTurn API documentation and OpenAPI specs into typed SDK plans and demo checklists.
View AppInstall the dd CLI and scaffold your first AI-powered app in under a minute.
Getting StartedConfigure Claude Code for maximum productivity -- CLAUDE.md, sub-agents, MCP servers, and autonomous workflows.
AI AgentsWhat MCP servers are, how they work, and how to build your own in 5 minutes.
AI Agents
Switzerland's fully open foundation model promises transparent training data and EU compliance. The HN crowd has questio...

The new wrangler deploy --temporary flag creates ephemeral Cloudflare accounts for AI agents. 60-minute deployments, no...

Modern LLMs now use MoE routing, mixed attention variants, and fused vision encoders. The simple transformer stack is go...

A YC W25 startup open-sources CADAM, a browser-based tool that converts natural language to parametric OpenSCAD models....

The Transformer co-creator leaves Google DeepMind for OpenAI just two years after Google paid $2.7 billion to bring him...

Epic Games open-sourced Lore, a centralized version control system designed for binary-heavy game projects. It uses Merk...

New tutorials, open-source projects, and deep dives on coding agents - delivered weekly.