Unlimited OCR: Baidu's Open-Source Solution for Long Document Parsing

Baidu has open-sourced Unlimited OCR, a vision-language model designed to transcribe long documents in a single pass. The project hit the Hacker News front page with 310 points and sparked a surprisingly heated debate about whether OCR is actually a solved problem.

Spoiler: it is not.

What Unlimited OCR Actually Does

The core innovation is architectural. When you feed a 100-page PDF to a typical vision-language model, the KV cache (the model's short-term memory of what it has already transcribed) grows linearly. This means memory consumption explodes and generation slows to a crawl. Developers have historically worked around this by chunking documents page-by-page, processing them separately, and stitching the text back together - a janky approach that loses cross-page context.

Unlimited OCR introduces Reference Sliding Window Attention (R-SWA) to split the model's attention into two paths:

Global Reference: The model maintains full, uncompromised sight of the original document image. Context is never lost.
Local Generation: The model restricts its memory of its own output to a tight, moving window (roughly the last 128 words) and forgets the rest.

The result: O(1) memory consumption for output generation, regardless of document length.

Technical Details

The model supports two processing modes:

Gundam config: base_size=1024, image_size=640, crop_mode enabled - optimized for multi-page documents
Base config: base_size=1024, image_size=1024 - standard processing for single images

The system implements n-gram repetition blocking (window sizes of 128 for single images, 1024 for multi-page documents) to prevent the model from getting stuck in repetitive output loops.

Requirements:

Python 3.12.3
CUDA 12.9
PyTorch 2.10.0
Transformers 4.57.1
An NVIDIA GPU with sufficient VRAM

You can run inference via HuggingFace Transformers directly or deploy an OpenAI-compatible API server via SGLang. The model is available on both HuggingFace (baidu/Unlimited-OCR) and ModelScope.

Get the weekly deep dive

Tutorials on Claude Code, AI agents, and dev tools - delivered free every week.

From the archive

VibeThinker-3B: A 3 Billion Parameter Model That Outscores Opus 4.5 on Reasoning

Jun 23, 2026 • 6 min read

Claude Code's Extended Thinking Is a Summary - What That Means for You

Jun 22, 2026 • 5 min read

Codex CLI Needs Resource Budgets, Not Just Token Budgets

Jun 22, 2026 • 8 min read

Codex Logging Bug Can Write Terabytes to Your SSD

Jun 22, 2026 • 5 min read

What HN Is Saying

The Hacker News discussion immediately split into two camps.

The "OCR is solved" camp pointed to existing vision models:

"OCR has been solved long time ago with vision models. Solutions are consistent, reliable, and stable."

They were quickly corrected by practitioners who work with OCR daily:

"I've been working on Parseur for the last 10 years, and OCR has not been solved yet, let me tell you. OCR still sucks in 2026."

The most nuanced take came from commenters who distinguished between different OCR tasks:

"If you try to OCR hand-filled forms with a fixed structure, traditional OCR models are great. But if you are trying to ingest diverse documents with headings, multi-column layouts, headers and footers, ad space in the middle of your text, etc, vision-llms are a giant step forward."

The "OCR is too expensive" camp raised valid concerns about cost and speed:

"Traditional OCR is faster, cheaper, and much more reliable than LLMs"

But others countered with specific examples where LLM-based OCR works:

"I found that feeding it to Claude Sonnet 4.x via API gave me results that were perfect. No corrections required. So perfect, that Claude was reading along with the story, and actually pointed out a continuity error in the story."

The hallucination concern was raised multiple times. LLM-based OCR can "improve" text by filling in what it thinks should be there rather than what the document actually says:

"A simple example is words that are supposed to be in other languages being automatically translated to English, which ruins the effect"

One commenter reported Unlimited OCR working well for Japanese grammar PDFs: "It has converted about 200 pages in an hour" on a 4090.

The music notation thread was unexpected. A jazz musician derailed the conversation into a discussion of optical music recognition (OMR), which remains far behind text OCR. LLMs understand music theory well enough when described in text, but reading sheet music is still "basically a greenfield for AI wherever you look."

The Real Landscape: What to Use When

Based on the discussion, here is a rough hierarchy:

For simple, structured forms with clean scans:

Tesseract or PaddleOCR (also from Baidu) work well
Fast, cheap, deterministic

For complex documents with mixed layouts:

Azure Document Intelligence or AWS Textract are the commercial leaders
About 85% accuracy - "but you have to run a test because they both fail in different ways on the 15%"

For multi-page documents where context matters:

Unlimited OCR is worth testing
Marker (with --force-ocr) gets good results
Mistral OCR 4 (released today - timing)

For production pipelines:

Expect to validate. Every OCR tool fails on edge cases
Build verification into your pipeline, not just blind extraction

Why This Matters

The paper (arXiv:2606.23050) acknowledges that using an LLM decoder for OCR is a double-edged sword. The language prior helps correct errors and handle degraded inputs, but the accumulated memory load makes long documents impractical. R-SWA is an elegant architectural fix that keeps the benefits while eliminating the scaling problem.

For developers building document processing pipelines, this is another tool in the toolkit. The fact that it is open-source (Apache 2.0) and runnable locally is significant - you can actually iterate on edge cases rather than filing support tickets.

The HN discussion also surfaced a useful tool comparison: Marker, Mistral OCR, Docling, Azure Document Intelligence, Textract, and now Unlimited OCR. If you are in the market for document parsing, run your actual documents through a few of these before committing.

Sources

Unlimited-OCR GitHub Repository
Hacker News Discussion
arXiv Paper: 2606.23050
Unsloth Documentation for GLM models (related model tooling)

What Unlimited OCR Actually Does

Technical Details

VibeThinker-3B: A 3 Billion Parameter Model That Outscores Opus 4.5 on Reasoning

Claude Code's Extended Thinking Is a Summary - What That Means for You

Codex CLI Needs Resource Budgets, Not Just Token Budgets

Codex Logging Bug Can Write Terabytes to Your SSD

What HN Is Saying

The Real Landscape: What to Use When

Why This Matters

Sources

Adam (YC W25): Open Source AI CAD That Generates OpenSCAD from Text

Epic Games Releases Lore: A Version Control System Built for Game Development

LLM Architectures Got Complicated Fast

Related Tools

DeepSeek V3.2

Cline

Gemini CLI

DeepSeek-TUI

Apps from Developers Digest

Maintainer Dashboard

TraceTrail Plus

API Docs Kit

Related Guides

Getting Started with DevDigest CLI

Claude Code Setup Guide

MCP Servers Explained

Related Posts

Apertus: Europe's Answer to AI Sovereignty - and Why HN Is Skeptical

Cloudflare Now Lets AI Agents Deploy Workers Without Signup

LLM Architectures Got Complicated Fast

Adam (YC W25): Open Source AI CAD That Generates OpenSCAD from Text

Noam Shazeer Joins OpenAI After Two Years Back at Google

Epic Games Releases Lore: A Version Control System Built for Game Development

Get Smarter About AI Dev

What Unlimited OCR Actually Does

Technical Details

VibeThinker-3B: A 3 Billion Parameter Model That Outscores Opus 4.5 on Reasoning

Claude Code's Extended Thinking Is a Summary - What That Means for You

Codex CLI Needs Resource Budgets, Not Just Token Budgets

Codex Logging Bug Can Write Terabytes to Your SSD

What HN Is Saying

The Real Landscape: What to Use When

Why This Matters

Sources

Adam (YC W25): Open Source AI CAD That Generates OpenSCAD from Text

Epic Games Releases Lore: A Version Control System Built for Game Development

LLM Architectures Got Complicated Fast

Related Tools

DeepSeek V3.2

Cline

Gemini CLI

DeepSeek-TUI

Apps from Developers Digest

Maintainer Dashboard

TraceTrail Plus

API Docs Kit

Related Guides

Getting Started with DevDigest CLI

Claude Code Setup Guide

MCP Servers Explained

Related Posts

Apertus: Europe's Answer to AI Sovereignty - and Why HN Is Skeptical

Cloudflare Now Lets AI Agents Deploy Workers Without Signup

LLM Architectures Got Complicated Fast

Adam (YC W25): Open Source AI CAD That Generates OpenSCAD from Text

Noam Shazeer Joins OpenAI After Two Years Back at Google

Epic Games Releases Lore: A Version Control System Built for Game Development

Get Smarter About AI Dev