
TL;DR
SNEWPAPERS, a Show HN launch, is a useful signal: the strongest agentic search products do not replace search results with prose. They teach the agent to operate a real search system.
One of the more interesting Show HN launches today was not a coding tool. It was SNEWPAPERS, a historical newspaper archive built around full-text extraction, semantic search, and an agentic search assistant.
The author says they extracted more than 600,000 newspaper pages from the Chronicling America collection, about 5TB of source material, then built a pipeline for layout segmentation, classification, OCR, semantic indexing, and query assistance.
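That stage list can be read as a simple composition of functions over a page record. Here is a minimal sketch of the shape of such a pipeline; the stage implementations, field names, and page IDs are all illustrative stand-ins, not the author's code:

```python
# Sketch of a page-processing pipeline in the spirit of the post:
# segment -> classify -> OCR -> index. Every stage is a stub.

def segment(page: dict) -> dict:
    # A real system runs a layout model; here we fake one region.
    page["regions"] = [{"kind": "article", "pixels": page["scan"]}]
    return page

def classify(page: dict) -> dict:
    # Keep only regions worth OCRing (drop ads, mastheads, etc.).
    page["regions"] = [r for r in page["regions"] if r["kind"] == "article"]
    return page

def ocr(page: dict) -> dict:
    # Stand-in OCR: a real system would decode the region pixels.
    for r in page["regions"]:
        r["text"] = str(r["pixels"])
    return page

def index(page: dict, store: dict) -> None:
    # Semantic indexing reduced to a keyed text store for the sketch.
    store[page["id"]] = " ".join(r["text"] for r in page["regions"])

def run_pipeline(pages: list[dict], store: dict) -> None:
    for page in pages:
        index(ocr(classify(segment(page))), store)

store = {}
run_pipeline([{"id": "sn-1890-01-01-p1", "scan": "raw-scan-bytes"}], store)
```

The point of the shape, not the stubs: each stage takes the previous stage's output, so any one stage can be replaced with a better model without touching the others.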
That is a big data project. But the product lesson is more specific:
Agentic search works best when the agent writes queries, not when it replaces the search system with a paragraph.
That sounds small. It is not.
Most "AI search" products collapse three jobs into one chat box: formulating the query, running the retrieval, and judging the results.
That is fine for simple lookups. It gets shaky when the corpus is messy, historical, huge, domain-specific, or full of source ambiguity.
SNEWPAPERS points at a better pattern: let the agent help the user operate the search system, then keep the actual search interface and source documents visible.
Historical newspapers are a brutal corpus.
The source pages have columns, broken scans, advertisements, small headlines, multiple article fragments, OCR errors, page furniture, different eras of typography, and weird layout conventions. A keyword search over raw scans produces noise. A pure semantic search can miss exact names, dates, places, or spellings. A chat-only abstraction can hide too much evidence.
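This is why such corpora usually want hybrid retrieval: exact matching for names, dates, and spellings, plus a looser similarity signal for everything else. A toy sketch of the idea, where token overlap stands in for an embedding score and the weighting scheme is purely illustrative:

```python
# Hybrid retrieval sketch: exact phrase hits plus a loose
# similarity proxy. Scoring is illustrative, not SNEWPAPERS's.

def keyword_score(query: str, doc: str) -> float:
    # Exact phrase match catches names, dates, and spellings.
    return 1.0 if query.lower() in doc.lower() else 0.0

def fuzzy_score(query: str, doc: str) -> float:
    # Token-overlap stand-in for an embedding similarity score.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def hybrid_search(query: str, docs: dict[str, str], alpha: float = 0.5):
    scored = {
        doc_id: alpha * keyword_score(query, text)
        + (1 - alpha) * fuzzy_score(query, text)
        for doc_id, text in docs.items()
    }
    return sorted(scored, key=scored.get, reverse=True)

docs = {
    "a": "Depositors rushed the First National Bank on Tuesday",
    "b": "The county fair opened with a parade",
}
ranked = hybrid_search("bank run depositors", docs)  # "a" ranks first
```

Neither signal alone suffices on OCR-damaged text; blending them lets exact hits dominate when they exist and the fuzzy score carry the rest.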
The SNEWPAPERS Show HN post describes a multi-model pipeline: layout segmentation, article classification, OCR, semantic indexing, and an assistant that helps users formulate queries.
That last part is the product move.
The assistant is not just answering from a black box. It helps users formulate searches, then the user can inspect the saved queries and continue exploring the results.
That keeps the archive in the loop.
The Hacker News comments were positive but practical. One commenter who works with complicated datasets said the hard part is often UI: even experienced search people struggle to see how they would use a large corpus until they can try a focused slice. They suggested making a small public segment immediately searchable without registration, such as one year of Olympic coverage.
That is the right critique.
When the dataset is this large, "we have the archive" is not enough. The product has to give users a starting wedge.
For agentic search, the first interaction matters more than the model quality. A user needs to see the query the agent actually ran, the filters it applied, and the sources behind each result.
If the agent hides that trail, the user gets a confident answer and no search skill. If the agent exposes the trail, the user gets leverage.
There is a pattern here that applies beyond newspapers.
For domain search, the best agent is often a query planner:
User intent:
    Find early newspaper coverage of bank runs in rural towns.

Agent output:
    Search query:
        ("bank run" OR "run on the bank" OR "depositors rushed")
    Filters:
        year: 1890-1935
        publication type: local newspaper
        section: news
    Follow-up:
        Search by bank name once candidate towns appear.
That is more useful than a prose answer if the user is doing real research.
The agent turns vague intent into a search strategy. The search engine does retrieval. The UI shows sources. The user keeps judgment.
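The agent's output in the example above is really a structured plan, not prose, which means it can be represented, saved, and replayed as data. A sketch of that representation; the class and field names are illustrative, not the SNEWPAPERS schema:

```python
from dataclasses import dataclass, field

@dataclass
class SearchPlan:
    """A query plan the agent hands to the search engine.
    Field names are illustrative, not a real product schema."""
    query: str
    filters: dict[str, str] = field(default_factory=dict)
    follow_up: str = ""

    def to_query_string(self) -> str:
        # Render the plan for a Lucene-style fielded-query backend.
        parts = [self.query]
        parts += [f'{k}:"{v}"' for k, v in self.filters.items()]
        return " AND ".join(parts)

plan = SearchPlan(
    query='("bank run" OR "run on the bank" OR "depositors rushed")',
    filters={"year": "1890-1935", "section": "news"},
    follow_up="Search by bank name once candidate towns appear.",
)
rendered = plan.to_query_string()
```

Because the plan is data, the UI can show it, the user can edit it, and the session can store it, which is exactly what keeps the trail visible.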
This division of labor is cleaner than "chat with all documents." It also scales better across messy corpora because each layer can be improved independently.
OCR can get better. Layout extraction can get better. The search index can get better. The query planner can get better. The result UI can get better. None of those improvements require pretending the model is the archive.
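That independence is easiest to see as interfaces: each layer is a narrow contract, so any implementation behind it can be swapped. A minimal sketch using `typing.Protocol`, with trivial stand-in implementations (all names are hypothetical):

```python
from typing import Protocol

class Planner(Protocol):
    def plan(self, intent: str) -> str: ...

class Index(Protocol):
    def search(self, query: str) -> list[str]: ...

def run_search(intent: str, planner: Planner, index: Index) -> list[str]:
    # The planner writes the query; the index does retrieval.
    # Either side can be upgraded without touching the other.
    return index.search(planner.plan(intent))

class QuotePlanner:
    # Trivial planner: quote the intent verbatim.
    def plan(self, intent: str) -> str:
        return f'"{intent}"'

class DictIndex:
    # Trivial index: substring match over an in-memory corpus.
    def __init__(self, docs: dict[str, str]):
        self.docs = docs
    def search(self, query: str) -> list[str]:
        term = query.strip('"').lower()
        return [doc_id for doc_id, text in self.docs.items()
                if term in text.lower()]

hits = run_search("bank run", QuotePlanner(),
                  DictIndex({"p1": "A bank run in 1893 emptied the vault"}))
```

Swap `QuotePlanner` for an LLM-backed planner or `DictIndex` for a real search engine and `run_search` does not change, which is the whole argument for the layered design.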
RAG builders should pay attention.
A lot of RAG apps are designed as answer machines. The user asks a question. The system retrieves chunks. The model writes an answer. Maybe citations appear at the bottom.
That is useful for support docs and narrow knowledge bases.
For exploratory research, it is often the wrong primitive.
Exploratory search needs visible queries, adjustable filters, inspectable sources, and a cheap way to iterate.
An agent can help drive those controls. It should not erase them.
SNEWPAPERS is interesting because the assistant sits on top of an actual search product. It can help you ask better questions without making the result page irrelevant.
That is the architecture I would copy.
The risk is onboarding.
Large archives need a fast proof moment. If users have to register, invent a query, understand the corpus, and interpret a result set before they feel the product, many will leave before the agent can help.
The HN suggestion of public slices is strong because it narrows the first run: a small, registration-free segment with an obvious topic, such as one year of Olympic coverage, lets a user feel the product before committing to it.
For an archive product, that is not marketing fluff. It is core UX. The product has to teach users what kind of questions the corpus can answer.
Agentic search can help, but only if it starts from concrete examples.
The durable idea in SNEWPAPERS is not "AI reads old newspapers."
It is that agentic search should make the underlying search system more usable.
The agent should translate intent into queries, propose filters, preserve search history, surface source evidence, and help users iterate. The answer can come later. In serious research products, the trail is often more valuable than the summary.
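Preserving the trail is cheap to build in from the start. A sketch of a search session that records every query the agent runs alongside its hit count, so the history itself becomes a product surface (the class and matching logic are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class SearchSession:
    """Keeps every query the agent ran, so the trail stays visible."""
    history: list[tuple[str, int]] = field(default_factory=list)

    def run(self, query: str, corpus: dict[str, str]) -> list[str]:
        hits = [doc_id for doc_id, text in corpus.items()
                if query.lower() in text.lower()]
        self.history.append((query, len(hits)))  # preserve the trail
        return hits

corpus = {"p1": "Run on the bank in Dayton", "p2": "County fair opens"}
session = SearchSession()
session.run("bank", corpus)
session.run("fair", corpus)
```

A user who can replay `session.history` learns what the corpus rewards; a user who only sees a summary learns nothing about how to search it.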
This is the same pattern developers should use in internal tools, legal search, enterprise knowledge bases, observability, security investigations, and research assistants.
Do not make the model pretend to be the database.
Teach it to operate the database well.