AI translation of literary texts is "fine", but readers still prefer human translations

Authors: Yves Ferstler, Adam Podoxin, Ty Brassington, Roman Grundkiewicz, Maite Taboada, Marzena Karpinska

arXiv ID: 2606.26040

Problem: Do readers experience literary machine translations (MT) as immersively as human translations (HT), and can automatic metrics reliably capture this preference?

Key Methodology:

15 avid readers compared human and machine translations of 8K-word excerpts from 15 recent novels (French, Polish, Japanese → English), yielding 30 excerpt-level and 772 chunk-level paired comparisons
Used an agentic LLM-based pipeline for MT generation, with 2 readers per book and alternating presentation order; readers also performed a "guess which is MT" task
Released the LAIT dataset (1K reader comments, 2K preference judgments, 7.2K span-level annotations), along with a supporting evaluation protocol and interface

Key Results:

Readers preferred HT at the excerpt level (19/30, about 63%) and more clearly at the chunk level (522/772, about 68%)
Readers correctly identified the MT version in only 17 of 30 cases (about 57%, barely above the 50% chance level)
Readers tended to prefer whichever version they believed was human
MT quality varied more within a single book than HT quality did
Automatic metrics (including LLM-as-a-judge) failed to recover reader preferences and systematically favored MT
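
To make "barely above chance" concrete, the counts above can be compared against a 50/50 baseline with an exact two-sided binomial test. This is a minimal sketch, not the paper's own analysis: the function name `two_sided_p_vs_chance` is hypothetical, and the test assumes every comparison is independent, which is unlikely to hold for chunks drawn from the same excerpt and reader.

```python
from math import comb


def two_sided_p_vs_chance(k: int, n: int) -> float:
    """Exact two-sided binomial p-value for k successes in n trials under p = 0.5.

    Because the null distribution is symmetric at p = 0.5, the two-sided
    p-value is twice the upper tail starting at max(k, n - k), capped at 1.
    """
    if not 0 <= k <= n:
        raise ValueError("need 0 <= k <= n")
    extreme = max(k, n - k)
    tail = sum(comb(n, i) for i in range(extreme, n + 1))
    # Integer arithmetic keeps the tail exact; only the final division is float.
    return min(1.0, 2 * tail / 2**n)


# Counts as reported in the summary above.
results = {
    "HT preferred, excerpt level": (19, 30),
    "HT preferred, chunk level": (522, 772),
    "MT correctly identified": (17, 30),
}

for label, (k, n) in results.items():
    print(f"{label}: {k}/{n} = {k / n:.1%}, p = {two_sided_p_vs_chance(k, n):.3g}")
```

With only 30 excerpt-level comparisons, a 19/30 or 17/30 split is hard to separate from chance under this simple test, which is consistent with the summary describing the chunk-level preference as the clearer signal. A mixed-effects model that accounts for reader and book would be the more appropriate tool for the chunk-level data.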

Applied Context: Builders should not rely on LLM-as-a-judge or standard MT metrics to evaluate literary translations. Reader studies with immersion-focused protocols are necessary, and a "good enough" MT output can still lose to human translation on narrative flow and clarity.
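
One way a builder might check a metric against reader judgments before trusting it is to measure how often the metric's preferred version matches the reader's, and how often the metric picks MT compared with how often readers do. The sketch below is illustrative only: `PairedJudgment`, its fields, and `metric_vs_readers` are hypothetical names, not the LAIT schema, and the toy values are invented solely to exercise the function.

```python
from dataclasses import dataclass


@dataclass
class PairedJudgment:
    reader_choice: str  # "HT" or "MT": the version the reader preferred
    metric_ht: float    # metric score for the human translation
    metric_mt: float    # metric score for the machine translation


def metric_vs_readers(pairs):
    """Return (agreement_rate, metric_mt_rate, reader_mt_rate).

    Pairs where the metric scores both versions equally are skipped,
    since the metric expresses no preference there.
    """
    decided = [p for p in pairs if p.metric_ht != p.metric_mt]
    if not decided:
        raise ValueError("no pairs with a metric preference")
    metric_choice = ["MT" if p.metric_mt > p.metric_ht else "HT" for p in decided]
    n = len(decided)
    agree = sum(m == p.reader_choice for m, p in zip(metric_choice, decided))
    reader_mt = sum(p.reader_choice == "MT" for p in decided)
    return agree / n, metric_choice.count("MT") / n, reader_mt / n


# Toy, made-up inputs purely to exercise the function; NOT data from the paper.
toy = [
    PairedJudgment("HT", 0.71, 0.80),
    PairedJudgment("HT", 0.65, 0.70),
    PairedJudgment("MT", 0.60, 0.75),
    PairedJudgment("HT", 0.82, 0.78),
]

agreement, metric_mt, reader_mt = metric_vs_readers(toy)
print(f"agreement={agreement:.2f} metric_prefers_MT={metric_mt:.2f} "
      f"readers_prefer_MT={reader_mt:.2f}")
```

A metric that systematically favors MT would show a metric MT rate well above the reader MT rate, alongside an agreement rate close to or below 50%.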