LLM / Agent / Generation
AI translation of literary texts is "fine", but readers still prefer human translations
Authors: Yves Ferstler, Adam Podoxin, Ty Brassington, Roman Grundkiewicz, Maite Taboada, Marzena Karpinska
arXiv ID: 2606.26040
Problem: Do readers find literary machine translations (MT) as immersive as human translations (HT), and can automatic metrics reliably capture which version readers prefer?
Key Methodology:
- 15 avid readers compared 8K-word excerpts from 15 recent novels (French, Polish, Japanese → English) across 30 excerpt-level and 772 chunk-level paired comparisons
- MT was generated with an agentic LLM-based pipeline; each book was read by 2 readers, with presentation order alternated; readers also performed a "guess which is MT" task
- Released the LAIT dataset (1K reader comments, 2K preference judgments, 7.2K span-level annotations) along with a supporting evaluation protocol and interface
Key Results:
- Readers preferred HT at the excerpt level (19/30, about 63%) and more clearly at the chunk level (522/772, about 68%)
- Readers correctly identified the MT version in only 17 of 30 cases (about 57%, barely above chance)
- Readers tended to prefer whichever version they believed was human
- MT quality varied more within a single book than HT quality did
- Automatic metrics (including LLM-as-a-judge) failed to recover reader preferences and systematically favored MT
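To put the counts above in perspective, here is a minimal sketch of an exact two-sided binomial test against a 50/50 null. It is not the paper's analysis: the function name `two_sided_binomial_p` is illustrative, and the test assumes every comparison is independent, which is only a rough approximation because the same readers contributed many chunk-level judgments.

```python
from math import comb


def two_sided_binomial_p(successes: int, trials: int, p_null: float = 0.5) -> float:
    """Exact two-sided binomial p-value.

    Sums the probability of every outcome that is no more likely than the
    observed one under the null hypothesis.
    """
    pmf = [
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(trials + 1)
    ]
    cutoff = pmf[successes] * (1 + 1e-9)  # small tolerance for float ties
    return min(1.0, sum(prob for prob in pmf if prob <= cutoff))


# Counts taken from the Key Results list; the independence assumption is ours.
results = {
    "HT preferred, excerpt level": (19, 30),
    "HT preferred, chunk level": (522, 772),
    "MT correctly identified": (17, 30),
}

for label, (k, n) in results.items():
    p_value = two_sided_binomial_p(k, n)
    print(f"{label}: {k}/{n} = {k / n:.1%}, p = {p_value:.3g}")
```

Under these assumptions, neither 30-comparison count differs significantly from chance at the 0.05 level, while the chunk-level preference for HT does. This fits the summary's reading that the preference is clearer at the chunk level and that MT identification is close to guessing.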
Applied Context: Builders should not rely on LLM-as-a-judge or standard MT metrics to evaluate literary translations. Reader studies with immersion-focused protocols are necessary, and a "good enough" MT output can still lose to human translation on narrative flow and clarity.
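One way to check an automatic metric against a reader study is to measure how often the metric's pairwise choice matches the reader's, and how often it picks MT. The sketch below is a hypothetical illustration: `PairedComparison`, `summarize`, and the toy scores are invented for this example and do not come from the LAIT dataset or the paper's protocol.

```python
from dataclasses import dataclass


@dataclass
class PairedComparison:
    reader_choice: str      # "HT" or "MT", as judged by a human reader
    metric_score_ht: float  # metric score for the human translation
    metric_score_mt: float  # metric score for the machine translation


def metric_choice(pair: PairedComparison) -> str:
    """Return which version the metric scores higher, or "tie"."""
    if pair.metric_score_ht > pair.metric_score_mt:
        return "HT"
    if pair.metric_score_mt > pair.metric_score_ht:
        return "MT"
    return "tie"


def summarize(pairs: list[PairedComparison]) -> dict[str, float]:
    """Agreement with readers, plus how often each side picks MT."""
    if not pairs:
        raise ValueError("need at least one paired comparison")
    n = len(pairs)
    choices = [metric_choice(p) for p in pairs]
    return {
        "agreement": sum(c == p.reader_choice for c, p in zip(choices, pairs)) / n,
        "metric_mt_rate": choices.count("MT") / n,
        "reader_mt_rate": sum(p.reader_choice == "MT" for p in pairs) / n,
    }


# Toy data for illustration only (not real study results).
toy_pairs = [
    PairedComparison("HT", 0.81, 0.84),
    PairedComparison("HT", 0.79, 0.77),
    PairedComparison("MT", 0.70, 0.75),
    PairedComparison("HT", 0.66, 0.72),
]
summary = summarize(toy_pairs)
print(summary)
```

A metric whose `metric_mt_rate` sits well above the `reader_mt_rate`, with agreement near 0.5, would show the kind of systematic bias toward MT described in the results.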