Multimodal / Agent / Robotics
DataEvolver: Self-Evolving Multi-Agent Data Construction for Text-Rich Image Generation
Authors: Siyu Yan, Yizhen Gao, Yilin Wang, Dongxing Mao, Alex Jinpeng Wang (CSU, HKUST)
arXiv ID: 2606.31537
Problem: Existing text-rich image data pipelines follow a static crawl-filter-freeze paradigm that discards rejected samples, wasting failure signals (OCR errors, semantic mismatches) that could inform later construction rounds.
Key Methodology:
- Four-agent closed-loop framework: Retriever (candidate collection), Verifier (quality scoring + rejection causes), Critic (condenses each round's rejection causes into natural-language semantic feedback), Generator (targeted synthetic completion for under-covered regions)
- Treats data construction as feedback-driven policy evolution: rejection patterns are converted into reusable semantic feedback that revises retrieval queries, generation prompts, and an experience memory
- Maintains an experience library of high-value feedback and effective query patterns, updated each round to avoid redundant failures
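The closed loop described above can be sketched as a toy Python program. This is a minimal illustration under stated assumptions, not the authors' implementation: every name here (`Sample`, `Memory`, `retrieve`, `verify`, `critique`, `generate`, `run`) is hypothetical, the OCR confidence is a stored number rather than a real OCR call, and the 0.90 acceptance cutoff is an illustrative choice rather than the paper's acceptance rule.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Sample:
    text: str        # text the image is supposed to render
    ocr_conf: float  # stand-in for a verifier's OCR confidence (assumed)
    source: str = "retrieved"


@dataclass
class Memory:
    """Hypothetical experience library carried across rounds."""
    feedback: list = field(default_factory=list)    # one Critic note per round
    failed_texts: set = field(default_factory=set)  # texts not worth retrieving again


def retrieve(pool, memory):
    """Retriever: collect candidates, skipping texts that already failed."""
    return [s for s in pool if s.text not in memory.failed_texts]


def verify(samples, min_conf=0.90):
    """Verifier: split into accepted samples and (sample, cause) rejections."""
    accepted, rejected = [], []
    for s in samples:
        if s.ocr_conf >= min_conf:
            accepted.append(s)
        else:
            rejected.append((s, "low_ocr_confidence"))
    return accepted, rejected


def critique(rejected):
    """Critic: summarize the round's rejection causes as one feedback string."""
    counts = Counter(cause for _, cause in rejected)
    return "; ".join(f"{n} rejected for {cause}" for cause, n in sorted(counts.items()))


def generate(rejected):
    """Generator: synthesize a replacement for each rejected text (toy rule)."""
    return [Sample(s.text, 0.95, source="synthetic") for s, _ in rejected]


def run(pool, rounds=2):
    memory, dataset = Memory(), []
    for _ in range(rounds):
        accepted, rejected = verify(retrieve(pool, memory))
        if rejected:
            # Rejections become feedback and update the memory instead of being dropped.
            memory.feedback.append(critique(rejected))
            memory.failed_texts.update(s.text for s, _ in rejected)
        # Synthetic samples pass through the same Verifier before being kept.
        synthetic, _ = verify(generate(rejected))
        for s in accepted + synthetic:
            if s not in dataset:
                dataset.append(s)
    return dataset, memory
```

In this sketch the memory only blocks repeated retrieval of failed texts; in the framework as summarized, the feedback also revises retrieval queries and generation prompts, which a real system would implement with model calls rather than a fixed rule.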
Key Results:
- At the 0.75M scale on PixArt-α, DataEvolver improves OCR-F1 over the strongest baseline by 85.3% on TextScenesHQ (4.56 → 8.45) and 35.3% on LongTextBench (6.71 → 9.08)
- Benefits transfer to Show-o2: F1 rises from 0.19 to 0.45 on TextScenesHQ and from 0.27 to 0.44 on LongTextBench
- Ablation (0.1M scale): removing the Critic drops F1 from 1.78 to 1.01; removing the Generator drops it from 1.78 to 1.40
- The Critic raises the mean OCR confidence of accepted samples from 0.861 to 0.938, and the share of high-confidence samples (≥0.90) from 29.1% to 81.1%
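The relative improvements quoted for PixArt-α follow from the standard formula (new − baseline) / baseline. The short check below recomputes them from the OCR-F1 pairs given above; the helper name `relative_gain` is an assumption introduced only for illustration.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# (strongest baseline, DataEvolver) OCR-F1 pairs at the 0.75M scale, as quoted above.
reported = {
    "TextScenesHQ": (4.56, 8.45),
    "LongTextBench": (6.71, 9.08),
}

gains = {name: round(relative_gain(b, m), 1) for name, (b, m) in reported.items()}
for name, pct in gains.items():
    print(f"{name}: +{pct}%")
```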
Applied Context: For builders, DataEvolver shows that a self-improving data pipeline, in which rejected samples become actionable feedback rather than waste, directly improves text rendering quality in generated images without modifying the architecture of the image generation model. This feedback-driven construction paradigm is general and could be applied to other multimodal data domains.