Autonomous Scientific Discovery via Iterative Meta-Reflection

Authors: Bingchen Zhao, Sara Beery, Oisin Mac Aodha (University of Edinburgh, MIT)

arXiv ID: 2607.01131

Problem: Current AI-driven scientific discovery systems are constrained to predefined search spaces or require externally supplied research questions, preventing true open-ended inquiry where a system autonomously explores raw multimodal data without guidance.

Key Methodology:

DiscoPER framework - an LLM-powered agent that iteratively proposes hypotheses as executable Python code, runs statistical tests on training and held-out splits, and accepts only claims that pass both the effect-size (|δ| ≥ 0.2) and significance (p ≤ 0.05) thresholds on both splits.
Reflect module - a second-order meta-reasoning mechanism that periodically analyzes the system's own accumulated accepted and rejected claims to detect gaps, confounds, and compound hypotheses, producing structured guidance that redirects subsequent exploration toward under-explored regions of the hypothesis space.
Multimodal tool use - the system can query vision-language models (VLMs) to extract visual attributes from images (e.g., habitat type, canopy density), enabling hypotheses beyond what tabular metadata alone can support.
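
The dual-threshold acceptance gate described above can be sketched with the Python standard library. This is a minimal sketch under stated assumptions, not the authors' code: the summary does not name the effect-size measure behind δ or the test behind p, so the sketch assumes Cohen's d (pooled standard deviation) and a two-sided permutation test on the difference in means. It also assumes, beyond what the summary states, that a claim whose effect points in opposite directions on the two splits is rejected. All function names are hypothetical.

```python
import random
import statistics


def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled_var = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    return (statistics.fmean(a) - statistics.fmean(b)) / pooled_var ** 0.5


def permutation_p(a, b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the difference in group means."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = statistics.fmean(pooled[:len(a)]) - statistics.fmean(pooled[len(a):])
        if abs(diff) >= observed:
            extreme += 1
    # The +1 terms count the observed labeling and keep p strictly positive.
    return (extreme + 1) / (n_perm + 1)


def passes_split(a, b, min_effect=0.2, alpha=0.05):
    """Effect-size and significance thresholds must both hold on one split."""
    return abs(cohens_d(a, b)) >= min_effect and permutation_p(a, b) <= alpha


def accept_claim(train, heldout, min_effect=0.2, alpha=0.05):
    """Accept a claim only if it passes on BOTH the training and held-out split.

    train and heldout are (group_a_values, group_b_values) pairs.
    """
    # Assumption: the effect must point the same way on both splits.
    if cohens_d(*train) * cohens_d(*heldout) <= 0:
        return False
    return (passes_split(*train, min_effect, alpha)
            and passes_split(*heldout, min_effect, alpha))
```

In the paper's setting, each proposed hypothesis's code would produce the two groups of values per split; the sketch leaves that extraction step out and only shows the gate.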

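The kind of bookkeeping the Reflect module works from can be illustrated with a small summary function over the claim history. In DiscoPER this analysis is a meta-reasoning step that emits structured guidance; the sketch below only shows the sort of statistics such guidance could be built on, assuming each claim records which variables it involves and whether it was accepted. `Claim`, `reflect`, and the `min_tests` cut-off are hypothetical names, and confound detection is left out.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Claim:
    variables: tuple  # columns or extracted attributes the hypothesis involves
    accepted: bool    # whether it passed the acceptance gate


def reflect(claims, all_variables, min_tests=2):
    """Summarize accepted and rejected claims into guidance for the next round."""
    tested = Counter(v for c in claims for v in c.variables)
    accepted_vars = {v for c in claims if c.accepted for v in c.variables}
    compound = sum(1 for c in claims if len(c.variables) > 2)
    return {
        # Gaps: variables tested fewer than min_tests times (including never).
        "under_explored": sorted(v for v in all_variables if tested[v] < min_tests),
        # Variables tested at least once but never part of an accepted claim.
        "only_rejected": sorted(v for v in tested if v not in accepted_vars),
        # Share of hypotheses that go beyond a simple pairwise comparison.
        "compound_share": compound / len(claims) if claims else 0.0,
    }
```

A prompt for the next proposal round could then steer the agent toward `under_explored` variables, or toward compound hypotheses when `compound_share` is low.
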
Key Results:

On iNatDisco-800 (9 ecological patterns from peer-reviewed literature): 8/9 patterns recovered with a 72.7% hypothesis support rate, versus 3/9 for the guided baselines (HeurekaBench, ExperiGen) and at most 1/9 for the causal discovery baselines (PC, NOTEARS, and the LLM-based GPT-4 BFS).
On iNatDisco-50K (12 patterns across 9,776 species): 8/12 patterns recovered with a 74.2% support rate.
Ablation without Reflect: recall drops from 8/9 to 7/9 on iNatDisco-800 and from 8/12 to 6/12 on iNatDisco-50K, and hypothesis diversity collapses, with 92% of hypotheses being simple pairwise comparisons (vs. 69% with Reflect).
Counterfactual validation (iNatDisco-800-CF) confirms discoveries are data-grounded rather than memorized: the system correctly reports reversed ecological patterns present in the modified tabular data.
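
The logic of the counterfactual check can be sketched as follows, assuming a pattern of the form "group A has higher values than group B" and assuming the counterfactual data is produced by swapping the two group labels; the summary does not say how iNatDisco-800-CF was actually constructed. `report_direction` stands in for the whole discovery system and is hypothetical, as are the two toy systems: one that reads the data and one that always repeats the textbook direction.

```python
def mean_gap(rows, group_a, group_b):
    """Mean value of group_a minus mean value of group_b."""
    a = [v for g, v in rows if g == group_a]
    b = [v for g, v in rows if g == group_b]
    return sum(a) / len(a) - sum(b) / len(b)


def make_counterfactual(rows, group_a, group_b):
    """Reverse the pattern by swapping the two group labels in every row."""
    swap = {group_a: group_b, group_b: group_a}
    return [(swap.get(g, g), v) for g, v in rows]


def passes_counterfactual_check(report_direction, rows, group_a, group_b):
    """A data-grounded system must report opposite directions on real vs. CF data."""
    cf_rows = make_counterfactual(rows, group_a, group_b)
    return (report_direction(rows, group_a, group_b)
            == -report_direction(cf_rows, group_a, group_b))


def grounded_system(rows, group_a, group_b):
    """Toy system: reports the direction actually present in the data."""
    return 1 if mean_gap(rows, group_a, group_b) > 0 else -1


def memorizing_system(rows, group_a, group_b):
    """Toy system: always reports the memorized direction, ignoring the data."""
    return 1
```

With `rows = [("forest", 5.0), ("forest", 6.0), ("urban", 2.0), ("urban", 3.0)]`, the grounded system passes the check and the memorizing one fails it, which is the distinction the counterfactual benchmark is designed to expose.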

Applied Context: For builders, DiscoPER demonstrates that combining code-driven hypothesis generation with reflective meta-analysis enables LLM agents to autonomously surface validated, non-trivial patterns from raw multimodal data without any pre-specified research question. This pattern - statistically grounded proposals plus periodic self-reflection over accumulated findings - carries over to other domains where you want agents to conduct open-ended exploration of large datasets.