Leanstral 1.5: Mistral's Open Theorem-Proving Model Hits 100% on miniF2F

TL;DR
Mistral releases Leanstral 1.5, an Apache-2.0 licensed 119B parameter model (6B active) for Lean 4 theorem proving that saturates miniF2F and achieves SOTA on FATE benchmarks.
Mistral has released Leanstral 1.5, an open-weight model specialized for formal theorem proving in Lean 4. The headline numbers are striking: 100% on the miniF2F benchmark (both validation and test sets), 587 out of 672 problems solved on PutnamBench, and state-of-the-art results on the FATE-H and FATE-X evaluation suites.
The model is licensed Apache 2.0 and weighs in at 119B total parameters with only 6B active - a sparse mixture-of-experts architecture that makes it runnable on consumer hardware while maintaining frontier-level performance on formal proof tasks.
Last updated: July 4, 2026
What Leanstral 1.5 Does
Leanstral operates in two specialized environments designed for Lean 4 development:
Multiturn Proof Environment: The model receives theorem statements, submits proof attempts, receives compiler feedback, and iteratively refines its approach until the proof compiles successfully. This mirrors how human mathematicians work with proof assistants - write, get errors, fix, repeat.
Code Agent Environment: Beyond pure proving, Leanstral can function as a development agent - editing files, running bash commands, and using the Lean language server for real-time inspection of goals and type errors. This is closer to how developers actually interact with Lean in practice.
The practical result is a model that can take a theorem statement and, given sufficient token budget, produce a machine-checked proof without human intervention.
The Benchmark Claims
Mistral's published numbers:
| Benchmark | Leanstral 1.5 | Notes |
|---|---|---|
| miniF2F (validation) | 100% | Full saturation |
| miniF2F (test) | 100% | Full saturation |
| PutnamBench | 587/672 | At 4M token budget |
| FATE-H | 87% | State-of-the-art |
| FATE-X | 34% | State-of-the-art |
| FLTEval | Surpasses Claude Opus | At 1/7th the cost |
The miniF2F saturation is significant because this benchmark has been a standard evaluation for theorem-proving systems. Reaching 100% means the benchmark is no longer useful for differentiating models on this task - Leanstral has effectively solved it.
PutnamBench measures performance on competition-level mathematics problems. Solving 87% (587/672) with a 4M token budget demonstrates strong test-time scaling - the model gets better with more compute.
What HN Is Saying
The Hacker News discussion generated substantive debate about the practical implications and some skepticism about the marketing claims.
The bug-finding example drew fire. Mistral highlighted that Leanstral found an overflow bug in the datrs/varinteger Rust library - "an edge case that testing and fuzzing would typically miss." Multiple commenters pushed back hard on this characterization. One pointed out that any property-based testing system invented since 1980 would explore boundary values like U64.MAX. Another reproduced the bug in seconds using proptest.
The consensus view: the bug was real, but calling it something "testing would typically miss" overstates the case. Fuzz testing with boundary value exploration would catch this routinely. The value of formal verification is proving absence of bugs, not finding obvious ones that good testing would catch anyway.
An OpenAI employee weighed in. One commenter (disclosing they work at OpenAI) ran GPT-5.5 High on the same varinteger repository and found the identical bug. Their point: this particular bug was not tricky; the repository simply lacked attention. The interesting question is whether Leanstral can prove properties that LLM-based bug finding cannot.
The comparison chart timing is awkward. The article compares Leanstral to models from "half a year ago" - several generations behind in the current pace of releases. Commenters noted this is a familiar pattern in benchmark marketing: compare to a snapshot of competitors rather than current versions.
The size efficiency is genuinely impressive. At 6B active parameters (119B total with sparse activation), Leanstral is dramatically smaller than the models it outperforms on these benchmarks. Several commenters noted this is the real story - not that it beats large models, but that it does so at 1/50th the active parameter count.
European AI labs are finding their niche. Some discussion touched on Mistral's strategy of targeting specialized domains (OCR, theorem proving) where they can achieve frontier performance without competing head-to-head with OpenAI and Anthropic on general capabilities. This is pragmatic: France has strong historical expertise in formal methods (Coq, OCaml ecosystem), and Mistral is leveraging that heritage.
Newsletter
Get the weekly deep dive
Tutorials on Claude Code, AI agents, and dev tools, delivered free every week.
From the archive
Agent Studio: Authoring the Roles, Not Just the Knowledge
Jul 3, 2026 • 9 min read
App Builder: From a Prompt to a Working App You Can Watch Run
Jul 3, 2026 • 8 min read
One Endpoint, Every Capability: A Reference Architecture for Progressive Disclosure
Jul 3, 2026 • 10 min read
Best AI Agent Memory Providers in 2026: Mem0 vs Zep vs Letta vs Cloudflare
Jul 2, 2026 • 11 min read
Real-World Bug Finding
Beyond benchmarks, Mistral claims Leanstral discovered "5 previously unknown bugs across 57 repositories tested." The most interesting was the varinteger overflow:
// On input Std.U64.MAX, the expression (value + 1) overflowed
// Crashes in debug mode, silent corruption in release mode
The bug was filed as datrs/varinteger#8 a week before the Leanstral announcement. The library is small (about 1k downloads/day on crates.io) and hadn't been touched in 8 years - exactly the kind of low-attention code where automated verification adds value.
The broader point: formal verification tools are not primarily about finding bugs that testing misses. They're about proving properties hold for all inputs, which testing fundamentally cannot do. The bug-finding framing is easier to market, but it undersells the actual capability.
How Developers Can Use It
Leanstral 1.5 is available through:
- Mistral Vibe - Free API endpoint
- Hugging Face - Downloadable weights
- OpenATP - An open-source Python package for automated theorem provers that supports Leanstral natively (GitHub)
The practical workflow involves writing Lean 4 code with theorem statements, then using Leanstral to generate proofs. The model integrates with the Lean language server, so it can inspect intermediate proof states and adjust its approach based on type errors.
For developers new to Lean 4, the learning curve is real but manageable. One HN commenter reported going from zero knowledge to productive Lean 4 development in six months, heavily assisted by LLMs (including but not limited to Leanstral). The key insight: you need to understand the axioms and theorem statements you're trying to prove, but the model can handle much of the proof construction machinery.
The Bigger Picture: Verified AI Code
The interesting application is not mathematical theorem proving - it's using Lean 4 as a target for verified code generation.
Several commenters discussed using Lean 4 as:
- A metaprogramming framework that lowers to other languages (C++, Rust, Haskell) with provable correspondence
- A tool for describing state machines and protocols with formal correctness guarantees
- A GPU kernel compiler where tiling and scheduling properties can be formally verified
One commenter reported using Lean 4 bolted to io_uring for systems programming, with benchmarks that outperform nginx on reverse proxy workloads. The combination of a proof-capable language with competitive runtime performance opens possibilities that traditional formal methods tools (slow, academic) could not reach.
The thesis: as LLM-generated code increases, the need for verification increases proportionally. If humans are no longer reviewing every line, machine-checkable correctness proofs become more valuable. Leanstral points toward a workflow where LLMs write code and other LLMs (or the same LLM) prove properties about it.
Limitations and Caveats
Training data uncertainty. The model's performance on specific repositories may reflect training data contamination rather than generalization. This is difficult to rule out.
Benchmark saturation. 100% on miniF2F is impressive, but it means the benchmark is exhausted. Future evaluations will need harder problems.
Practical adoption barriers. Most developers do not write Lean 4. The path from "LLM can prove theorems" to "my production code has machine-checked properties" involves substantial tooling and process changes.
Comparison to non-specialized models. The FLTEval comparison to Claude Opus is interesting, but Opus is a general-purpose model. The more relevant comparison would be to other specialized theorem provers, which the release does not address.
Why This Matters for Developers
Short term: if you work with Lean 4 or are interested in formal verification, Leanstral 1.5 is the best open-weight option available. The Apache 2.0 license means you can integrate it into commercial tooling without restrictions.
Medium term: the combination of small active parameter count and strong performance suggests specialized models will remain competitive against larger general-purpose models for specific domains. This has implications for how teams choose AI tooling - domain-specific may beat one-size-fits-all.
Long term: the vision of LLM-generated code with machine-checked correctness proofs is getting more practical. Leanstral is a step toward workflows where code and proofs are generated together, reducing the gap between "it compiles" and "it's correct."
Sources
- Mistral Leanstral 1.5 announcement
- Hacker News discussion
- datrs/varinteger repository
- OpenATP - open-source automated theorem prover
FAQ
What is Leanstral 1.5 and what can it do?
Leanstral 1.5 is Mistral's open-weight model for formal theorem proving in Lean 4. It can take theorem statements, generate proofs, interact with the Lean compiler for feedback, and iteratively refine proofs until they pass verification. It achieves 100% on the miniF2F benchmark and state-of-the-art results on FATE evaluations.
How many parameters does Leanstral 1.5 have?
The model has 119B total parameters but only 6B active due to its sparse mixture-of-experts architecture. This makes it runnable on consumer hardware while maintaining strong performance on theorem-proving tasks.
Can Leanstral 1.5 find bugs in code?
The model can identify bugs by attempting to prove properties about code and failing when those properties don't hold. Mistral claims it found 5 previously unknown bugs across 57 repositories. However, HN commenters noted that the highlighted example (an overflow bug) would have been caught by standard property-based testing or fuzzing.
Is Leanstral 1.5 open source?
Yes, the model is released under the Apache 2.0 license. Weights are available on Hugging Face, and the model is accessible through Mistral's Vibe API. This allows commercial use without restrictions.
Read next
Godot Bans AI-Authored Code Contributions - What It Means for Open Source
The Godot Foundation has established a policy banning autonomous AI agent code and substantial AI-generated contributions, citing reviewer burnout and concerns about maintainer mentorship.
6 min readEpic Games Releases Lore: A Version Control System Built for Game Development
Epic Games open-sourced Lore, a centralized version control system designed for binary-heavy game projects. It uses Merkle trees, on-demand file hydration, and native chunked storage to handle terabyte-scale repos that Git struggles with.
7 min readOuter Shell: A Graphical Desktop for Your Remote Server via SSH
A new project proposes a graphical shell layer for SSH that turns remote servers into browsable desktops. The HN discussion digs into architecture choices, the terminology debate, and whether this solves a real problem.
8 min readTechnical content at the intersection of AI and development. Building with AI agents, Claude Code, and modern dev tools - then showing you exactly how it works.









