
TL;DR
DeepReinforce AI released Ornith-1.0, a family of open-source coding models claiming self-improvement. The HN thread reveals a mix of skepticism and genuine interest - here is what the model actually does and whether the hype holds up.
DeepReinforce AI dropped Ornith-1.0 on GitHub this week, and the Hacker News thread quickly accumulated over 200 points and 40 comments. The headline claims the model family is "self-improving" - a phrase that immediately raises eyebrows in a space where overpromising has become the norm.
The model family ships in four sizes: 9B-Dense, 31B-Dense, 35B-MoE, and 397B-MoE. They are built on top of Gemma 4 and Qwen 3.5 foundations, released under the MIT license. The dense 9B fits on a single 80GB GPU, while the larger MoE variants require multi-GPU tensor parallelism.
Let me be direct: Ornith-1.0 does not improve itself during inference. The weights do not change when you run it. The "self-improving" label refers to the training methodology, not the deployment behavior.
According to DeepReinforce's documentation, Ornith uses a reinforcement learning approach that jointly optimizes two components: the scaffolding that drives rollouts and the solution rollouts themselves. In practical terms, the model learns to generate both its answers and the task-specific harnesses that guide how those answers are produced.
This is different from most RL-for-coding approaches where humans design the evaluation harnesses and the model just learns to produce better solutions within that fixed structure. Ornith learns the structure too.
Simon W caught this distinction immediately in the HN thread:
It doesn't self-improve, that's a misleading headline. As far as I can tell they trained it by running their own reinforcement learning on top of Qwen and Gemma 4 - so the "self-improving" is about their training process, not how you use the weights.
This is the accurate framing. The term "self-improving" describes the training loop, not runtime behavior.
The Hacker News discussion split into three camps.
The skeptics pointed out that this looks like another benchmaxxed fine-tune. One commenter put it bluntly: "These are simply benchmaxxed versions of either Qwen or Gemma 4." Another noted that the model "fails at benchmarks" and "long session tool calls sucks and hallucinate a lot."
The LocalLLM community's reputation came up repeatedly. One commenter observed that "the local LLM community is now teeming with erstwhile crypto and NFT hucksters who've brought the culture of hype from their former communities with them." Whether or not that is fair to everyone building local models, it does explain some of the reflexive skepticism.
The cautiously interested noted that this is the first Qwen fine-tune that has not been immediately rejected by the LocalLLM community. One user reported: "Based on my limited usage, it is good, gives creative solutions to coding problems. I don't expect 9-35B models to one-click create full apps."
Another commenter shared a more specific observation: "From what I personally tested Ornith-1.0 35B is slightly better than Qwen-3.6 35B. The part that I find interesting is that the model is way faster than Qwen3.6 35B. It seems Ornith produce a smaller chain of thought. On my test it can be 3 time faster to produce the answer."
Speed improvements in chain-of-thought models matter. If Ornith produces equally good solutions with less verbose reasoning, that is a real win for practical use.
The critics of benchmarks questioned the evaluation methodology entirely. One commenter noted that the benchmark "ranks Kimi K2.6 and K2.7 Code near the bottom. Both are below Ornith 35B. It ranks Gemma 4 26B much higher than GLM-5.2. The results don't make much sense."
This is a recurring problem in the open-source model space. Benchmarks are gamed, and results that contradict real-world experience are common.
Get the weekly deep dive
Tutorials on Claude Code, AI agents, and dev tools - delivered free every week.
From the archive
Jun 30, 2026 • 8 min read
Jun 29, 2026 • 8 min read
Jun 28, 2026 • 8 min read
Jun 28, 2026 • 9 min read
DeepReinforce published performance numbers across several evaluation suites:
| Model | SWE-Bench Verified | Terminal-Bench 2.1 |
|---|---|---|
| Ornith 9B | 69.4% | - |
| Qwen3.5-9B | 53.2% | - |
| Ornith 35B | - | 64.2% |
| Qwen3.5-35B | - | 41.4% |
| Ornith 397B | 82.4% | - |
| Claude Opus 4.8 | 87.6% | - |
These are substantial claimed improvements over the base models. The 397B MoE variant allegedly approaches Claude Opus 4.8 on SWE-Bench Verified.
Take these with appropriate skepticism. SWE-Bench has become a target for optimization, and models trained specifically on similar distributions tend to overperform on benchmarks while underperforming on novel tasks.
The architecture is a reasoning model - it generates internal chain-of-thought before producing final answers. All variants support a 256K context window and emit well-formed function calls for tool use.
Runtime compatibility is broad: vLLM, SGLang, llama.cpp, and Ollama are all supported. The recommended inference settings are temperature=0.6, top_p=0.95, top_k=20 for typical deployment, though benchmarks used temperature=1.0.
One HN commenter raised the accessibility issue: "Us mere mortals cannot use this" - referring to the 80GB GPU requirement for even the smallest dense model. This is fair criticism. The quantized versions that appeared shortly after launch help somewhat, but the barrier to entry remains high compared to smaller models.
An interesting critique emerged in the thread about testing methodology. One reviewer tested Ornith without tool access and found it "performs poorly in a chat without tools, exhibiting an enthusiasm for hallucination."
Another commenter pushed back:
How is that a serious phrase in '26? Testing a (clearly) agentic model without tool access and expecting it to work is crazy, no? What was he even testing?!
This is the right frame. Agentic coding models are designed to use tools - file systems, interpreters, search. Testing them without tools is like testing a car without wheels. The model may hallucinate tool outputs rather than admitting it cannot execute, but that says more about how it should be deployed than about its fundamental capability.
If you are already running local models for coding tasks and have the hardware, Ornith is worth testing against your actual workloads. The reports of faster chain-of-thought generation are interesting if they hold up.
If you are looking for a drop-in replacement for Claude or GPT-4 for general coding assistance, this probably is not it. The tool-use requirements and hallucination tendencies outside agentic harnesses make it a poor fit for casual chat use.
If you are evaluating open-source coding models for a self-hosted code assistant pipeline, add Ornith to your test matrix alongside Qwen 3.6 and Gemma 4. The proof will be in how it performs on your specific codebase and task distribution, not in benchmark tables.
The "self-improving" framing is marketing. The underlying approach - jointly optimizing solutions and scaffolding - is technically interesting but does not change how you deploy or use the model. Judge it on outputs, not on training methodology claims.
No, not during inference. The "self-improving" label describes the training methodology where the model learns to generate both solutions and the evaluation harnesses that guide those solutions. Once trained, the weights are fixed like any other model.
The dense 9B model requires an 80GB GPU. The larger MoE variants need multi-GPU tensor parallelism. Quantized versions are available for more accessible hardware, but the full-precision models have steep requirements.
Reports are mixed. Some users report it slightly outperforms Qwen 3.6 35B with faster chain-of-thought generation. Others say it hallucinates more in long sessions. Your results will depend on your specific use case.
No. It is designed for agentic use with tool access. Testing it as a chat model without tools produces poor results with frequent hallucinations. Deploy it in an agentic harness with proper tool access.
Read next
Qwen3.6-27B keeps pulling developers back because it sits in the awkward, useful middle: strong enough for real local coding tasks, small enough for serious workstation testing, and cheap enough to benchmark honestly.
8 min readSemgrep's security research team benchmarked LLMs on IDOR vulnerability detection. The open-weight GLM 5.2 beat Claude Code by 7 points at roughly one-sixth the cost.
6 min readAlex Ellis shares real production experience running local LLMs: $12k hardware investment, 2-3 month ROI, and why treating local models as Opus substitutes misses the point entirely.
7 min readTechnical content at the intersection of AI and development. Building with AI agents, Claude Code, and modern dev tools - then showing you exactly how it works.
Open-source AI pair programming in your terminal. Works with any LLM - Claude, GPT, Gemini, local models. Git-aware ed...
View ToolOpen-source AI coding agent for terminal, desktop, and IDE. Works with 75+ LLM providers including Claude, GPT, Gemini,...
View ToolOpen-source terminal agent runtime with approval modes, rollback snapshots, MCP servers, LSP diagnostics, and a headless...
View ToolOpen-source AI code assistant for VS Code and JetBrains. Bring your own model - local or API. Tab autocomplete, chat,...
View ToolTrack open-source maintenance signals, release tasks, and repo follow-ups in one dashboard.
View AppCompare AI coding agents on reproducible tasks with scored, shareable runs.
View AppSchedule jobs in plain English. See what ran, what broke, what's next.
View AppInstall Ollama and LM Studio, pull your first model, and run AI locally for coding, chat, and automation - with zero cloud dependency.
Getting StartedWhat MCP servers are, how they work, and how to build your own in 5 minutes.
AI AgentsStep-by-step guide to building an MCP server in TypeScript - from project setup to tool definitions, resource handling, testing, and deployment.
AI Agents
OpenAI Codex Desktop App: Plan/Goal Modes, Plugins, Multi-Agent Workflows & UI Annotation Demo The video showcases OpenAI’s Codex desktop app, which the creator calls OpenAI’s best product and a prem...

Nimbalyst Demo: A Visual Workspace for Codex + Claude Code with Kanban, Plans, and AI Commits Try it: https://nimbalyst.com/ Star Repo Here: https://github.com/Nimbalyst/nimbalyst This video demos N...

Qwen3.6-27B keeps pulling developers back because it sits in the awkward, useful middle: strong enough for real local co...

Semgrep's security research team benchmarked LLMs on IDOR vulnerability detection. The open-weight GLM 5.2 beat Claude C...

Alex Ellis shares real production experience running local LLMs: $12k hardware investment, 2-3 month ROI, and why treati...

Baidu releases Unlimited OCR, an open-source vision-language model that parses 100+ page documents in a single pass with...

A new project proposes a graphical shell layer for SSH that turns remote servers into browsable desktops. The HN discuss...

A developer fed 266MB of DICOM MRI data to Claude Code Opus for a second opinion on a shoulder diagnosis. The AI disagre...

New tutorials, open-source projects, and deep dives on coding agents - delivered weekly.