Ornith Is the Open-Source Coding Model Built for Agents, Not Humans

In brief

DeepReinforce released Ornith-1.0 on June 25 under MIT license, purpose-built for AI coding agents working in real terminal and repository environments.
The 9B variant scores 69.4 on SWE-bench Verified, outperforming Google’s Gemma 4-31B (52.0).
Ornith’s own model card warns the models may underperform on non-coding tasks—they are wired for developer pipelines, not general-purpose AI conversations.

DeepReinforce, an AI research lab previously known for CUDA-L1 and the IterX code-agent optimization loop, released Ornith-1.0 late last week—a family of open-source coding models available on Hugging Face in four sizes based on the number of parameters: 9 billion, 31 billion, 35 billion mixture of experts, and a 397 billion mixture-of-experts flagship, all under MIT license with no regional restrictions.

Parameters are basically the number of dials and configurations a model can handle on its training. The more parameters, the more capable a model is. A 9-billion-parameter model is considered small, good enough to run on a good smartphone, but not capable of doing any heavy reasoning task reliably. A 397 billion model is much more capable, but requires some heavy computing, the kind that is not available on consumer hardware.

The lab describes it as “a self-improving family of open-source models specially for agentic coding tasks.” That word—agentic—is doing a lot of work.

Aloha! 🌺 Meet Ornith-1.0, a family of open-source LLMs specialized for agentic coding.

Ornith-1.0 spans the full parameter sizes including 9B Dense, 31B Dense, 35B MoE, and 397B MoE. It achieves state-of-the-art performance among open-source models of comparable size on… pic.twitter.com/7g1rmacLps

— Ornith (@ornith_) June 25, 2026

Most AI that people interact with is conversational: you type, it responds, the exchange ends. Agentic AI is different—it gets a task and takes actions to complete it without a human guiding each step. In a coding context, that means an AI that reads files, runs tests, identifies what failed, fixes the code, and loops again until it’s done.

So Agentic AI means no one needs to be at the keyboard for most of the time. That’s the whole point. This is also the direction where the most commercially relevant progress is happening in 2026—the models that can run unsupervised through 20-step dev workflows are worth more than the ones that write a clean function on request.

However, most large language models are still designed with human feedback in mind.

How Ornith’s brain works

Most AI coding agents are paired with a human-designed harness—a fixed set of rules for how the agent structures its work: when to call a tool, how to handle an error, how to decompose a multi-step problem. Ornith instead “treats the scaffold as a learnable object that co-evolves with the policy.”

Translation: instead of inheriting someone else’s playbook, it develops its own.

During reinforcement learning, each training step happens in two stages. The model first reads the task and proposes a refined strategy for approaching it. Then it uses that strategy to generate a solution.

The reward from the outcome flows back to both stages—so the model is optimized for writing better strategies, not just better code. Do that thousands and millions of times, and task-specific approaches emerge without a human engineering them.

DeepReinforce also takes reward hacking seriously. If the model can write its own training scaffold, it can theoretically write a scaffold that games the verifier—touching a file to make it look like it completed a task without actually doing the work. Three layers of defense block this: the environment and test suite are immutable and outside the model’s reach, a deterministic monitor flags any attempt to access restricted paths or alter verification scripts, and a frozen judge model sits on top of the automated verifier as a veto.

The numbers

The flagship 397 billion parameter model posts 82.4 on SWE-bench Verified—a test where an AI is given a real bug from an open-source GitHub repository and must fix it without seeing the test suite, scored as the percentage of issues it successfully resolves.

That beats Claude Opus 4.7’s 80.8 and DeepSeek-V4-Pro’s 80.6 on the same test. On Terminal Bench 2.1—89 tasks run inside containerized terminal environments ranging from debugging async code to resolving security vulnerabilities, scored by completion rate—it posts 77.5 against Claude Opus 4.7’s 70.3.

Given that SWE-bench contamination concerns have been raised publicly—OpenAI argued earlier this year that models were inflating scores by memorizing benchmark solutions seen during training—Ornith also reports numbers on SWE-bench Pro, a harder version using more diverse, less-leaked codebases scored the same way. The 397 billion model lands at 62.2 there. Meaningfully lower, but still competitive with the field, and still better than Deepseek V4 Pro.

The 9 billion parameter model might be the more interesting data point. It posts 69.4 on SWE-bench Verified—higher than Gemma 4-31B’s 52 and competitive with Qwen 3.5-35B’s 70, despite being 3-4 times smaller.

Who it’s for, and who it isn’t

Ornith-1.0 is explicitly not a general-purpose AI. The model’s own documentation says it may underperform on tasks outside agentic coding. If you want AI to summarize a document, help you write your doctoral thesis, or draft an email, Ornith-1.0 is the wrong pick.

It’s optimized for a narrow problem set: developer pipelines where an AI agent takes a task description, operates inside a code repository or terminal session, and completes multi-step work without intervention. This is a tool that was built for people who are already running agent infrastructure—not for people trying to decide if AI is worth using.

The “beats Claude” headline is real but requires context. As Decrypt reported, every lab is now chasing performance on agentic coding evals, because that’s where the useful performance differences live.

Ornith-1.0-397B does surpass Claude Opus 4.7 on both different coding benchmarks, but Anthropic’s current flagship, Claude Opus 4.8, scores higher. The comparison that holds is within the open-source category, at comparable parameter counts, on coding-specific agent tasks.

For developers building self-hosted coding pipelines, agentic infrastructure, or similar coding-focused work, the small and medium models running on edge hardware may be genuinely useful, but the average Joe may be better looking somewhere else.

Daily Debrief Newsletter

Start every day with the top news stories right now, plus original features, a podcast, videos and more.

Artificial Intelligence#Ornith #OpenSource #Coding #Model #Built #Agents #Humans1782770247

Ornith Is the Open-Source Coding Model Built for Agents, Not Humans

In brief

How Ornith’s brain works

The numbers

Who it’s for, and who it isn’t

Daily Debrief Newsletter

wellnessfitpro

Johnathan DoeCoin

Industry Talk

Newsletter

Related Posts

Strategy Snaps 9-Day Losing Streak as Bitcoin Giant Adopts 'Robust' Capital Framework