A cliff-like decline: even the most powerful AI cannot handle long-term development. The more code it piles up, the faster the system breaks down.
(Source: DeepTech)
Ask AI to write a single function and it is nearly unbeatable. So why does it start to fall apart when asked to maintain a system?
Today, artificial intelligence has entered its "second half." As AI programming capability keeps improving, products like OpenClaw are gradually emerging and "CLI everything" is becoming a reality: rather than operating a computer directly, AI turns every interface into a command-line interface (CLI), and one skill after another is transformed into a software function.
An Agent is no longer just a conversational tool that executes a single task; it is evolving into a system that operates over the long term, interacts with the real world, and carries out complex tasks. This raises a new question: as it continuously evolves, can AI keep adapting to new environments and maintain a stable ability to develop software?
Yao Shunyu, chief AI scientist at Tencent's CEO/President's Office, noted in a blog post titled "The Second Half" that real programming tasks are continuously dependent rather than independent and parallel. Yet academia currently has no benchmark that evaluates the capabilities AI needs in that setting, and has lacked the will to break the long-accepted, problem-simplifying assumption that tasks are independent.
Recently, a joint team from the University of Southern California, the University of California, Riverside, Stanford University, Princeton University, OpenHands, and others released a brand-new evaluation benchmark, EvoClaw, which proposes a new solution to these issues. The team extracted high-quality code-evolution histories from open-source projects, requiring an Agent to complete dozens of interdependent functional iterations, in sequence, within the same code repository.
The results show that top AI models perform exceptionally well on independent evaluation tasks (80%+), but once they enter a realistic long-horizon scenario, even Claude Opus 4.6, which has the highest overall score, reaches only 38.03%. AI easily drifts off track when executing tasks with higher degrees of freedom, and a significant gap remains before it can truly handle long-cycle, continuous software evolution.
(Source: arXiv)
This study reveals that during long-term evolution, AI easily falls into a snowballing technical-debt trap: even though it can keep adding new features, it cannot contain the accumulation of regression errors, and the system eventually spirals out of control. It also implies that AI programming is shifting from writing code to governing systems.
The related paper is titled “EvoClaw: Evaluating AI Agents on Continuous Software Evolution,” and was recently published as a preprint on the arXiv website [1].
Figure | Related paper (Source: arXiv)
Why do existing AI programming evaluations diverge from real-world experience, and where exactly is the problem?
Why do top models that score high on independent evaluations collectively fail on EvoClaw? The root cause is that the evaluation paradigm has changed.
In prior research, mainstream programming benchmarks mostly focused on independent tasks: given an issue or a pull request (PR), the model completes a fix on a static code snapshot, and once verification passes, the evaluation is complete.
But there is a gap that cannot be ignored between past benchmark results and real development capability: a static environment is a relatively idealized state, while the real environment is more complex and dynamic. Over time, even a small bug from months ago can snowball across version iterations and eventually crash the system.
(Source: arXiv)
Deng Gangda, the paper's first author and a PhD student at the University of Southern California, told DeepTech: "The granularity of current commits and releases is either too fine or too coarse, so these development histories cannot reflect the process of software evolution."
Figure | Deng Gangda (Source: interviewee)
For the first time, the research team introduced the time dimension into the evaluation of AI programming capability. They adopted a brand-new unit of organization, the Milestone, to reconstruct the history of software evolution: functional units that preserve semantic completeness while retaining evolution dependencies. The AI must complete multiple functional units in sequence on the same codebase, so that the output of each step is preserved and becomes the starting point of the next.
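The Milestone structure described above can be sketched as a small dependency graph. This is an illustrative schema under assumed names (`Milestone`, `depends_on`), not the paper's actual data model:

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

@dataclass
class Milestone:
    """One functionally cohesive unit of evolution (illustrative schema)."""
    name: str
    commits: list[str]                  # original Git commits folded into this unit
    depends_on: list[str] = field(default_factory=list)

def execution_order(milestones: list[Milestone]) -> list[str]:
    """Order in which an agent must complete the milestones:
    each milestone starts from the code state its dependencies produced."""
    graph = {m.name: set(m.depends_on) for m in milestones}
    return list(TopologicalSorter(graph).static_order())

# Example: a parser must exist before the CLI feature that calls it.
ms = [
    Milestone("add-parser", ["c1", "c2"]),
    Milestone("add-cli", ["c3"], depends_on=["add-parser"]),
    Milestone("add-logging", ["c4"]),
]
order = execution_order(ms)
```

The topological order is what makes each step's output the next step's input, rather than evaluating each commit against an independent snapshot.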
(Source: arXiv)
To extract high-quality software evolution histories from large collections of open-source repositories, the researchers built an Agent-driven automated pipeline called DeepCommit on top of top-tier AI capabilities. For the first time, it reconstructs messy Git development records into a verifiable, functionally cohesive Milestone dependency graph (Milestone DAG) and builds an evaluation environment for each Milestone. The pipeline has three stages: Git history preprocessing, Agent-driven DAG construction, and Milestone environment configuration and verification.
In practice, reconstructing evolution history as Milestones is not easy: the goal is not merely a static, observable DAG, but a sequence of executable evaluation environments that remain correct as evolution dependencies change.
This means that when you break up the original order of commits and regroup and reconnect them, you may run into commits that cannot be applied, interfaces that do not align, and massive compilation errors. To address this, the researchers designed an iterative repair loop: the Agent proactively analyzes the error logs and dynamically modifies the Dockerfile to keep the environment executable.
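The repair loop can be sketched generically: attempt a build, and on failure hand the error log to a repair step, retrying within a budget. The `build` and `repair` callables below are toy stand-ins for the Docker build and the Agent's Dockerfile edits; this is an assumption about the loop's shape, not DeepCommit's code:

```python
from typing import Callable

def repair_loop(build: Callable[[str], tuple[bool, str]],
                repair: Callable[[str, str], str],
                dockerfile: str,
                max_iters: int = 5) -> tuple[bool, str]:
    """Iteratively build the environment; on failure, let the agent
    rewrite the Dockerfile based on the error log (illustrative)."""
    for _ in range(max_iters):
        ok, log = build(dockerfile)
        if ok:
            return True, dockerfile
        dockerfile = repair(dockerfile, log)  # agent edits based on the log
    return False, dockerfile

# Toy stand-ins: the build fails until a missing package is added.
def fake_build(df: str) -> tuple[bool, str]:
    if "libfoo" in df:
        return True, ""
    return False, "error: libfoo not found"

def fake_repair(df: str, log: str) -> str:
    if "libfoo" in log:
        return df + "\nRUN apt-get install -y libfoo"
    return df

ok, final_df = repair_loop(fake_build, fake_repair, "FROM python:3.12")
```

The bounded retry budget matters: an environment that never converges is dropped rather than looping forever.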
More importantly, the pipeline supplements implicit dependencies missing from the original DAG by adjusting the ordering constraints between Milestones, so that interface conflicts can be resolved properly. After repeated iterations, it ultimately collects 87.1% of the original test cases correctly.
"Compared with single-task programming scenarios, stable, reliable, and effective long-horizon autonomous programming is a more cutting-edge research hotspot. Anthropic and OpenAI, for example, have clearly stated that they have shifted their focus to training long-horizon programming capabilities," Deng Gangda said.
Figure | DeepCommit pipeline architecture diagram (Source: arXiv)
The researchers compared the evolution graphs automatically generated by DeepCommit with manual annotations by human experts. What surprised them was that the two follow different organizational logics and complement each other.
Specifically, human experts build Milestones within a local time window: they first define a topic and then reorganize the commits around it, a top-down semantic decomposition. DeepCommit, to guarantee accuracy, reconstructs the software evolution storyline bottom-up from the dependency relationships between commits, placing greater emphasis on topological structure and execution constraints.
For evaluation purposes, this is exactly the point: DeepCommit's key contribution lies in extracting an executable, verifiable Milestone structure from a project's development history. The results show that DeepCommit can filter out high-quality Milestone tasks suitable for evaluation, executable and verifiable in real environments, which underwrites the reliability of the benchmark.
Why do model scores collectively "halve" once they enter real development?
EvoClaw covers five mainstream languages: Python, Java, Go, Rust, and TypeScript. The selected projects span real development cycles of up to 750 days.
For evaluation metrics, the research team did not use a simple pass rate. Instead, each Milestone is scored as the F1 of two more fundamental dimensions, Recall and Precision: Recall measures functional completeness, while Precision captures how much the model breaks existing code when adding new functionality.
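The F1 combination can be shown in a few lines. The exact per-Milestone definitions of Recall and Precision in the paper may differ; this only illustrates why F1 punishes an agent that ships features while breaking the existing suite:

```python
def milestone_score(recall: float, precision: float) -> float:
    """F1 (harmonic mean) of recall (new functionality delivered) and
    precision (existing behavior preserved), as a per-Milestone score."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

# An agent with high feature output but heavy regressions scores worse
# than a balanced one: F1(0.9, 0.3) = 0.45 < F1(0.6, 0.6) = 0.60.
aggressive = milestone_score(0.9, 0.3)
balanced = milestone_score(0.6, 0.6)
```

The harmonic mean is the design choice here: unlike a plain average, it cannot be rescued by excelling on only one axis.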
The research team tested various framework and model combinations such as Claude Code and OpenHands. In independent evaluations, top models generally score 80%-90%, but on the EvoClaw benchmark they all drop sharply; Claude Opus 4.6, the highest scorer, reaches only 38.03%.
Figure | EvoClaw main experimental results (Source: arXiv)
GPT 5.3 Codex achieves a combined score of 28.88%, second only to Opus 4.6. Broken down by repository, GPT 5.3 Codex is weaker on the two Rust projects (Nushell, ripgrep), while in the other repositories it approaches or even exceeds Opus 4.6. In terms of complete resolution rate, even the best model, Gemini 3 Pro, reaches only 13.37%, and most of the correctly completed work is on tasks with no prior dependencies.
The researchers kept the overall cost within a reasonable range: with Claude Opus 4.5, a full evaluation costs about 500 USD; Kimi K2.5 and Gemini 3 Flash stay within 50 USD; smaller models cost even less.
(Source: arXiv)
So, if you give the model a longer development window, will it eventually finish 100% of the project?
The study's answer is no: however long the development window, every model's performance eventually hits a ceiling. The later a task sits in the sequence, and the deeper it is in the DAG, the lower the scores and resolution rates. Saturating-function extrapolation shows that even for the best model, Opus 4.6, the cumulative score is capped by a sub-linear asymptote around 45%.
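A ceiling of this kind can be illustrated with a simple exponential-approach model. The functional form and the parameters (`ceiling=45`, `tau=20`) are illustrative assumptions, not the paper's actual fit:

```python
import math

def cumulative_score(k: int, ceiling: float = 45.0, tau: float = 20.0) -> float:
    """Score after k milestones under a saturating model: it approaches
    `ceiling` asymptotically but never reaches it (illustrative)."""
    return ceiling * (1.0 - math.exp(-k / tau))

early_gain = cumulative_score(20) - cumulative_score(10)
late_gain = cumulative_score(500) - cumulative_score(250)
# Early milestones add real progress; very late ones add almost nothing,
# so a longer development window cannot push the score past the ceiling.
```

Under such a curve, doubling the window from 250 to 500 milestones buys essentially zero improvement, which matches the observation that later, deeper tasks score lower.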
"Although Anthropic's official website says Opus 4.6 performs better than 4.5 on long-horizon tasks, no detailed evaluation metrics are provided. EvoClaw verifies their claim from another angle," said Deng Gangda.
In addition, the experiments reveal significant differences between model families. The Claude and GPT families both improve steadily with each version in continuous-evolution scenarios: Opus 4.6 shows the best system-maintenance performance on long-horizon programming, while GPT 5.3 ranks second only because its poor performance on the Rust repositories drags down its score.
(Source: arXiv)
More surprisingly, the Gemini family shows a completely different trend: from 3 Flash to 3 Pro to 3.1 Pro, each generation starts stronger and performs better early on, but long-range performance shows almost no significant improvement. Deng Gangda explained: "The obvious decline in Gemini's long-horizon performance means that it not only follows instructions worse, increasingly disregarding the Software Requirements Specification (SRS), but also fails to maintain the software system it has built."
When the researchers decomposed the overall scores into Recall and Precision, a more interesting pattern emerged: Recall rises almost continuously, approaching linear growth. Even as the codebase grows more chaotic and fragile, the Agent remains good at implementing each newly assigned target function.
The real bottleneck is Precision: the Agent struggles to maintain the existing system. Regression errors accumulate faster than it can fix them, and this is the fundamental reason long-term development ultimately stalls.
Figure | Left: error chain schematic; Right: error chain distribution (Source: arXiv)
To better understand why models lose control during iteration, the research team proposed an error-chain analysis framework: they track each test from the first time it fails and observe whether the failure is inherited, propagated, skipped, or fixed in subsequent Milestones.
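Error chains of this kind can be tracked per test across the Milestone sequence. The state labels below are illustrative simplifications of the categories named above, not the paper's taxonomy:

```python
def classify_chain(history: list[bool]) -> list[str]:
    """Label each milestone's outcome for one test, given its pass/fail
    history across the sequence (illustrative error-chain states)."""
    labels = []
    for i, passed in enumerate(history):
        if passed:
            # Passing after a failure means the agent repaired the regression.
            labels.append("fixed" if i > 0 and not history[i - 1] else "pass")
        else:
            # A failure carried over from the previous milestone is inherited.
            labels.append("inherited" if i > 0 and not history[i - 1] else "new_failure")
    return labels

# A test that breaks at milestone 2, drags on, then finally gets repaired:
chain = classify_chain([True, False, False, True])
```

Aggregating these labels over every test yields the inherited-versus-fixed balance that the study uses to explain why debt accumulates.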
The results show that the rate at which new problems arise does not accelerate, and the model even passively repairs a substantial share of historical errors. But earlier errors accumulate far faster than they are repaired, ultimately leading to "technical-debt bankruptcy."
A general evaluation ground for debugging AI Harnesses
"Harness Engineering" has recently become a very hot concept: configuring the entire software development process into an environment suited to Agent involvement. EvoClaw provides a general playground for evaluating long-horizon code evolution, which makes it well suited to debugging AI Harness frameworks.
For example, in the failure cases observed in this study, if an Agent suddenly becomes hyperactive in its iterations or keeps cycling between editing and verifying, it has likely run into trouble. Building guardrails at those points lets you catch problems early and bring in timely human intervention, improving efficiency.
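A guardrail of the kind described could watch the agent's action stream and flag a stuck edit-verify loop. The window size, threshold, and action names are assumptions for illustration, not part of the study:

```python
def stuck_loop_guardrail(actions: list[str], window: int = 10,
                         edit_verify_ratio: float = 0.8) -> bool:
    """Flag for human intervention when the recent action window is
    dominated by repeated edit/verify cycles (illustrative heuristic)."""
    recent = actions[-window:]
    if len(recent) < window:
        return False  # not enough history to judge
    churn = sum(a in ("edit", "verify") for a in recent)
    return churn / window >= edit_verify_ratio

healthy = ["read", "edit", "verify", "read", "plan", "edit", "verify",
           "read", "plan", "commit"]
stuck = ["edit", "verify"] * 5
```

A real harness would attach such a check to the agent's tool-call log and pause the run for review when it fires.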
Since models are, by nature, far better at implementing new features than at maintaining long-standing old ones, will this lead to new software forms and development modes in the future?
For example, software may place greater emphasis on flexibility, compatibility, and more reliable large-scale refactoring; or it may become even more "disposable," with specific business logic generated on the fly and never maintained, while effort concentrates on strengthening reusable components and infrastructure.
The research team believes that by appropriately loosening software-quality constraints in the development mode, you can trade fewer human interventions for greater throughput, ultimately accelerating software iteration.
Deng Gangda pointed out: "This study proves we are on the right path. AI's long-horizon programming ability has not yet hit a bottleneck and can improve steadily over time. One day, a quantitative change in leaderboard scores may become a qualitative change that changes the world."
As the technology develops, AI may progress from gradually reducing human involvement in software development, to independently proposing new requirements and evolving the codebase, to ultimately surpassing humans entirely and achieving continuous self-evolution.
References:
Related paper:
Project homepage:
Layout: Liu Yakun