Harness is trending, but most people haven’t grasped who it is really coming for.
A morning in Palo Alto, coffee just served. Alan Walker scrolled through Anthropic’s article on the harness, looked up, and said simply:
“Many people think this is a minor improvement to the model. Wrong. This is the process starting to betray humans.”
On the surface, the article is about engineering design: planners, generators, evaluators, and how to keep Claude running for hours on end to build more complex products.
Most people stop reading there. They think:
Oh, so the agent got more complex, the prompt got longer, the workflow got more detailed.
But Alan says what is truly worth watching has never been the surface features; it is where the power is shifting.
In the past, for a complex task to be completed, there had to be someone to break down the requirements, someone to execute, someone to check, someone to redo, and someone to provide a safety net.
Now, what Anthropic is doing isn’t making the model more like a smart employee; it is letting the entire system begin to take over the authority to organize, supervise, and accept work that originally belonged to humans.
Harness is not a plug-in. Harness is the machine beginning to grow a “management layer.”
That’s what is truly frightening about it.
01 Not a tool, but “the layer that manages tools”
When many people see harness, their first reaction is: isn’t this just another agent framework?
This understanding is too superficial.
The essence of an ordinary tool is to follow commands and execute. You click it, it does something. If you don’t say anything, it remains still.
But the harness no longer follows this logic. What it really does is turn into software the division of labor that used to stay hidden inside human teams:
Who understands the requirements, who breaks them down into stages, who executes, who checks, who has the authority to send it back for rework after finding issues.
In other words, Anthropic isn’t piling on more functions but is writing into the system how to “organize work” itself.
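To make that concrete, here is a minimal sketch of what “writing how to organize work into the system” can look like. The role names (plan, execute, evaluate) and the stub checks are my own assumptions for illustration, not Anthropic’s actual harness code.

```python
# Hypothetical sketch: the "management layer" as code.
# Role names and stub logic are illustrative assumptions.
from dataclasses import dataclass, field


@dataclass
class Task:
    goal: str
    done_criteria: list[str]              # what "finished" means, stated up front
    attempts: int = 0
    artifacts: list[str] = field(default_factory=list)


def plan(goal: str) -> list[Task]:
    """Break a vague requirement into tasks with explicit acceptance criteria."""
    return [Task(goal=goal, done_criteria=["output is non-empty"])]


def execute(task: Task) -> str:
    """Produce work for one task (in a real harness, a model call)."""
    task.attempts += 1
    return f"draft #{task.attempts} for: {task.goal}"


def check_criterion(criterion: str, output: str) -> bool:
    """Stand-in for real checks (tests, clicks, API probes)."""
    return bool(output)


def evaluate(task: Task, output: str) -> bool:
    """A separate role judges the output against the criteria, not the doer."""
    return all(check_criterion(c, output) for c in task.done_criteria)


def run(goal: str, max_retries: int = 3) -> list[Task]:
    """Assign, check, and send back for rework: organization, not just labor."""
    tasks = plan(goal)
    for task in tasks:
        while task.attempts < max_retries:
            output = execute(task)
            if evaluate(task, output):     # acceptance is a gate, not a feeling
                task.artifacts.append(output)
                break
    return tasks


if __name__ == "__main__":
    print(run("build a signup page"))
```

The point of the sketch is not the trivial logic; it is that who breaks work down, who executes, and who sends it back are now positions in a program rather than people in a room.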
Why is this step important? Because the hardest thing to replicate in the past has never been single-point capabilities but organizational capabilities.
Many people can write code.
But few can organize a dozen people, several steps, and multiple rounds of rework to ensure stable delivery.
What harness is touching is precisely this most valuable aspect.
Tools improve efficiency; organization determines output.
An individual model is just labor; Harness has begun to touch on company structure.
When AI not only can do the work but also starts to divide tasks, transition responsibilities, and hold accountability, it is no longer just a “tool upgrade.”
02 Not smarter, but less likely to fail
The most misleading thing about models is that they always look very intelligent on short tasks.
Ask it a question, and it answers coherently; ask it to write a piece of code, and it often looks good. Thus, many people mistakenly think: since it can handle short tasks, long tasks are just about running longer, right?
Not at all.
The real difficulty of long tasks has never been that a certain step cannot be done but rather maintaining coherence, control, and avoiding self-deception after several dozen steps.
Humans face the same issue in projects. The biggest fear is not the inability to do something but rather getting lost further into the process:
Requirements become unclear,
Goals start to drift,
Logic becomes inconsistent,
In the end, what people excel at is not completing the task but writing a summary that looks like it’s finished.
The core issue mentioned in Anthropic’s article is fundamentally this:
Models gradually lose the thread in long tasks. The longer the context runs, the messier their state becomes, and the easier it is to slide into the illusion of “close enough.”
The value of the harness is not in making the model smarter but in making it less scattered, less vague, and less able to muddle through.
Breaking down phases, making handovers, defining contracts, conducting independent evaluations, and rolling back failures may seem like process details, but they are all addressing the same underlying issue:
Intelligence can be unstable, but delivery cannot rely on luck.
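As a rough illustration of those process details, here is a hypothetical phase loop with handoffs, contracts, and rollback. The checkpoint and contract shapes are assumptions made for the sketch, not the mechanism Anthropic describes.

```python
# Hypothetical sketch: phases, contracts, and rollback for a long-running job.
import copy


def run_phases(phases, state, max_retries=2):
    """Run phases in order; each phase must satisfy its contract or be rolled back."""
    for phase in phases:
        checkpoint = copy.deepcopy(state)          # last known-good state
        for attempt in range(max_retries + 1):
            state = phase["work"](state)           # generator: do the phase
            if phase["contract"](state):           # independent check, not self-praise
                state["handoff"] = f"{phase['name']} done on attempt {attempt + 1}"
                break                              # hand off to the next phase
            state = copy.deepcopy(checkpoint)      # failed: roll back and retry
        else:
            raise RuntimeError(f"phase {phase['name']} never met its contract")
    return state


if __name__ == "__main__":
    phases = [
        {
            "name": "scaffold",
            "work": lambda s: {**s, "files": ["app.py"]},
            "contract": lambda s: "app.py" in s.get("files", []),
        },
        {
            "name": "tests",
            "work": lambda s: {**s, "tests_pass": True},
            "contract": lambda s: s.get("tests_pass", False),
        },
    ]
    print(run_phases(phases, state={}))
```

Nothing here makes the model cleverer; it only makes drift, vagueness, and “close enough” harder to get away with over dozens of steps.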
So if you really want to understand harness, you must first comprehend one thing:
What will truly be valuable in the future is not who can occasionally produce an impressive demo,
but who can keep pushing the system forward for hours, days, or even longer without it falling apart.
Being able to write isn’t impressive.
What’s impressive is finishing without collapsing at the end.
A sudden burst of inspiration isn’t valuable; stable delivery is valuable.
03 The harshest cut is not allowing it to praise itself
Alan says the coldest cut in Anthropic’s article isn’t the planner, nor the generator, but the evaluator.
Why?
Because large models have a fault that is extremely similar to humans: they always think their own work is okay.
As long as there are no external constraints, it easily gives a self-assessment of “overall good,” “basically complete,” or “core functions are already in place.”
The problem is that these assessments are often not lies but a form of systemic self-indulgence.
In human companies, why do many projects end up failing?
Because those doing the work are often the best at finding excuses for themselves.
The doers claim it’s almost done,
The evaluators are too lazy to look deeply,
So a “close enough” product gets passed along and ultimately explodes in the users’ hands.
One of the ruthless aspects of Anthropic is that it directly separates these roles:
The doers are one role,
The error checkers are another role.
The former is responsible for pushing forward, while the latter is responsible for skepticism.
The logic behind this is very profound:
Once production rights and evaluation rights are separated, the system begins to form a true closed loop.
Moreover, what’s even scarier is that Anthropic doesn’t just let the evaluator say a vague “I think this part is bad.” It is working to give the error-checking real structure:
Functions need to be tested, pages need to be clicked, interfaces need to be checked, database statuses need to be reviewed, and design quality is also broken down into measurable dimensions.
What does this mean?
It means that many judgment powers previously mystified by humans are gradually being dismantled into processes, standards, and thresholds.
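A hypothetical sketch of what that dismantling can look like: acceptance turned into explicit checks that return evidence rather than opinions. The specific checks (a test run, an HTTP probe) and their names are my assumptions, not Anthropic’s rubric.

```python
# Hypothetical sketch: "does this count as done?" as measurable dimensions.
import subprocess
import urllib.request


def check_tests_pass() -> tuple[bool, str]:
    """Run the test suite and keep the output as evidence."""
    try:
        result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
        return result.returncode == 0, result.stdout[-500:]
    except FileNotFoundError:
        return False, "pytest not installed"


def check_page_responds(url: str = "http://localhost:8000/") -> tuple[bool, str]:
    """'Click the page': the endpoint must actually answer."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200, f"HTTP {resp.status}"
    except OSError as exc:
        return False, str(exc)


def acceptance_report(checks) -> dict:
    """Every dimension gets a verdict and a piece of evidence, not a feeling."""
    results = {name: fn() for name, fn in checks.items()}
    return {
        "passed": all(ok for ok, _ in results.values()),
        "evidence": {name: note for name, (ok, note) in results.items()},
    }


if __name__ == "__main__":
    print(acceptance_report({
        "tests": check_tests_pass,
        "page": check_page_responds,
    }))
```

Once the verdict comes from a report like this, the human who used to pronounce “this counts as done” no longer owns that judgment.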
What gets automated first is often not physical labor but the act of nitpicking.
Once “does this thing actually work?” becomes a process, a lot of people’s expertise starts to drain away.
In the past, many positions were valuable not because they produced anything but because they held the authority to say whether something counted as done.
Now, that power is beginning to loosen from human hands.
04 What gets eaten first isn’t programmers, but “close enough”
When people see an article like this, the reflex reaction is: are programmers finished?
Alan says that question is shallow and lazy.
The first wave of what Harness is consuming isn’t a specific job title.
What it first consumes is a way of survival that has long existed and is very common in almost all knowledge work:
Unclear requirements, just start working;
If things go awry midway, fix it later;
Results are mediocre, but it can run;
Documentation isn’t clear, but everyone in the team understands;
Launch first, fix issues later.
In simple terms, this is a whole set of work methods based on ambiguity and human flexibility.
Many projects continue to move forward not because the processes are genuinely clear, but because there are always people relying on experience, stepping in, and making ad-hoc judgments to fill the gaps.
What Harness is doing is precisely the opposite.
It is compressing the ambiguity space.
It is compressing the excuse space.
It is compressing the survival space of “I think,” “close enough,” and “should be okay.”
First define what “done” means for this round before allowing work to start;
If it doesn’t meet standards, send it back;
If it doesn’t pass evaluation, keep working on it;
Don’t rely on feelings, but on evidence.
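As a tiny illustration of the first rule above, here is a sketch in which a round simply refuses to start until “done” has been defined in checkable terms. The function and field names are hypothetical, chosen only for the example.

```python
# Hypothetical sketch: no definition of done, no work.
def start_round(goal: str, done_criteria: list[str]) -> dict:
    """Gate the round on an explicit definition of done, not on feelings."""
    if not done_criteria:
        raise ValueError(f"refusing to start '{goal}': no definition of done")
    return {"goal": goal, "done_criteria": done_criteria, "status": "in_progress"}


if __name__ == "__main__":
    round_spec = start_round(
        "add password reset",
        done_criteria=[
            "reset email is actually sent in a test run",
            "expired tokens are rejected",
            "all existing tests still pass",
        ],
    )
    print(round_spec)
```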
Once this logic takes hold, the ones most at risk aren’t the best coders but those who rely heavily on gray areas to survive.
Harness isn’t eating programmers; it is first eating ambiguity.
Not everyone will be replaced, but every position that survives on ambiguity will first depreciate.
Many positions that previously thrived on information asymmetry will struggle against explicit standards in the future.
05 Why it has suddenly become popular now
Many people will ask: others have built similar workflow-based systems before, so why is everyone taking it seriously this time?
Because the underlying models weren’t strong enough before.
To put it bluntly:
Many of these frameworks looked beautiful but were heavy and ended up being insufficiently robust.
You built a bunch of processes, piled up a bunch of roles, wrote a bunch of rules, only to package an unreliable model into a more complex but still unreliable system.
So it’s understandable that many people lost patience with agents, workflows, and scaffolds in the past.
It wasn’t that the direction was wrong, but that the foundation hadn’t reached that stage.
Now it’s different.
Once the model crosses a certain threshold, many processes that originally seemed decorative begin to release real value for the first time.
Because when the underlying model is strong enough, the processes are no longer supporting a failure but amplifying a system that is already capable of continuous operation.
This is why harness suddenly seems “a bit real” now.
It’s not that its concept has just emerged today; it’s that the model has finally become strong enough to reap the benefits of the process.
Alan’s statement is very apt:
Model capability is the engine, and Harness is the transmission.
Without a good engine, even the best transmission is just decoration.
But when the engine is powerful enough, the transmission begins to determine who can go fast and who is still just revving in place.
So this wave isn’t merely a technological trend but is sending a deeper signal in the industry:
Future competition will not only be about who has a stronger model but also about who integrates the model into the production system first.
06 “Humans are assumed to be in the middle”
Finally, Alan set down his cup and said the coldest line of the day:
“In the past, humans monitored software to get work done; in the future, software will monitor software to get work done.”
Why does this statement sting?
Because it reveals that what harness truly rewrites is not a specific role but a deeper premise that few have questioned in the past:
In digital labor, it is assumed that there should be a human standing in the middle.
They break down tasks,
They monitor progress,
They judge quality,
They coordinate rework,
They provide the final safety net.
This “human standing in the middle” could be called a programmer, a PM, a TL, a design lead, a QA, or a project manager.
The name doesn’t matter.
What matters is that the entire digital production system has historically relied on this human hub.
What Harness truly impacts is this central position.
It’s not saying humans should be immediately pushed out, but is gradually proving that:
Some decompositions can be done systematically,
Some oversight can be done systematically,
Some acceptance can be done systematically,
And some rollbacks and retries can be handled before a human even notices there was a problem to address.
As this becomes increasingly proven, the human position won’t disappear overnight but will begin to sink.
From being the assumed center to becoming an exception that intervenes;
From monitoring the entire process to only handling edge cases;
From being the master of the process to becoming the observer of the process.
This is what harness is truly consuming.
Not programmers.
Not product managers.
Not QAs.
But the deeper assumption behind these roles:
Humans are the center of the process by default.
And once this premise starts to loosen, the story that follows will be entirely different.
In the age of tools, it was about who could use the tools better.
In the Harness era, it’s about who accepts earlier:
That they are no longer naturally at the center of the system.