After AI devours everything, what remains untrainable?

Lead-in: As AI capabilities continue their relentless ascent, a new pessimistic sentiment is emerging in the investment world: if models keep growing stronger, all application-layer companies will eventually be swallowed by model and compute layer giants like Anthropic, OpenAI, and Nvidia, leaving only frontier models, raw compute power, and a few key infrastructure players. But Sarah Guo argues that this judgment gets only half of it right. Indeed, "thin wrapper" applications—those merely wrapping basic models—are being absorbed. Any task that can be benchmarked, trained on public data, and validated at low cost will gradually become commoditized.

The real question is: after AI consumes everything trainable, what remains untrainable?

The answer lies in value embedded within real organizations, difficult to replicate from outside: proprietary enterprise data, complex workflows, user trust, system permissions, domain-specific judgment, compliance responsibilities, and experience accumulated through long-term operations. Models may grow smarter, but they cannot automatically enter a bank’s production systems; they may generate medical answers, but cannot instantly earn doctors’ trust or gain access to hospital decision-making pipelines; they may draft legal documents, but cannot assume liability for seasoned lawyers, nor can they arbitrarily define what constitutes qualified legal work.

Thus, the truly defensible AI companies of the future won’t just be smarter than general-purpose models—they’ll deeply embed themselves within specific industries, performing the hard but critical “translation” work: transforming clients’ private realities, tools, processes, and judgment criteria into systems actionable by models, and iteratively defining what “good outcomes” actually mean over years of service. The stronger AI becomes, the more it devalues measurable, replicable tasks—and the more it highlights the irreplaceable value of those intangible assets rooted in history, relationships, permissions, and expert judgment. This is the true residual value that survives even after the model takeover.

Below is the original text:

In mid-2026, investors are experiencing a version of “AI psychosis”: an existential despair that nothing worth investing in remains. We should, it seems, pour all our capital into Anthropic and Nvidia and go home to sleep. But I’ve never felt that way. For several iterations now, I’ve been convinced that models already outsmart me; at a fair price, I’d gladly buy into Anthropic and Nvidia; and my smartest friends are equally certain that model self-improvement will soon become real. Yet I still don’t feel that despair.

This despair isn’t foolish. Its logic goes like this: if models keep getting stronger across every dimension, then all companies built on top of them are just thin shells waiting to be absorbed; ultimately, only raw compute and frontier model weights will survive.

Software is the case this despair narrative leans on most. When Devin launched in 2024, it could solve only about 13% of the tasks on a standard software engineering benchmark, so the market largely dismissed it. A year and a half later, the strongest agents score over 80% and are already handling real work inside Goldman Sachs and the U.S. Army. Almost everyone drew the same wrong conclusion: models have eaten software engineering.

But once models have consumed the most easily measurable portion of software engineering, we’re relearning something teams have known all along: engineering has always resisted measurement, and the easiest-to-measure parts aren’t necessarily the most important ones.

MIT’s Mert Demirer and his collaborators have finally quantified this: among over 100,000 developers, the latest generation of coding agents increased code-writing volume by about 180%, but actual delivered production code only rose by around 30%. Writing code got cheaper—but the remaining steps still require human hands, and those steps matter profoundly. Of course, the net impact remains staggering.
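The gap between those two figures can be made concrete with a back-of-envelope calculation. This is only a sketch: it assumes both percentages are increases over the same pre-agent baseline, which the text does not confirm, and the function name is made up for illustration.

```python
# Back-of-envelope: what the two figures imply about how much written code ships.
# Assumption (not stated in the source): both percentages are increases
# relative to the same pre-agent baseline of 1.0.

def relative_yield(written_increase_pct: float, shipped_increase_pct: float) -> float:
    """Shipped code per unit of written code, after agents vs. before."""
    written_multiple = 1 + written_increase_pct / 100   # +180% -> 2.8x
    shipped_multiple = 1 + shipped_increase_pct / 100   # +30%  -> 1.3x
    return shipped_multiple / written_multiple

yield_ratio = relative_yield(180, 30)
print(f"Shipped code per unit written: {yield_ratio:.2f}x the old rate")
```

Under that assumption, each unit of written code turns into a bit less than half as much shipped code as before, which is one way to read the claim that the remaining steps still need human hands.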

Benchmarks are things you can measure, and anything measurable can be used for training. That is why coding agents matured first: compilers serve as free validators, and so do test suites. When answers can be checked automatically at nearly zero cost, you can keep optimizing against that signal until you break through.
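A toy illustration of why a free validator matters: if a test suite can accept or reject any candidate automatically, wrong attempts cost almost nothing to discard. The sketch below is hypothetical throughout. The "candidates" are hand-written stand-ins for model outputs, and `passes_tests` stands in for a real test suite; it does not represent any lab's training pipeline.

```python
# Illustrative only: a cheap automatic verifier filters candidates
# without a human in the loop.
from typing import Callable, Iterable, List, Optional

SortFn = Callable[[List[int]], List[int]]

def passes_tests(candidate: SortFn) -> bool:
    """Stand-in for a test suite: checks a candidate sort function on fixed cases."""
    cases = [([3, 1, 2], [1, 2, 3]), ([], []), ([2, 2, 1], [1, 2, 2])]
    try:
        return all(candidate(list(inp)) == expected for inp, expected in cases)
    except Exception:
        return False  # a crashing candidate counts as a failure

def first_verified(candidates: Iterable[SortFn]) -> Optional[SortFn]:
    """Return the first candidate the verifier accepts, or None if none pass."""
    for candidate in candidates:
        if passes_tests(candidate):
            return candidate
    return None

# Hypothetical "model outputs": two wrong attempts and one correct one.
attempts: List[SortFn] = [lambda xs: xs, lambda xs: xs[::-1], lambda xs: sorted(xs)]
winner = first_verified(attempts)
```

The same pass/fail signal that filters candidates here is the kind of signal that can be optimized against at scale. The paragraphs that follow are about work where no such cheap check exists.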

But passing tests doesn’t mean the change is correct for a decade-old codebase. That module exists for three reasons no one wrote down; the deployment pipeline might barely hold together thanks to a cron job nobody admits writing.

This kind of correctness cannot be read from leaderboards—or even directly from any artifact. You can only know whether such a complex system truly works by running it long enough in the real world. And smarter models don’t make real-world operation faster. No one trusts a system as large as Google just because unit tests pass and green checkmarks appear. You trust it because it has endured years of real load.

This kind of correctness is not only private; it is also a moat that builds slowly, on a clock that capital cannot compress. Even optimists admit this clock cannot be skipped. Noam Brown, a pioneer of reasoning models at OpenAI, recently wrote that the only reliable way to assess an agent’s performance over a year may be to let it actually run for a year.

As Gabe Pereyra put it, true automation isn’t just about models getting stronger. It’s about products, models, workflows, and company organizations evolving together—and three of these four evolve at the pace of the organization.

Moving people is beyond any benchmark’s reach: convincing a skeptical partner to change her workflow, or keeping a team together through a reorganization. This is why, when hiring CEOs, we value their ability to handle people at least as much as their analytical prowess. Smarter models don’t alter this weighting.

The feedback here is ambiguous and spans years, and the trust involved belongs to a specific individual. Every company I know has already equipped each engineer with cutting-edge coding models, but none has seen its engineering organization evolve at anything close to the speed of model advancement. Adopting the tools took just one quarter (a magical quarter for token growth!), but real rebuilding takes years.

Measurable work is disappearing. Truly valuable work is structurally unreadable: anything you can put on a leaderboard can be trained on; thus, anything measurable is already heading toward commoditization. This process takes time and never completes fully—but its direction never reverses.

As my friend Matt MacInnis of Rippling puts it, reframing it in monetary terms: a token used solely to answer a generic question is nearly worthless, since any model can answer it; but a token that performs inference over your company’s proprietary data is far more valuable, because it does what you actually want—not just generates a plausible-sounding answer.

Readable work is being consumed from two directions.

From below: tasks saturate. Once a task can be cheaply verified, buyers stop caring which model performed it and start asking only how much it costs. The task then falls to the cheapest open-source or distilled model available that week. As long as margins matter, those cheaper models will eventually dominate such tasks.

From above: labs are trying to make models swallow their own scaffolding. Retrieval, routing between cheap and expensive calls, tool usage, even reasoning strategies—all the mechanisms once wrapped around models—are being pulled into model weights until the “shell” itself becomes the model. This is the absorption boundary.

Profit pressure acts in another direction too: a general-purpose agent must remain ready for anything, so its costs are high, whereas a focused application can optimize a workflow to consume only a tiny fraction of the tokens. Unlike the labs selling those tokens, application companies can keep the margin in between.

Thus, we can ask two questions of any task. Is its correctness private and costly to establish, a truth that exists only in a company’s internal data? And is the task sealed inside a system that outsiders cannot access? Pair these with the task’s saturation level, and you get a 2×2 matrix.

Saturated, publicly verifiable work is the domain of commoditized tokens—open-source models will occupy it. Frontier but publicly verifiable work—like coding benchmarks—is where labs win, because when evaluation is free, owning it is meaningless.

The real prize lies in the final corner—the “untrainable” one: frontier work whose correctness exists only in private environments. You see this clearly in AI-native inference clouds serving early adopters: the vast majority of tokens come from custom models, not general-purpose open-source ones.
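The quadrants can be written out as a small lookup. This is a sketch for orientation only: the names `Quadrant` and `classify` and the two boolean inputs are illustrative, and the fourth quadrant (saturated but privately verifiable) is not characterized in the text, so it is labeled as such instead of being filled in.

```python
# A minimal sketch of the 2x2: saturation level vs. where correctness can be verified.
from enum import Enum

class Quadrant(Enum):
    COMMODITY_TOKENS = "saturated, publicly verifiable: commodity tokens, open-source models"
    LAB_TERRITORY = "frontier, publicly verifiable: where labs win"
    UNTRAINABLE = "frontier, privately verifiable: the 'untrainable' corner"
    NOT_DESCRIBED = "saturated, privately verifiable: not characterized in the text"

def classify(saturated: bool, private_correctness: bool) -> Quadrant:
    """Place a task in the matrix given its saturation and where its correctness lives."""
    if private_correctness:
        return Quadrant.NOT_DESCRIBED if saturated else Quadrant.UNTRAINABLE
    return Quadrant.COMMODITY_TOKENS if saturated else Quadrant.LAB_TERRITORY

# Example: a public coding benchmark that is not yet saturated.
benchmark_quadrant = classify(saturated=False, private_correctness=False)
```

In practice neither axis is binary, and the next paragraph notes that the wall around the private corner varies in height. The booleans are a simplification.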

The wall leading to this last corner varies in height. A developer’s toy codebase is portable and standardized—easy to infiltrate. But a bank’s production system is neither portable nor standardized. You won’t gain root access just because you scored 2% higher on SWE-Bench Verified.

Capability eats much, but better models don’t turn private, real-world standards into public ones. They don’t hold licenses, sign responsibility waivers, or own corporate documents; when answers fail, they can’t be sued. The bottleneck isn’t intelligence—it’s permission and accountability. You can imagine a model vastly smarter than anyone else, yet it still needs to be allowed in—and someone must still sign their name to what it does.

That door has a lock and a latch.

The lock is environment: only after gaining trust within a system, passing security reviews, completing integration, and signing contracts bearing outcome liability, can you verify whether AI actually did something useful.

The latch is the user. Today, most American physicians open OpenEvidence daily, which is not something any amount of compute can buy. A lab could train a perfect medical model tomorrow, but it still couldn’t enter doctors’ habits or UCSF’s decision-making process. Trust is built slowly, through relationships and users’ tacit consent, and gradient descent cannot shortcut either.

This is precisely the work of application companies. An app occupies the “untrainable” corner not through flashy innovation, but through unglamorous labor: mapping a company’s private reality, enabling models to act upon it; handing action tools to models; co-evolving with clients to reshape how their workforce actually operates.

A company capable of this “translation” is hard to copy, and the translation never ends. Integration and maintenance persist alongside customer relationships. Success goes to teams that place domain-specialized engineers and tools beside their clients.

For example, a top-tier legacy law firm handles close to a thousand M&A deals a year. You can’t have hundreds of paralegals download client files to their desktops and hand them to a general-purpose agent to read. Confidentiality alone forbids it, let alone dozens of other issues. Even if you could, you’d only capture fragments: each paralegal handles one piece at a time, and no one sees how an entire transaction flows.

The real signals exist at the transaction level. Each transaction has its own shape: for M&A, it’s NDA, term sheet, due diligence, purchase agreement, ancillary documents, closing checklist; for IP litigation, it’s motions, discovery, prior art, more motions. Each domain has its structure—lawyers and tools cannot be freely interchanged.

And the firm’s real problem sits even higher: how to simultaneously manage every business line, like a top partner juggling hundreds of matters while bringing in new clients and mentoring junior associates. Transforming such a firm isn’t a single problem you can write an evaluation task for. It requires a master strategist playing “data baseball”: intermediate goals are extremely fuzzy, feedback incomplete, cycles extremely long, and the environment itself never stands still.

Unfortunately, unreadable value is also hard to sell, for the same reason it is hard to commoditize: from the outside, a company cannot judge whether AI will truly transform its operations the way benchmark results suggest. So the strongest companies stop trying to prove themselves from the outside and instead enter customers’ inner worlds, pricing on results.

Sierra charges only when its agent solves the client’s problem; if the conversation is handed off to a human, it charges nothing. Price thus becomes the evaluation mechanism. This works because Sierra controls the definition of “resolved.” Cognition’s Devin did the same in software with “performance guarantees.” You can offer such assurance only once you are trusted inside a system.
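The pricing rule described here can be sketched in a few lines. This is not Sierra’s actual billing logic: the `Conversation` record, the `resolved_by_agent` flag, and the per-resolution price are all hypothetical, chosen only to show how billing collapses onto whoever defines “resolved.”

```python
# A minimal sketch of outcome-based pricing: bill only for agent-resolved conversations.
from dataclasses import dataclass
from typing import Iterable

@dataclass
class Conversation:
    conversation_id: str
    resolved_by_agent: bool  # False means the case was handed off to a human

def invoice_total(conversations: Iterable[Conversation], price_per_resolution: float) -> float:
    """Charge a flat (hypothetical) price per resolved conversation; handoffs are free."""
    resolved = sum(1 for c in conversations if c.resolved_by_agent)
    return resolved * price_per_resolution

batch = [
    Conversation("a", resolved_by_agent=True),
    Conversation("b", resolved_by_agent=False),  # handed off, so not billed
    Conversation("c", resolved_by_agent=True),
]
total = invoice_total(batch, price_per_resolution=1.50)
```

Everything in the invoice hinges on how `resolved_by_agent` gets set, which is why controlling that definition matters more than the arithmetic.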

Even at the token-service layer—what everyone loves to call pure commodity—performance isn’t truly commodity-like. Top AI-native companies concentrate services with one or two vendors, like Baseten or Fireworks. While per-token cost trends toward commoditization, reliability under real traffic and stable access to scarce compute do not. Where you host inference and which models you use are separate decisions. The only part of inference that’s truly commodity-like is price.

A common rebuttal is: “The lab is your supplier. Why wouldn’t it undercut you with its own first-party product and drive you to extinction? Or simply revoke your API access and take the market?” This is the real version of that despair. But it only holds if the model layer is a single-player game.

Clearly, it’s not. The model layer resembles a death match among three-and-a-half players, with a group of international competitors lagging by about six months, plus a development league five times larger than last year’s. Clients want competition among suppliers; labs want market share, not the destruction of any particular application.

You can see this in markets where labs compete head-on. In consumer chat scenarios, the best models have never simply dominated the entire market. ChatGPT has led for years through real competition; its recent share loss went to Gemini—due not to model superiority, but Android and search distribution advantages. Anthropic is currently perceived to have the best models in prediction markets and internet sentiment, yet it’s hardly a major player in consumer chat—instead building strong businesses in enterprise and coding contexts.

If a better model can’t dislodge users from core applications, it won’t easily eat a hospital’s EHR system or a bank’s liability framework. Today, public product choice depends on more than just coding ability. If the frontier model layer remains crowded, the application layer above it still holds value.

If a task can’t be scored externally, someone internally must decide what counts as a good answer. And that decision is the game itself. Enough such decisions written down become benchmarks. Harvey released legal benchmarks; Sierra released voice-agent benchmarks. You get the right to define “good” in a domain because the domain already uses you. These companies earned that right through arduous struggles in real adoption.

The evaluation that actually determines where money flows is private and formed company by company: what this company, on this matter, accepts as good work. That process is far from complete, because the depth of legal work exceeds any public test. OpenEvidence, similarly, is crystallizing what safe clinical answers look like.

None of this is truly “measurement”—it’s about judgment on what is real, what is good. These judgments are written down until they become standards others must accept. No matter how smart a foundational model lab becomes, it cannot conjure these standards ex nihilo, because such authority resides only within the domain.

This authority often lands where it already existed. Senior lawyers write legal benchmarks. Doctors define safe clinical answers. What “resolved” means is decided by the company already embedded in customer relationships.

The absorption boundary will keep rising as we learn to measure more kinds of work, and measurable things get consumed. The ground you stand on keeps shrinking, so you can’t find a defensible position and stop. You must keep moving toward places that are still unscorable, and continuously reassess and re-insure against your risk.

On a narrow task, leveraging your private data and internal evaluation system, you can train to frontier levels and beat general models in critical scenarios; this specialized model becomes part of your moat. Conversely, if you’re competing on general model capability, it’s a capital war—you lose to whoever commands the most compute. This is exactly the trap facing companies with shallow access and highly scorable tasks.

When a company decides to train beyond frontier models across a broad range of general tasks just to survive, victory is usually already determined by datacenter scale. The end result rarely produces an independent champion—it’s usually acquisition by some compute-rich player.

All of this is defense. The harder part is offense: first deciding what to build. This is what I’ve been searching for this year—and found only three times. Models can’t help with this. Point them anywhere, they’ll do it; but they can’t tell you what’s worth pointing at. You can’t build benchmarks for it, so you can’t train on it.

This is why incumbents won’t take everything: they defend what they already own, while the next big thing comes from someone who discovers use cases before others. Perhaps intention is a rarer input than compute.

This despair is half-right. Thin wrappers are indeed being absorbed, and many things that look like companies today are just thin wrappers. But the judgment about what remains afterward is wrong. The mechanism is clear, but the endpoint isn’t.

I’m willing to bet on this direction: intelligence will keep getting cheaper, and value will keep shifting toward the few places models can’t reach. Untrainable things carry historical value.

So enter one of these domains, do the unglamorous translation work, and begin writing down what “good” means there. Because someone will do it. The most frequently cited benchmark score this year is actually a map heading toward obsolescence—a notice to some: your power to define what “good” means is about to be lost.

Original: BlockBeats

Disclaimer: Contains third-party opinions, does not constitute financial advice