Heads up — this is a long one. Several months of learning and experimentation, all in one place. I almost broke it into a series and then couldn't figure out where to cut, so here it is, whole.
The dominant mental model of progress is linear. We set a goal, we execute a plan, we move from A to B. Time flows in one direction. Tasks have a start and an end. Strategy is the art of plotting the right path before you walk it.
Every chart we use to measure success has time on the horizontal axis. We move through it in one direction, accumulating progress on the vertical. Clean. Predictable. Wrong.
The runner who shaves two minutes off her marathon did not improve in a straight line — she ran, noticed what broke down at kilometer thirty, fixed her form, ran again. The startup that found product-market fit did not plan its way there — it shipped, watched what users ignored, rebuilt the thing nobody asked for into the thing everyone needed. The surgeon who became exceptional did not read more textbooks — he operated, debriefed, adjusted, operated again.
Underneath every compounding skill, every improving team, every business that keeps getting better — the same hidden structure. Do. Observe. Adjust. Repeat. A loop.
We never had to think much about this. The loop just ran, quietly, in the background — powered by human memory and habit and the slow accumulation of experience. What mattered was talent and effort. The loop took care of itself.
Then AI agents arrived. And the loop stopped being background infrastructure. It became the main event.
Because AI agents do not just make individual tasks faster. They make loops faster — and suddenly the quality of your loop is the lever that determines everything. Same model, same tools, same talent. The gap opens in the loop.
Quick definition before we go further, because the word "loop" is doing heavy lifting in this essay. A loop is a business task purposely designed in a circular shape — do something, observe what happened, validate it against where you wanted to be, feed that back into the next attempt. Each turn is informed by the last. Each turn gets a little closer. Iterative by design. Exponential by accident.
AI agents are part of a loop. They aren't the loop themselves. The loop is the system you build around them — the goal, the visibility, the rails, the checks. The agent is the engine that runs inside it.
What's new is that this engine doesn't need a human pressing Enter at every step. Loops used to be paced by people — by the hours in your week, the meetings on your calendar, the next morning's standup. Now they can be paced by software. Run overnight. Run while you sleep. Run a hundred in parallel. AI agents put loops on steroids — same iterative shape, just compounding at machine speed instead of meeting speed.
Which brings us to the uncomfortable bit. The frontier model on your laptop is the same one your competitor has. Same price, same context window, same weights. The thing that's different is the loop you wrap around it. And the loop is mostly you.
A. Every job is a loop wearing a costume
Six months ago I started paying close attention to two engineers on my team. I'll call them Person A and Person B — both still on the team, both still doing great work, so I'm keeping names out of this.
Person B was the one you'd hire first. Deep technical precision, the kind that takes years to build. He knew the codebase the way a surgeon knows anatomy — which module touched which, which change would cascade where, which edge case would surface three weeks later in production. Reputation in the wider developer community. Other engineers respected him because his work was correct. Before anything went live, he tested it. He thought through failure modes. He moved carefully, and the things he shipped rarely broke.
Person A was good but not exceptional by those same measures. His instinct was always to zoom out — systems, interfaces, integrations. How does this piece talk to that one? What does it expect back? He moved faster and made more mistakes. His PRs got more comments. But he was also almost always ready with a fix. The iteration was built into how he worked. Imperfection wasn't a problem to eliminate before shipping — it was a signal to act on after.
For most of engineering history, Person B was the more valuable hire. Precision compounds in a world where execution is expensive and bugs are embarrassing. B's perfectionism was a feature.
Then agents arrived. And something I didn't expect happened.
Person A took to them immediately. Not because he was more technically sophisticated — because agents are imperfect by nature. They make mistakes. They drift. They produce something eighty percent right and need steering for the last twenty. That's exactly the environment A had always worked in. His defense mechanisms — iterate fast, catch issues early, ship and fix — translated directly into how you work well with an agent. He started building workflows around them. Tooling. Scripts. Ways to catch drift before it compounded. He didn't wait for the agent to be perfect. He built a loop around its imperfection.
Person B found it harder, and I want to be careful here because his instincts had always been a strength. Every time an agent made a mistake, it registered as unreliability rather than a normal iteration signal. He kept holding the higher bar. He kept using the tools, but more carefully, more skeptically — closer to the way you'd use a capable assistant than the way you'd run a system. The agent would do a task. He would review it. Then the next task.
Six months later, the gap between them is significant. Not in intelligence. Not in raw technical ability — B is still the sharper engineer on that axis. The gap is in output. In velocity. In how much compounds week over week.
This is what a loop is. Not a methodology. Not a framework. A disposition — toward motion, toward correction, toward the next iteration rather than the perfect first one. Every job that produces anything has always been a loop. We just gave those loops different names: sprint, design thinking, build-measure-learn, PDCA, OODA. Same shape. Different consulting invoices.
What changed is not the shape. What changed is what can run it. And once you see what changed, the next question is: exactly when did it change?
B. The while loop just learned to write its own condition
Okay, here's the strangest thing about 2026.
The while loop. The most basic construct in computer science. The thing every freshman learns in week two. Quietly, it has become the most important idea in business.
For the first time, a general-purpose software loop can sustain itself autonomously across a meaningful horizon of work.
The while keyword first appeared in ALGOL 60.¹ Dijkstra's team in Amsterdam implemented it between November 1959 and August 1960. From that moment until about six months ago, every loop in your work needed a human to run it. A human had to decide what to do next. A human had to look at what happened. A human had to remember what worked last week. Humans are slow, expensive, and forgetful. So loops ran slowly.
That began to change well before 2026 — early agents could already handle narrow tasks on their own. But it crossed a threshold on February 5, 2026, when Anthropic shipped Claude Opus 4.6² — a model built to sustain an autonomous task over a long horizon. One million tokens of context. Agent teams. A model that doesn't just respond, but keeps going.
That sounds like a UX update. It is not. It is the moment the loop became something you could hand off.
One month later, on March 5, 2026, OpenAI shipped GPT-5.4 with native computer use.³ Agents that operate your desktop. They click, they scroll, they type, they verify. On OSWorld — the benchmark for using a computer — GPT-5.4 scored 75.0%, against a 72.4% human-expert baseline.³
* Replit Agent 3 launched September 2025, ahead of the main inflection
If the model releases tell you the line was crossed, the tooling releases tell you the entire industry agrees on what comes next. Look at what the leading coding tools each shipped between February and May 2026.
Cursor shipped Long-Running Agents on February 12.⁴ Multi-day autonomous runs. One developer initiated a 52-hour task that produced a pull request with 151,000 lines of code. Three weeks later, Cursor shipped Automations — agents triggered automatically by a code push, a Slack message, or a timer.
By early 2026, Replit's Agent 3 — which had launched the previous September — had already proven the self-testing loop concept.⁵ The agent builds your app, then tests it by clicking around in a browser the same way a human would, finds bugs, fixes them, and keeps going for 200 minutes without supervision.
Cline launched a CLI with YOLO mode — cline -y "task" and it runs to completion without asking permission.²² They open-sourced their agent SDK in May, with native sub-agents, agent teams, scheduled cron jobs, and checkpoints.
Claude Code shipped agent teams as a research preview.²³ One session acts as the team lead. Teammates work independently, each in its own context window, and message each other directly. Hooks fire at every lifecycle event so deterministic guardrails sit between the model and the work.
The labs are not only racing to make smarter models — they are racing to make the model loop better.
METR, an independent capability tracker, measures how long a task an AI can finish on its own. From 2019 to 2025, that horizon doubled every seven months. For the post-2024 period, METR's January 2026 Time Horizon 1.1 report measured the doubling time at roughly 4.3 months, already a dramatic compression from the prior seven-month pace.⁶ The UK AI Safety Institute's cyber capability evaluations showed the doubling time for autonomous cyber task horizons shrinking from 8 months (November 2025) to 4.7 months by February 2026.⁷ The curve itself is bending.
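To make the difference between those two doubling times tangible, here is a small arithmetic sketch. It assumes clean exponential growth, which simplifies the fitted trends in the reports, and the function name `growth_factor` is illustrative rather than anything METR publishes.

```python
def growth_factor(months: float, doubling_time_months: float) -> float:
    """Factor by which the task horizon grows over `months`,
    assuming steady exponential growth with the given doubling time."""
    return 2 ** (months / doubling_time_months)

# Compare the older seven-month pace with the roughly 4.3-month pace over one year.
for doubling_time in (7.0, 4.3):
    factor = growth_factor(12, doubling_time)
    print(f"doubling every {doubling_time} months: x{factor:.1f} per year")
```

Under that assumption, a seven-month doubling time works out to roughly 3.3x per year, while a 4.3-month doubling time works out to roughly 6.9x.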
"To get the most out of the tools that have become available now, you have to remove yourself as the bottleneck. You can't be there to prompt the next thing. The name of the game is how can you get more agents running for longer periods of time without your involvement, doing stuff on your behalf."¹⁶
For most of the history of computing, the bottleneck of every loop was a human pressing Enter. That bottleneck is dissolving.
But that creates a new question. If a loop can now run without you pressing Enter, what keeps it pointed at the right thing? What keeps it from drifting? What turns "this loop runs" into "this loop wins"?
The answer is four things. Together they form the anatomy of every working loop.
C. Four organs. Remove one and the loop dies.
I think of a loop as four organs. If any one is missing or weak, the loop limps. If all four are strong, it runs.
- Goals: What do I need done, and how will I know when it is?
- Observability: How do I give my agents access to every system they need to do the work?
- Guardrails: What rules and policies must my agents follow, without exception?
- Evals: How do I measure whether I am getting closer to my goals?
Each organ feeds the next. Goals tell the loop where it is going. Observability gives it the vision to see whether it is moving. Guardrails set the edges it cannot cross. Evals measure whether the movement is toward the goal — and that measurement loops back to refine the goals. Remove any one of the four and the loop either drifts, stops, or runs in the wrong direction. Here is how each one works in practice.
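As a rough illustration of how the four organs connect, the sketch below wires them into one plain Python loop. Every name in it (`run_loop`, `goal_met`, `observe`, `agent`, `allowed`, `apply`, `evaluate`) is hypothetical, and the toy latency world stands in for real work. Treat it as a sketch under those assumptions, not a description of any particular agent framework.

```python
from typing import Callable

def run_loop(
    goal_met: Callable[[dict], bool],   # goals: is the work done?
    observe: Callable[[], dict],        # observability: what the loop can see
    agent: Callable[[dict], dict],      # the engine: proposes the next action
    allowed: Callable[[dict], bool],    # guardrails: binary gate on every action
    apply: Callable[[dict], None],      # carries out an allowed action
    evaluate: Callable[[dict], float],  # evals: score recorded each turn
    max_turns: int = 20,
) -> list:
    scores = []
    for _ in range(max_turns):
        view = observe()
        scores.append(evaluate(view))
        if goal_met(view):
            break
        action = agent(view)
        if not allowed(action):
            # Fail loudly instead of drifting past a boundary.
            raise RuntimeError(f"guardrail blocked action: {action}")
        apply(action)
    return scores

# Toy world: drive P95 latency below 200 ms, in steps of at most 100 ms.
world = {"p95_ms": 350}
scores = run_loop(
    goal_met=lambda view: view["p95_ms"] < 200,
    observe=lambda: dict(world),
    agent=lambda view: {"reduce_ms": 50},
    allowed=lambda action: action["reduce_ms"] <= 100,
    apply=lambda action: world.update(p95_ms=world["p95_ms"] - action["reduce_ms"]),
    evaluate=lambda view: view["p95_ms"],
)
print(scores)
```

The design choice worth noticing is that the agent is only one argument among six. Swapping it out leaves the goal, the view, the gate, and the score untouched.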
01. Goals
A goal is not what the system is supposed to be. It is what the loop is supposed to do. This distinction matters more than it sounds.
"Build a great product" describes a system. "Reduce customer onboarding time from 14 days to 3 days by end of Q3" describes what the loop should do — and you can check whether it happened.
A few guidelines that separate a real goal from a wish:
It has a number. Not "improve performance" but "P95 latency under 200ms." Not "grow the business" but "$100M ARR by Q4 2027." If you can't attach a number, you haven't written a goal yet.
It has a time boundary. Open-ended goals give the loop no sense of urgency or direction. A deadline makes the goal checkable right now — are we on track or not?
More precise, verifiable goals are better — quantity matters less than clarity. A loop with twenty specific, non-conflicting goals can self-correct at twenty points. Think of them like user stories in a spec — each one narrow, each one observable. A loop with one vague goal has nowhere to course-correct until it's too late.
Goals describe outcomes, not tasks. "Write three blog posts a week" is a task. "Grow organic traffic to 50,000 monthly visits" is a goal. The loop decides the tasks. You write the goal.
Most loops fail here. Not because the work was wrong — because the goal was never written down in a form that anything could measure.
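One way to picture "a form that anything could measure" is a small record with a number, a deadline, and a check. The sketch below is illustrative only: the `Goal` class, its field names, and the specific dates are assumptions (the examples above give "end of Q3" without a year, and no deadline at all for the traffic goal).

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Goal:
    outcome: str                   # an outcome, not a task
    target: float                  # the number
    deadline: date                 # the time boundary
    lower_is_better: bool = False

    def is_met(self, measured: float) -> bool:
        if self.lower_is_better:
            return measured <= self.target
        return measured >= self.target

    def is_open(self, today: date) -> bool:
        """True while the deadline has not passed."""
        return today <= self.deadline

# Dates are placeholders chosen for the example.
onboarding = Goal("customer onboarding time in days", 3, date(2026, 9, 30), lower_is_better=True)
traffic = Goal("organic monthly visits", 50_000, date(2026, 12, 31))

print(onboarding.is_met(14), onboarding.is_met(3), traffic.is_met(62_000))
```

A wish like "build a great product" cannot be written in this shape at all, which is a quick test of whether something is a goal yet.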
02. Observability
Observability is the vision you give your loop. If the loop can't see what's happening, it can't learn. That's it. That's the whole rule.
Think of it this way: your agents should be able to see at least what you can see — ideally more. A developer's loop can see the codebase, the logs, the test results, the deployment status. A sales loop can see emails, call recordings, CRM notes, deal stages. A founder's loop can see commitments made, decisions taken, metrics moving.
What the loop cannot see, it cannot act on. And most work is less visible than people think. Code in a repo is observable. Emails are observable. The decision you made in a hallway conversation that nobody recorded is not. The insight you had on a walk and never wrote down is not. The context that lives only in your head — invisible to the loop, compounding nowhere.
The practical discipline of observability is making sure everything that matters lands somewhere retrievable. Not for process reasons. Because a loop that cannot see is blind — and blind loops drift.
03. Guardrails
Guardrails are the hard gates — the things the loop must never do, no matter what goal it is chasing.
Start with the non-negotiables. Not values posters. Actual hard constraints: do not contact this client without approval, do not access production data outside these hours, do not send an external communication without a review step. These are binary. The loop either crossed the line or it did not.
Below the hard gates sit preferences and softer constraints — the things you would rather not do, the margins you will not undercut, the tone you want maintained. These are not absolute, but they shape everything.
The counterintuitive thing about guardrails: more precise rails are better. Tight, specific constraints let the loop run fast inside them. Vague constraints are worse than none. A loop with vague guardrails hesitates or goes wrong quietly. A loop with precise guardrails moves fast and fails loudly when it hits a boundary — which is exactly what you want.
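A minimal sketch of what "precise rails that fail loudly" could look like in code, assuming actions are plain dictionaries. The rule functions, the `GuardrailViolation` exception, and the 09:00 to 17:00 access window are all hypothetical stand-ins for whatever your own hard gates say.

```python
from datetime import time

class GuardrailViolation(Exception):
    """Raised when an action crosses a hard gate."""

# Hypothetical access window; the real hours would come from your own policy.
PROD_ACCESS_START, PROD_ACCESS_END = time(9, 0), time(17, 0)

def client_contact_approved(action: dict) -> bool:
    return action["type"] != "contact_client" or action.get("approved", False)

def prod_access_in_hours(action: dict) -> bool:
    if action["type"] != "read_production_data":
        return True
    return PROD_ACCESS_START <= action["at"] <= PROD_ACCESS_END

HARD_GATES = {
    "no client contact without approval": client_contact_approved,
    "no production data outside access hours": prod_access_in_hours,
}

def enforce(action: dict) -> dict:
    """Return the action unchanged if every gate passes, else fail loudly."""
    for rule, passes in HARD_GATES.items():
        if not passes(action):
            raise GuardrailViolation(f"{rule}: blocked {action['type']}")
    return action
```

Each gate is binary and names the rule it enforces, so a blocked action says exactly which line it crossed instead of failing quietly.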
04. Evals
If observability is the vision you give your loop, evals are the measuring tape. They are how you know whether the loop is actually working — not just running, but producing the right outcomes.
An eval is an objective test the loop can run against itself. For a voice agent: did the first response arrive in under 500 milliseconds? For a clinical summarizer: did it capture every medication in the note? For a code agent: did the PR pass all tests without a human fix? Each one is a specific, measurable question with a yes or no answer.
The practical way to think about evals: they are the observable version of your goals. Your goal says where you are trying to go. Your evals tell you, at every step, whether you are getting closer. Write them like user stories with acceptance criteria — narrow, specific, checkable. A loop with twenty evals can self-correct at twenty points. A loop with no evals can only be judged at the end, when it is too late to course-correct cheaply.
In practice, evals operate at two levels. Step-level evals check individual actions — did this sub-agent do what it was supposed to? Trace-level evals check the whole run — did the loop achieve its goal? You need both, because regressions often hide inside a sub-step that looks fine in isolation.
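The two levels can be sketched side by side. This example assumes a run is recorded as a list of step dictionaries. The field names, the step names, and the medication list are made up for illustration, and it borrows the latency and medication questions from the examples above even though those belong to different kinds of agents.

```python
def eval_step(step: dict) -> bool:
    """Step-level: did this single action meet its own bar?"""
    return step["latency_ms"] < 500

def eval_trace(trace: list, medications_in_note: set) -> bool:
    """Trace-level: did the run as a whole capture every medication?"""
    captured = set()
    for step in trace:
        captured |= set(step.get("medications", []))
    return medications_in_note <= captured

def evaluate_run(trace: list, medications_in_note: set) -> dict:
    return {
        "steps": {step["name"]: eval_step(step) for step in trace},
        "trace": eval_trace(trace, medications_in_note),
    }

# Illustrative run: every step is fast enough, yet one medication is missing.
trace = [
    {"name": "transcribe", "latency_ms": 180},
    {"name": "summarize", "latency_ms": 420, "medications": ["metformin"]},
]
report = evaluate_run(trace, {"metformin", "lisinopril"})
print(report)
```

Here both step checks pass while the trace check fails, which is the kind of disagreement that only shows up when both levels are measured.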
Well-designed loops create their own eval scores over time — the system learns what good looks like from the history of its own runs. But do not wait for that. Give as many evals as you can from day one. The loop will surprise you with what it catches.
Now that we have the anatomy, the question is who owns it. That answer is reshaping every knowledge job.
D. If the loop does the work, what's your job?
So that's the four organs. Next question, and it's the one I don't see many people asking out loud:
Who owns them?
Not the executor — the loop executes. Not the manager — the loop doesn't need a manager. The person who owns the four organs is doing a job most companies don't have a name for yet.
This is the prediction this whole essay is built on:
An operator owns a loop. Their job is four things:
- Write the goal the loop is chasing.
- Make sure the loop can see everything it needs to.
- Set the guardrails the loop has to stay inside of.
- Watch the loop watch itself.
That's it. Their job is not to do the work inside the loop. The loop does the work. The operator decides whether the loop should keep running, change direction, or stop.
If the phrase "forward-deployed engineer" rings a bell, it should. In 2006 Alex Karp asked Shyam Sankar — then Palantir's COO — why French restaurants were so good. "At a French restaurant, the wait staff is actually part of the kitchen staff." Karp wanted that, for engineering. So they built it.
"Investors ridiculed us for creating a 'services' role that would only serve to depress the margins of a software company. Something they eventually realized was a feature, not a bug."¹⁸
Shyam Sankar, "The Primacy of Winning," Pirate Wires, April 9, 2024.
Until 2016, Palantir employed more Forward Deployed Engineers than software engineers. The mocked role was the company. In 2026 every frontier AI lab is quietly rebuilding the same model.
What nobody has named yet is the generalist version — the person who runs loops across sales, hiring, client delivery, internal operations, product development. Anywhere a goal can be made measurable. That is what we are calling the operator at Iksha Labs, and that is the role we are restructuring around.
Who makes a good operator?
The people most natively suited to running loops were already trained to think in feedback systems.
Systems thinkers — trained in cybernetics, complexity theory, control systems — already see the world as loops with delays, feedback, and side effects. Design thinkers — diverge-converge, build-test-iterate — are temperamentally suited to loops because they do not get attached to the first draft. OODA-trained operators already know that fast accurate decisions win and that orientation is the only place real edge lives. John Boyd, who developed OODA, was nicknamed "Forty-Second Boyd" because he held a standing bet at Nellis Air Force Base that he could defeat any pilot in under forty seconds. He never lost.
The surprising ones are loop-native too: improv comedians who commit to motion and adjust on the next line. ER triage nurses running Glasgow scales every fifteen minutes. Air traffic controllers updating vectors on a live radar. Poker players updating ranges on every street. None of them produces a plan. All of them produce a constantly-updated state.
The people who will struggle are the ones taught the opposite shape.
Donella Meadows said it sharper than anyone: "Systems can't be controlled, but they can be designed and redesigned… We can't control systems or figure them out. But we can dance with them."¹⁹
- You open every project with a 12-month roadmap. A looper opens with a 2-week experiment.
- Your status update is "we hit the milestone." A looper's is "we updated the prior."
- Your strategy deck has more slides than your team has shipped this quarter.
- You cannot name the last belief you killed.
- The word "pivot" still sounds like failure to you. You were trained for a world that ended.
Understanding the operator role raises a more uncomfortable question: if operators are the winners, what exactly are they winning? The answer is not the model. The answer is the loop — and the distinction matters more than most people realize.
E. The model is rented. The loop is owned.
Now here's the part nobody on AI Twitter wants to hear.
The dominant take is that frontier models commoditize skill. When everyone has Claude, the engineer who used to be twice as productive is now only marginally so. AI agents flatten the field.
I think it's exactly backwards.
The same Claude Sonnet 4.5 (identical weights, identical price, identical context window) produced 43.2% on one agent scaffold and 59.8% on another. Same model. Different loop. A 16.6-point swing.⁸
SWE-bench Verified · April 2026
The leaderboard maintainer wrote: "The most important number on this table is not any individual score. It is the spread between rows that share the same base model."
The model is the floor. The loop is the ceiling. Whoever builds the better loop wins, even when the model below is identical.
This is also why, in May 2026, both frontier labs spent billions of dollars not to train better models — but to teach the world how to use the ones already shipped. Anthropic announced a new enterprise AI services firm on May 4 with Blackstone, Hellman & Friedman, and Goldman Sachs.⁹ Seven days later, OpenAI launched a $4B deployment company with TPG, Bain Capital, Brookfield, and SoftBank — 19 investors in total.¹⁰
The signal hidden in the OpenAI deal: the company guaranteed its private equity investors a 17.5% annual return floor across the five-year term, per reporting.¹⁰ A model lab that believed a better model would close the deployment gap does not guarantee 17.5% to lock in five years of distribution.
PwC, expanding their Anthropic alliance on May 14, put the cost of pre-AI workflows still running inside large companies at more than $2 trillion.¹¹
Alex Immerman and Santiago Rodriguez at a16z wrote in March 2026: "Better models don't make the application layer thinner: they make it more capable, because the hard part was never raw intelligence. It was knowing what to do with it."¹⁷
Now the part that hurts to admit. Loops are hard. Not technically hard. Practically hard. Hard the way culture is hard. Hard because they expose you.
Writing down a verifiable goal is uncomfortable. Building real observability takes months and ships no features. Translating "we don't compromise on quality" into a machine-checkable constraint is harder than it sounds. And evals — watching the loop watch itself instead of diving in and executing every step yourself — is the one most operators cannot resist breaking.
If you want a precedent, look at Toyota.
Toyota published its production system manual in English in 1977.¹² Free. NUMMI gave General Motors a working Toyota plant on US soil, with a trained workforce, for fourteen years. Surveys of Lean implementations consistently find failure rates in the 70–90% range across industries.¹²
IndustryWeek survey · All About Lean systematic review
The reason isn't that the manual was wrong. It's that copying a loop is harder than copying a diagram. As Deming put it: "American management thinks that they can just copy from Japan, but they don't know what to copy!"²⁰ Detroit copied kanban cards and 5S audits. They missed the loop.
The same will be true for AI agents. Anyone can run Opus 4.6 next week. Almost no one will build a loop around it that compounds. The gap between "the framework is obvious" and "the practice is hard" is exactly where moats live.
At Iksha Labs, we are inside that gap right now. Here is what one working loop actually looks like.
F. Loop One at Iksha Labs: enterprise voice agents on autopilot
Okay, let me make this concrete. Here's a loop we run today at Iksha Labs for one of our healthcare enterprise clients — voice agents, built for healthcare compliance, production-grade, live with real clinical staff. (I'm keeping the client name out, but everything else is real.)
Healthcare compliance is not a checkbox. It is a moving set of constraints across data handling, audit trails, access controls, and clinical output validation — each one with teeth. Layered on top: voice experience requirements that matter enormously in clinical settings. Latency. Response character. Interruption handling. A nurse asking a voice agent a question mid-shift does not have patience for a two-second pause or an answer that sounds like a chatbot. Getting all of this right, the old way, was a multi-month process involving compliance teams, QA cycles, and rounds of clinical review that would stretch across entire quarters.
Today, end to end — from spec to build to testing to verification to evals to customer feedback — the same work takes days, not months.
What changed is not the complexity of the problem. The problem is just as hard. What changed is that we got the four organs right, and now the loop does the compounding.
The most striking thing: we were able to turn customer feedback into evals. Real clinical users flag an issue. We translate that flag into a measurable eval condition. The loop runs against it. If a scenario fails, it is almost always one of two things — an eval we need to tighten, or occasionally a goal we had not fully specified. We change one of the four layers. We run the loop again. That is the entire maintenance cycle.
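To show the shape of that maintenance cycle, here is a hedged sketch of a user flag becoming a permanent eval. It is not our production code. The `Eval` record, the `add_eval_from_feedback` helper, the example flag, and the 300 ms threshold are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Eval:
    name: str
    source: str                    # "spec" or the user flag it came from
    check: Callable[[dict], bool]  # yes/no question about one recorded run

evals = [
    Eval("first_response_under_500ms", "spec", lambda run: run["first_response_ms"] < 500),
]

def add_eval_from_feedback(flag: str, name: str, check: Callable[[dict], bool]) -> None:
    """Turn a user's flag into a permanent, checkable condition."""
    evals.append(Eval(name, f"feedback: {flag}", check))

def failing(run: dict) -> list:
    return [e.name for e in evals if not e.check(run)]

# Hypothetical flag and threshold, for illustration only.
add_eval_from_feedback(
    flag="agent keeps talking when I interrupt",
    name="stops_within_300ms_of_interrupt",
    check=lambda run: run["stop_after_interrupt_ms"] < 300,
)

run = {"first_response_ms": 410, "stop_after_interrupt_ms": 900}
print(failing(run))
```

Keeping the `source` on each eval means a failing check can always be traced back to the person or spec that asked for it.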
We run it as a single loop made of smaller loops.
01. Goals: translated out of enterprise English
The hardest part: making requirements machine-checkable. The traditional approach to a requirement like "the system must meet healthcare compliance standards" is to break it into Jira tickets and trust a project manager to chase compliance. That does not work for a loop. A loop cannot verify a Jira ticket.
So we took every compliance requirement and translated it into something explicit: a markdown spec, with the actual conditions, the actual definitions of what it means for this module, in this client's environment, to be compliant. Same shape for on-prem performance. Same shape for every non-functional requirement.
The bottleneck of the entire system sits here. Spec-writing is the only thing that has not yet been fully automated. Everything else downstream — code, tests, deploys — runs without us.
02. Observability: giving the loop full vision
PRDs, design files, code, tests, builds, deploys, runtime logs, monitoring — every one of those was already digital and retrievable. The loop has read access across the whole stack. It sees what we see. Net: human-in-the-loop demand at the observability layer was close to zero from day one.
03. Guardrails: the policy book
For each enterprise client, we co-author what we call the policy book: a plain-English document of every constraint, preference, and hard rule that bounds what the loop is allowed to do. Not a Confluence page. A live document the loop reads on every meaningful action. Verified and signed off by the client. Three parties, one document, no ambiguity.
04. Evals: the measuring tape
Every meaningful sub-loop has evals running against it continuously. Did the voice agent respond in under 500ms? Did the clinical summarizer capture every medication? Did the scheduling agent stay inside the policy book? Each one a specific, checkable question.
For things that genuinely require human judgment — voice quality, clinical validation of imaging outputs — we wrote eval rubrics in plain English and put a human in that specific micro-loop. The human is not in every loop. They are in the loops where their judgment is the value.
Above all of it sits the meta-loop: agents reviewing the day's work, comparing it against the goals, watching for drift, surfacing anything the smaller loops missed.
What was actually hard
None of this was easy. And honestly, the things that were hard weren't the things we expected to be hard.
The hardest thing wasn't technical. It was psychological. Until people stopped needing to see the code before they trusted it, stopped feeling like they had to review every PR themselves, stopped sitting on documents for hours before releasing them — the loop couldn't run. The instinct to stay inside the work, to touch every output, to personally verify before moving forward — that instinct is exactly what makes a good engineer. It's also exactly what breaks an agentic workflow. This mindset shift was the biggest challenge when we started. It's still the biggest challenge today, even as the agents have gotten dramatically better. The tools move faster than the humans using them.
The second hard thing was the tooling layer. Out-of-the-box coding agents aren't enough. What made the real difference was the layer we built on top — custom harnesses, skills, context that persisted across sessions, workflows tuned to our specific way of working. In the early weeks, getting anything done was slow and painful. The models would lose context, repeat themselves, miss domain-specific constraints. But as we gave the system more history, more skills, more accumulated context about what we were trying to build — it got faster. Even when things didn't work, the iteration became faster. The return on investing in that layer compounded noticeably by week four and dramatically by week eight.
The third hard thing is one I don't think the industry has solved yet. Most frontier companies still get this wrong: throwing agents at a problem doesn't produce production-quality output by default. Generic agents produce generic results. The things that actually matter — the clinical constraints, the compliance requirements, the domain-specific edge cases, the particular way a client expects things done — are almost never in the model's training data at the specificity you need. You have to build them in. The custom harness, the domain-specific evals, the industry-specific policy book — these aren't overhead. They are the differentiator. They are what turns a generic agent into a loop that actually works in your specific world. And nobody can buy that from a vendor.
Specs go in. Production-grade software comes out. Humans intervene at three places: writing the initial spec, signing off the policy book, and judging the small set of outputs where taste is irreducible. Everything else — coding, testing, deployment, design, monitoring, documentation — runs without us.
The most surprising part has been design. When we started this approach a year ago, we assumed design would stay manual the longest. It has not. With Claude's design capabilities, Figma's MCP integration, and a small set of opinionated skills, the design loop ships better and faster than what a strong designer could ramp up to in the same window.
We still have a designer on this engagement. Their job has changed. They are not drawing screens anymore. They are running the design loop — writing the design spec, watching the design observability, holding the design guardrails, judging where the loop's taste needs to be corrected. They did not lose their job. They got promoted.
The only meaningful bottleneck left is how fast we can write the initial specs. This is Loop One. It is one of several we run at Iksha Labs. Every operator on our team owns one or more of them.
Five years from now, none of this should sound novel. Right now, in May 2026, almost nobody is doing it. That gap is the moat.
G. We are not alone: the same shape, three other roles
If Loop One sounds like a one-off, it isn't — and that's the part I want to flag. The same four-organ pattern is showing up across roles and companies, and in each case, the numbers confirm what the pattern predicts.
The solo founder running an agent swarm
The agents do the company. He watches the loop.
AI SDRs in production
The thing the salesperson used to spend half the week on — running the outbound — is done by the loop. The thing they used to barely have time for — thinking about the outbound — is now their job.
Autonomous coding in production
Before:
- Writes code directly
- Reviews every PR personally
- Bottleneck = coding speed
- Output: limited by hours

After:
- Writes the spec & goals
- Watches the loop, not the code
- Bottleneck = spec quality
- Output: limited by loop design
The Greptile number is the one to sit with. The scaffolds were the same. The loops were the same. The models changed, and the output multiplied. That is what "well-designed loops can appreciate as model capability improves" means in practice — and it is the subject of the next section.
H. Most software depreciates. Loops appreciate.
Here's something I keep coming back to. Most strategic investments in software depreciate. The framework you bet on in 2019 looks dated by 2023. The infrastructure you built in 2022 gets rewritten by 2025. Software has a half-life.
A loop, once built — with its goal, its observability, its guardrails, its evals — can get better every time the model underneath gets better, provided those evals, that context, and those guardrails remain valid. You don't have to rebuild the loop. You swap the engine — and if your evals catch the difference, the improvement is real.
The Greptile data from section G makes this concrete: the same scaffolds, running the same loops, went from single-digit AI-authored PR shares to over 25% in a year — without major changes to the scaffolds.¹⁵ The models changed significantly. Greptile's data suggests much of the gain tracked model improvements, with the loop structure providing the architecture that captured those gains.
Karpathy described what happened to his own coding workflow in late 2025: the ratio of manual to AI-driven work flipped dramatically in weeks. Same engineer. Same projects. The model crossed a threshold of coherence. This is not "AI got better." This is compounding — but only because the loop was already there to capture it.
A loop is a long-term asset. The model is a short-term lease. When you build a loop, you are not buying capability — you are building infrastructure that converts every future model release into more capability automatically, so long as the loop's foundations remain sound.
When the next model release arrives, your loop does not change. The model under it got smarter. The loop did more work, automatically, while you did nothing. The same will be true for the next release, and the one after that — as long as your evals and context keep pace.
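The engine swap described above can be sketched in a few lines. This is a minimal illustration, not a description of any real setup: `Loop`, `swap_engine`, the toy models, and the eval cases are all hypothetical, and the "models" are plain functions standing in for real model calls. The only thing it tries to show is the shape: the goal, guardrail, and evals stay fixed, and a new model is adopted only if those unchanged evals say it scores higher.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Model = Callable[[str], str]   # stand-in for a real model call (assumption)
EvalCase = Tuple[str, str]     # (input, expected output)


@dataclass
class Loop:
    goal: str
    model: Model
    evals: List[EvalCase]
    guardrail: Callable[[str], bool]  # returns True if an output is allowed

    def score(self, model: Model) -> float:
        """Fraction of eval cases a model passes without tripping the guardrail."""
        passed = 0
        for prompt, expected in self.evals:
            output = model(prompt)
            if self.guardrail(output) and output == expected:
                passed += 1
        return passed / len(self.evals)

    def swap_engine(self, candidate: Model) -> bool:
        """Adopt the candidate only if the unchanged evals say it is better."""
        if self.score(candidate) > self.score(self.model):
            self.model = candidate
            return True
        return False


# Toy "models": the old one only upper-cases, the new one also trims whitespace.
def old_model(text: str) -> str:
    return text.upper()


def new_model(text: str) -> str:
    return text.strip().upper()


loop = Loop(
    goal="normalize customer names",  # hypothetical goal
    model=old_model,
    evals=[("ada", "ADA"), (" grace ", "GRACE"), ("linus\n", "LINUS")],
    guardrail=lambda out: len(out) < 100,
)

before = loop.score(loop.model)       # old model passes 1 of 3 cases
swapped = loop.swap_engine(new_model)  # True: the evals caught the improvement
after = loop.score(loop.model)        # new model passes 3 of 3 cases
```

Nothing about the loop changed between the two scores; only the model did. If the evals had gone stale (say, no case with stray whitespace), both models would have scored the same and the swap would have been rejected, which is the caveat the text attaches to every claim of automatic improvement.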
The standard advice has always been focus. That advice is for an era of scarce, expensive execution. We don't live there anymore.
"If your plan is to keep doing what you're doing, AI is terrifying. If your plan is to build something dramatically bigger, it's the best news you've ever gotten."
— Garry Tan, Boil the Ocean
Build loops. Build them everywhere. Build them now. In sales. In hiring. In client delivery. In product. In customer support. In ops. In finance. In marketing. Anywhere a goal can be made verifiable, build a loop.
Most of them will fail at first. The goal will be wrong. The observability will be incomplete. The guardrails will be too tight or too loose. You will redesign each of them at least three times. That is the point. The loops you redesign will compound. The companies that didn't start can't redesign anything, because they have nothing to redesign.
Pull all of this forward and the question becomes: where does it end? Or more precisely — where does it go?
→ Your loop audit: four questions
The framework is simple. The gap between reading it and using it is where the work lives. If you remember nothing else from this post, take these four questions and run them on any loop you own — or any project you're about to start.
If you answered all four: you have a loop. If you couldn't answer one or more: you know where to start.
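One way to make the audit mechanical is a checklist keyed on the four parts of a loop this essay keeps returning to: goal, observability, guardrails, evals. The sketch below assumes each of the four questions maps onto one of those parts. That mapping, the names `LoopAudit`, `missing`, and `is_loop`, and the sample answers are all mine, for illustration only.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class LoopAudit:
    # One free-text answer per loop component.
    # None or a blank string means "couldn't answer".
    goal: Optional[str] = None
    observability: Optional[str] = None
    guardrails: Optional[str] = None
    evals: Optional[str] = None

    def missing(self) -> List[str]:
        """Names of the components with no answer yet, in audit order."""
        return [
            f.name
            for f in fields(self)
            if not (getattr(self, f.name) or "").strip()
        ]

    def is_loop(self) -> bool:
        """All four answered: you have a loop."""
        return not self.missing()


# Hypothetical answers for a support-team loop.
audit = LoopAudit(
    goal="Cut median ticket resolution time to under 4 hours",
    observability="Dashboard of resolution time per ticket, refreshed daily",
    guardrails="",  # left blank: not answered yet
)

print(audit.is_loop())  # False
print(audit.missing())  # ['guardrails', 'evals']
```

The list returned by `missing()` is the "where to start" from the line above: whichever component comes back first is the next thing to write down.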
I. The autopilot endgame: owning the recipe
One last thing — and I think it's worth sitting with where this goes.
Pull the trend forward five years. Model capability keeps doubling. Compute keeps cheapening. The fraction of work a loop can finish without supervision keeps climbing. At some point, for a meaningful share of the work in a meaningful number of companies, the loops are on autopilot.
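To get a feel for what sustained doubling implies over that window, here is a back-of-the-envelope sketch. It assumes a constant five-month doubling time, which is the upper bound this essay cites near its close; the essay says the real figure is shorter and falling, so this understates the trend if the trend holds. The function name `capability_multiple` is hypothetical, and what exactly is doubling is left as whatever measure the trend is quoted in.

```python
def capability_multiple(horizon_months: float, doubling_months: float) -> float:
    """Growth multiple after `horizon_months` at a constant doubling time."""
    return 2 ** (horizon_months / doubling_months)


# Assumption: a constant five-month doubling time, held for five years.
five_years = capability_multiple(horizon_months=60, doubling_months=5)
print(f"{five_years:,.0f}x")  # 60 / 5 = 12 doublings, 2 ** 12 -> prints "4,096x"
```

Twelve doublings is a factor of roughly four thousand. The exact number matters less than the order of magnitude: even if the doubling time were twice as long, the five-year multiple would still be 2 ** 6 = 64.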
Every economy since the industrial revolution has been bound by the same constraint: the cost of execution per unit of human attention. When loops run unattended on near-infinite compute, that constraint relaxes by orders of magnitude. A founder who could once supervise three projects can now supervise three hundred. A research team that could once try one experimental direction can now try every direction at once. The limit was never the work — it was the human in the loop.
When loops can explore a search space at machine speed and machine breadth, they will inevitably stumble onto things humans missed. Not because the loops are smarter — but because they will run more attempts, in more directions, with more patience, than any human team could. Better drugs. Better materials. Better proofs. Better processes. The economics make breakthroughs dramatically more likely.
And here is the part that matters most for you, today.
The model is rented. The compute is rented. The agent framework is rented. The frontier capability is on everyone's laptop. None of that is your moat.
What is yours is the goal you wrote down, the observability you built, the guardrails you set, the evals you watch. And one thing nobody can rent from anyone: your data, and the loop you wrap around it.
The companies that quietly accumulate proprietary data and well-designed loops over the next eighteen to thirty-six months will compound away from everyone else, because no model release will close that gap. Better loops on better data win, and they keep winning, because every new model release makes the same loops more capable — provided the evals stay honest and the context stays current.
This is why the operator role compounds in a way most others don't. Operators are the people who own the recipe.
The capability is on your laptop. The labs that built it are spending billions to teach the world how to use it. The doubling time is under five months and falling. Your competitor with the same model is, somewhere right now, designing a better loop than you.
The four buckets aren't hard. The first one is goals. Write yours down today — even badly, even with the wrong number, even on a napkin. You can fix it tomorrow. You can't fix what doesn't exist.
The model upgrade is on the house.
You are as good as your loops. Go look at yours.