# Self-Improvement Is Not Autonomy

Published: June 11, 2026
Author: Pradhan Sarathi
Canonical: https://predli.com/blog/self-improvement-is-not-autonomy

> AI is starting to build itself, but building itself is not the same as governing itself. The case for keeping humans in the judgment seat — and the system I built that does.

A couple of days ago Anthropic published *When AI Builds Itself* - a candid, genuinely good case that AI is starting to automate its own development, and that recursive self-improvement, where a model designs its own successor, may land sooner than most of us are planning for.[1](#ref-1) I read it with a pinch of salt. Not because the diagnosis is wrong - most of it is right - but because of the conclusion it draws from it.

        I'd know, because I've built a system that does more or less what the post describes. It documents itself, remembers across sessions, rewrites its own code, and deploys itself. And it has never once changed itself without me approving it. I'll come back to how it works. First, the fork - because the fork is the whole argument.

        Let me put their case at its strongest. Execution is being automated, fast: Claude writes most of Anthropic's production code now, and the trend behind that isn't a blip. METR has been tracking the length of task an AI agent can finish on its own, and it has roughly doubled every seven months for years - by early 2026 frontier models were handling things that take human experts hours.[2](#ref-2) What's left to humans, the argument goes, is judgment: picking which problems matter, reading a result, deciding what's worth doing at all. And here is the move I want to look at - they frame that judgment as a *narrowing* gap, and human review as a “new bottleneck.”[1](#ref-1) A bottleneck is something you engineer away. Follow that line and you arrive where they do: a future where the system sets its own agenda and humans step back to oversight.

        This is where I part ways. The diagnosis - execution is largely solved, judgment is the frontier - I'd agree with. The inference, that judgment is a gap that closes as the models improve, I wouldn't. To my mind judgment isn't a quantity these systems are slightly short of; it's a different kind of thing altogether. It leans on a persistent model of the world and an actual stake in the outcome, and a next-token predictor has neither - no continuity, no real preferences. That isn't a dismissal. These models are not glorified autocomplete; there is genuine, emergent capability in there. But emergent capability and situated judgment are not the same thing. Cognitive scientists draw the line precisely: today's models have striking *formal* competence - the fluent, rule-following machinery of language - but *functional* competence, the world-modelling and situated reasoning that judgment actually runs on, is a separate set of capacities they largely lack.[3](#ref-3)

The empirical record points the same way. Apple's *GSM-Symbolic* showed that if you change only the numbers in a maths problem, or add a clause that sounds relevant but isn't, accuracy drops - which the authors read as models replaying patterns from training rather than reasoning.[4](#ref-4) *Faith and Fate* had shown the same thing structurally, with transformers solving compositional tasks by something closer to subgraph matching than systematic reasoning, and accuracy decaying from near-perfect to zero as complexity rose.[5](#ref-5) The durable version of the claim isn't that a model trips over one particular puzzle - newer models clear the old puzzles - it's that the behaviour tracks the training distribution rather than the logic. McCoy and colleagues showed exactly that across task after task: models do markedly better on common variants of a problem than on rare ones, because they are shaped by the next-token objective they were trained on, not by the underlying rule.[6](#ref-6) You can't lean on the model to check its own work, either: a Google DeepMind team found that when a model is asked to revise its reasoning with no external signal, accuracy generally gets *worse*, not better.[7](#ref-7) Nor can you fully trust its account of how it got there - Anthropic's own researchers slipped models a hint, watched them change their answers, and found they often admitted using it less than 20% of the time.[8](#ref-8) To me, this looks more like powerful pattern completion than like something I would hand judgment over to.

But suppose I'm wrong, and judgment really is emerging somewhere in the weights. It still wouldn't change how you should build, because we can't measure it. Our evaluations sample *capability*, not *reliability*: they tell you a model can do something some of the time, not that it will do it every time on inputs nobody anticipated. The benchmarks themselves can mislead - a NeurIPS Outstanding Paper showed that some of the dramatic “emergent” jumps in capability are artefacts of the metric you choose, smoothing into gradual curves the moment you measure them differently.[9](#ref-9) Chain a few stochastic agents together and it gets worse - a Berkeley study catalogued fourteen distinct failure modes across popular multi-agent frameworks and found the gains over a single agent often marginal.[10](#ref-10) I've come at this from the building side, too: in a 2025 whitepaper with AI Sweden, my co-authors and I had to stand up a bespoke, multi-dimensional evaluation framework for multi-agent systems - and a Gaussian-process Bayesian-optimization method to search for the configuration that reliably solves a given task set - precisely because no off-the-shelf benchmark tells you whether a system like that actually works.[11](#ref-11) Even METR's curve, the one that looks so much like a runway, is measured at a 50% success rate.[2](#ref-2) Half the time. That's a fair benchmark of capability and a useless one for production - nobody ships a workflow that works half the time. And the judgment numbers in the Anthropic post have a quieter problem: a human still defines what counts as a better answer, so what gets measured is the model getting better at producing things humans approve of. That is execution improving, dressed up as judgment. You can't bootstrap judgment out of a benchmark whose ground truth is a human's judgment.
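
To put rough numbers on the gap between capability and reliability, here is a small sketch of how per-step success compounds across a chain of steps. It assumes every step succeeds independently with the same probability, which is a simplification (real agent failures are correlated), and the function name and sample values are illustrative, not figures from METR or the Berkeley study.

```python
def chain_success(p_step: float, steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming independence."""
    if not 0.0 <= p_step <= 1.0:
        raise ValueError("p_step must be a probability between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return p_step ** steps


# Print a small grid: per-step success rate against chain length.
for p in (0.5, 0.9, 0.99):
    for n in (1, 3, 10):
        print(f"p={p:<4} steps={n:<2} end-to-end={chain_success(p, n):.3f}")
```

Under that assumption, three chained steps at 50% each succeed end to end 12.5% of the time, and even 90% per step falls to roughly 35% over ten steps.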

        So I keep the human in the judgment seat - not as a bottleneck I'm waiting to remove, but as the load-bearing part of the design. Which turns the interesting question into an engineering one: how do you build a system that improves itself as aggressively as possible while keeping that seat occupied? That's what the rest of this is about.

## A system that builds itself - and still asks

**Fig.01**  Self-Build: PROPOSE → APPROVE → BUILD → DEPLOY

        It started as a cleanup job, not a thesis. My team had adopted coding agents, and each one solved things its own way - bloating the codebase with abstractions we didn't need, duplicating types and constants, ignoring conventions the rest of us had agreed on. So I built a harness: one repository that encoded how we actually work.

        Then the obvious next steps kept presenting themselves. If the harness held our conventions, it could run agents against them. If it could run agents, one of them could read its own sessions and fold what it learned back into the harness. Then another to maintain the scaffolding around all of it. When the thing got too big for me to hold in my head, I had agents hunt and fix their own bugs - by asking me about them. Each loop, it got a little better. Today it documents itself, remembers across sessions, rewrites its own code, and deploys itself. We call it predlisys. It's also where my team does its actual work - the developer, reviewer, and planner agents are teammates we delegate to: a ticket to implement, a pull request to review, a fuzzy piece of work to break into simpler tasks.

        The biggest thing we build with it is [Predli Studio](https://studio.predli.com) - in effect an enterprise version of predlisys, the same idea aimed at a different audience. Where predlisys wraps a harness around developer workflows for people who live in a terminal, Predli Studio wraps an opinionated harness around enterprise workflows for people with no technical background at all. That the same approach holds at both ends is, to me, the quiet evidence for the argument that follows: the leverage is in the harness, and the harness scales.

        By any honest definition, predlisys is recursive self-improvement. It changes its own code, on its own schedule, in production. And it has never once done so without a human approving the change. A maintainer agent proposes; I, or another operator, decide; only then does a builder agent carry it out. The loop is real, and it's deliberately broken at exactly the spot where Anthropic's most aggressive future removes the break.
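
The propose, approve, build sequence can be pictured as a small state machine with no edge that skips the human. The sketch below is not predlisys code: the state names follow the figure, while the `Change` class, the actor labels, and the `advance` function are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    PROPOSED = "proposed"
    APPROVED = "approved"
    REJECTED = "rejected"
    BUILT = "built"
    DEPLOYED = "deployed"


# The only transitions that exist. There is no edge from PROPOSED to BUILT.
ALLOWED = {
    State.PROPOSED: {State.APPROVED, State.REJECTED},
    State.APPROVED: {State.BUILT},
    State.BUILT: {State.DEPLOYED},
}

# Hypothetical actor kinds: who is allowed to move a change into each state.
REQUIRED_ACTOR = {
    State.APPROVED: "human",
    State.REJECTED: "human",
    State.BUILT: "builder",
    State.DEPLOYED: "builder",
}


@dataclass
class Change:
    summary: str
    state: State = State.PROPOSED


def advance(change: Change, target: State, actor: str) -> Change:
    """Move a change to `target`, or raise if the gate refuses."""
    if target not in ALLOWED.get(change.state, set()):
        raise PermissionError(f"{change.state.value} -> {target.value} is not allowed")
    if REQUIRED_ACTOR[target] != actor:
        raise PermissionError(f"{actor} may not set {target.value}")
    change.state = target
    return change
```

In this shape a builder that tries to build a change still in `PROPOSED` gets a `PermissionError`, because the break in the loop is part of the transition table and not a convention the agents are asked to follow.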

        Which is the whole point, and it's worth saying plainly: self-improvement and full autonomy were never the same thing. You can have a system that improves itself relentlessly and still gate every change on a human. Treating “it improves itself” and “it should run itself” as one claim is, I think, the central confusion of the moment. The rest comes down to how you build so the two stay separate - which is less exotic than it sounds.

## How it's actually built

**Fig.02**  Three Parts: ORCHESTRATION, HARNESS, DATA

The most striking thing about predlisys, if you go looking for the clever part, is that there isn't one. No vector database, no graph store, no “AI-native” filesystem, no bespoke agent orchestrator, no framework with “agentic” in the name. It's a single VM. SQLite for state. systemd timers for scheduling. bash for glue. git worktrees for isolation. The Claude Code CLI as the runtime the agents actually run in. That's a deliberate choice, not a shortcut I mean to fix later. A system that edits its own code and touches production wants the most boring, auditable, access-controlled substrate you can give it - not the newest one. Whatever an agent does, it does as ordinary, gated command execution in a persistent workspace, as a specific Linux user, with a specific set of permissions. Nothing an LLM touches gets a special path.

Underneath, it's three pieces. The first is the VM and its orchestration - the part that would look familiar to anyone who has run a normal backend. A listener takes in webhooks (a Linear ticket, a GitHub review request, a Slack message), verifies them, and drops a row on a queue. A dispatcher polls that queue, spawns short-lived worker processes, supervises them, and recovers the ones that die. systemd timers fire the scheduled agents on a cron-like rhythm - the memoriser at night, the documenter after it, the maintainer twice a day, a monitor three times a day, garbage collection nightly. Anything that must not run twice at once, like a deploy, takes a plain file lock first. Operators are real Linux users; secrets live in a locked-down env file loaded per service; each workspace is group-owned so a human can open it in their editor without sudo. It's boring on purpose, and the boringness is the feature - all of it observable with `journalctl` and inspectable with `sqlite3`.
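
The listener, queue, and dispatcher described above map onto very little code. What follows is a minimal sketch using Python's standard `sqlite3` module, assuming a single dispatcher process. The table names, column names, and function names are invented for illustration; the post does not give the real schema.

```python
import sqlite3

# Hypothetical schema: a dedup cache for webhook deliveries plus a work queue.
SCHEMA = """
CREATE TABLE IF NOT EXISTS seen_webhooks (delivery_id TEXT PRIMARY KEY);
CREATE TABLE IF NOT EXISTS queue (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    payload TEXT NOT NULL,
    status  TEXT NOT NULL DEFAULT 'pending',
    worker  TEXT
);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def enqueue(conn: sqlite3.Connection, delivery_id: str, payload: str) -> bool:
    """Listener side: queue a verified webhook once, ignoring redeliveries."""
    with conn:  # one transaction: both rows land or neither does
        cur = conn.execute(
            "INSERT OR IGNORE INTO seen_webhooks (delivery_id) VALUES (?)",
            (delivery_id,),
        )
        if cur.rowcount == 0:
            return False  # duplicate delivery, already queued
        conn.execute("INSERT INTO queue (payload) VALUES (?)", (payload,))
    return True


def claim_next(conn: sqlite3.Connection, worker: str):
    """Dispatcher side: hand the oldest pending row to a worker."""
    with conn:
        row = conn.execute(
            "SELECT id, payload FROM queue WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE queue SET status = 'running', worker = ? WHERE id = ?",
            (worker, row[0]),
        )
    return row


def requeue_dead(conn: sqlite3.Connection, dead_worker: str) -> int:
    """Crash recovery: put a dead worker's rows back on the queue."""
    with conn:
        cur = conn.execute(
            "UPDATE queue SET status = 'pending', worker = NULL "
            "WHERE status = 'running' AND worker = ?",
            (dead_worker,),
        )
    return cur.rowcount
```

Everything here stays visible from the `sqlite3` shell, which is the point of the design: the queue is rows in a table, not state hidden inside a broker.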

        The second piece is the agent harness - the layer that sits on top of Claude Code: the agents themselves, the hooks, the slash commands, the scoped tool access. Each agent is just a markdown system prompt plus a tight permission boundary. The *developer* agent's whole job is turning a Linear ticket, a Slack message, or a GitHub comment into a pull request; it gets the GitHub and Linear integrations and read/write to a defined subset of product repos, and nothing else. The *builder* agent is a meta-agent - its job is predlisys itself. It is the only agent allowed to modify the system, it deploys through a single locked script, and it never goes near the product repos. A pre-tool hook enforces all of this at the moment of the call, not by convention: it blocks force-pushes, blocks agents from pushing to main directly, blocks anyone but the memoriser from editing memory, anyone but the documenter from touching the docs, and every agent but the builder from running the deploy. The permissions are coarse, static, and written in bash - not a policy engine that itself needs trusting.
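
The rules that hook enforces are simple enough to write down as plain predicates. The real hook is bash; the Python below is only a sketch of the same coarse, static checks. The directory names, the deploy script name, and the agent identifiers are assumptions, and the string matching is deliberately naive (it would not, for instance, catch a bare `git push` issued while on main).

```python
import shlex

# Hypothetical names: the real paths, script name, and agent ids are not given.
MEMORY_DIR = "memory/"
DOCS_DIR = "docs/"
DEPLOY_SCRIPT = "deploy.sh"


def check_command(agent: str, command: str) -> tuple[bool, str]:
    """Decide whether `agent` may run a shell command. Returns (allowed, reason)."""
    try:
        argv = shlex.split(command)
    except ValueError:
        return False, "unparseable command"  # fail closed on anything odd
    if not argv:
        return False, "empty command"

    if argv[0] == "git" and "push" in argv[1:]:
        if any(a == "-f" or a.startswith("--force") for a in argv):
            return False, "force-push is blocked"
        if any(a == "main" or a.endswith(":main") for a in argv):
            return False, "direct push to main is blocked"

    if any(a.split("/")[-1] == DEPLOY_SCRIPT for a in argv) and agent != "builder":
        return False, "only the builder runs the deploy"

    return True, "ok"


def check_write(agent: str, path: str) -> tuple[bool, str]:
    """Decide whether `agent` may edit the file at `path` (relative to the repo)."""
    if path.startswith(MEMORY_DIR) and agent != "memoriser":
        return False, "only the memoriser edits memory"
    if path.startswith(DOCS_DIR) and agent != "documenter":
        return False, "only the documenter edits the docs"
    return True, "ok"
```

A hook built this way has nothing to configure at runtime and nothing to trust beyond the file itself, which matches the case the post makes for static rules over a policy engine.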

        The third piece is data, and it's split on purpose. Structured, deterministic state lives in one SQLite file - the work queue, the open questions waiting on a human, the worker table used for crash recovery, the dedup cache for webhooks. Freeform state lives as plain files on the VM: raw session transcripts as JSONL, memory as markdown, task artifacts. The split maps cleanly onto git, and that one line in the `.gitignore` is really a design statement - sessions are ignored (raw, enormous, disposable), memory is committed (curated, durable, the conventions the system has actually learned). The lived experience is throwaway; the distilled lesson is version-controlled. There's no retrieval magic here either - memory is just markdown files an agent reads at startup, the same way a new hire reads the team wiki.

        Step back and it's an old idea in a new costume: a control problem. You have an unreliable, high-bandwidth component and you want dependable behaviour out of the whole, so you wrap the stochastic core - the model - in a deterministic envelope of scripts, schemas, and gates. For the long-horizon, fuzzy work - planning, deciding what to do - the human sits in the slow outer loop, setting the targets. For execution, the gates act as interlocks: you don't let an unverified command reach production without a guard in front of it, and when the system is unsure it fails closed - it stops and asks rather than acting anyway. You contain the model because its unreliability isn't a phase it's about to grow out of. OpenAI's own researchers recently argued that hallucination is, in a formal sense, statistically inevitable in these models - a product of how they're trained and scored.[12](#ref-12) If the people building the models are telling you they'll confidently state false things as a matter of mathematics, the sane response isn't to wait for a better one. It's to build the envelope now.
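
As a sketch of what "fails closed" means in code: a gate that lets a command through only when every check explicitly returns `True`, and treats a crash, a missing answer, or an empty list of checks as a reason to stop and ask. The function names and the two sample checks are hypothetical and only illustrate the pattern.

```python
from typing import Callable, Iterable

Check = Callable[[str], bool]


def gate(command: str, checks: Iterable[Check]) -> tuple[str, str]:
    """Return ("execute", "") only if every check passes; otherwise ask a human."""
    checks = list(checks)
    if not checks:
        return "ask_human", "no checks configured"  # nothing verified, so stop
    for check in checks:
        name = getattr(check, "__name__", "check")
        try:
            passed = check(command) is True  # None, 0, or "maybe" count as failure
        except Exception:
            passed = False  # a crashing check is treated as a failed check
        if not passed:
            return "ask_human", name
    return "execute", ""


def not_destructive(command: str) -> bool:
    """Sample check: refuse two obviously destructive patterns."""
    return "rm -rf" not in command and "DROP TABLE" not in command.upper()


def has_ticket(command: str) -> bool:
    """Sample check that simulates an outage in a dependency."""
    raise RuntimeError("ticket service unreachable")
```

The asymmetry is deliberate: the only route to `"execute"` is an unbroken run of explicit passes, so every kind of uncertainty lands on the human side of the gate.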

## Controlled autonomy

**Fig.03**  The Spectrum: FULLY CONTROLLED (traditional software) · CONTROLLED AUTONOMY · FULLY AUTONOMOUS (agentic, nondeterministic)

        None of this works if the system has to learn everything from scratch - and it doesn't have to, because most of what it needs is already known. The deterministic parts - the agent roles, the workflows we already understand, the software scaffolding - are bootstrapped in as initial context. The developer agent doesn't *discover* that its job is ticket-to-PR; it starts knowing that, already scoped to the right repos and tools. Memory starts deliberately thin - stubs, not a pre-filled brain. The system isn't meant to invent a personality; it's meant to start competent at the few things we already know it should do, and learn the rest narrowly.

        And it learns the way a good junior teammate does - on the small stuff, with a human in the loop. When an agent hits something genuinely ambiguous - unclear scope, a design call, a risky migration, a tool behaving oddly - it doesn't guess. It opens a structured question and waits; I answer; it carries on from where it left off, context intact. Overnight the memoriser reads the day's sessions, notices something that came up three or four times, and proposes writing it into memory - which I approve or wave off. Over weeks the stubs fill with conventions the system actually earned, the questions get rarer and sharper, and it gets genuinely reliable at handling things end to end. The human feedback is dense at first and thins as trust is earned - which is the opposite of the headcount story, and the honest version of it.
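
The memoriser's nightly pass can be pictured as counting and proposing, never writing. This is a sketch under assumptions: observations arrive as short strings already extracted from the day's sessions, the threshold of three is read loosely from "three or four times", and every name here is hypothetical.

```python
from collections import Counter
from typing import Callable

THRESHOLD = 3  # assumption: "came up three or four times"


def propose_memories(
    observations: list[str], known: set[str], threshold: int = THRESHOLD
) -> list[str]:
    """Return recurring observations that are not already in memory."""
    counts = Counter(o.strip().lower() for o in observations if o.strip())
    return sorted(
        note for note, n in counts.items() if n >= threshold and note not in known
    )


def review(proposals: list[str], approve: Callable[[str], bool]) -> list[str]:
    """Keep only what the human approves; everything else is dropped."""
    return [p for p in proposals if approve(p)]
```

Only the output of `review` would ever be committed to memory, so what gets learned is still decided by a person.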

        This is what I mean by controlled autonomy, and it's the operating principle of the whole thing: humans do the thinking and the judging, agents do the executing. Once I've decided, the agent executes precisely and completely, with no second confirmation - that part really is autonomous. What stays with the human is the deciding. And the boundary between the two isn't fixed; it's a dial. An agent earns more room in a specific area as it proves reliable there, but a human always sets the dial, and it defaults closed wherever trust hasn't been earned yet. The boundary moves over time. The authority to move it does not.
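
One way to read the dial is as a per-area setting that defaults to the most restrictive level and can only be changed by a named operator. The three levels, the class name, and the operator and area names below are assumptions made for illustration; the post does not say how the dial is stored.

```python
from enum import IntEnum


class Level(IntEnum):
    ASK_FIRST = 0        # default: stop and ask a human
    ACT_THEN_REPORT = 1  # act, then tell a human what was done
    ACT = 2              # act without a per-change report


class Dial:
    """Per-area autonomy levels. Unknown areas are always ASK_FIRST."""

    def __init__(self, operators: set[str]):
        self._operators = set(operators)
        self._levels: dict[str, Level] = {}

    def level(self, area: str) -> Level:
        return self._levels.get(area, Level.ASK_FIRST)  # defaults closed

    def set_level(self, actor: str, area: str, level: Level) -> None:
        if actor not in self._operators:
            raise PermissionError(f"{actor} may not move the dial")
        self._levels[area] = level

    def may_act_alone(self, area: str) -> bool:
        return self.level(area) >= Level.ACT_THEN_REPORT
```

The boundary lives in `_levels` and can move; the authority lives in `_operators`, which no agent-facing method here can alter.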

## What this means

        If you're running an enterprise and weighing how to bring agents into a real stack, this is where it gets practical - and where I'd push back on the prevailing pitch. The pitch is: deploy agents, cut the human effort. In a decade of shipping production systems I've never once seen nondeterminism reduce the total human effort of keeping something running. It moves the effort - from writing to reviewing, supervising, correcting - and it adds a failure surface that wasn't there before. Validating a stochastic system's output before it touches production is a real, recurring cost, and it scales with how much the system produces, not against it. You can't automate that validator away without recreating the exact problem one layer up.

        So the advice more or less inverts the pitch. Architect the gates first and widen autonomy later - design the decision points and the deterministic envelope before you deploy, not as a patch after the first bad incident. Encode each human decision into something reusable - a rule, a memory file, a convention - so judgment compounds instead of getting re-litigated every time. Buy on reliability in your context, not on a benchmark score; treat “works sometimes” as “not trusted yet.” And aim for leverage, not headcount: let the agents absorb the toil and keep your people up at the altitude where their judgment is worth the most. A system that's slower but correct beats a fast one that's wrong every single time it touches production.

        I care about getting this right because the failure mode is quiet. The first risk is the one Anthropic itself names - errors compounding across iterations of an unsupervised system until they're past anyone's ability to catch.[1](#ref-1) Their answer to it lives between organisations, in coordination among labs. Mine lives inside the architecture: a loop that has to clear a human at every turn simply can't run unsupervised long enough to drift that far. The second risk is trust - over-promised autonomy that fails in production doesn't just cost that one deployment, it burns the credibility of everything genuine these systems can do, and there is a lot. The third is the slowest: deskilling. If we stop exercising judgment because the system seems to have it, the muscle atrophies - and it's exactly the muscle we'd need to catch the system on the day it's confidently wrong.

        Anthropic is right that AI is starting to build itself; predlisys is proof that it can. But building itself isn't the same as governing itself, and how fast a system improves tells you nothing about whether it should be the one deciding where to point. Execution is more or less solved. Judgment isn't a gap that's closing - it's a seat. The paradigm that wins keeps a human in it.

## Notes & references

1. Anthropic, When AI Builds Itself, June 2026. [anthropic.com/institute/recursive-self-improvement](https://www.anthropic.com/institute/recursive-self-improvement) [&#8617;](#src-1)
2. METR, Measuring AI Ability to Complete Long Tasks, 2025 (task horizon at 50% success). [metr.org](https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/) [&#8617;](#src-2)
3. Mahowald, Ivanova, Blank, Kanwisher, Tenenbaum, Fedorenko, Dissociating Language and Thought in Large Language Models, Trends in Cognitive Sciences, 2024, arXiv:2301.06627. [arxiv.org/abs/2301.06627](https://arxiv.org/abs/2301.06627) [&#8617;](#src-3)
4. Mirzadeh et al., GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in LLMs, Apple, ICLR 2025, arXiv:2410.05229. [arxiv.org/abs/2410.05229](https://arxiv.org/abs/2410.05229) [&#8617;](#src-4)
5. Dziri et al., Faith and Fate: Limits of Transformers on Compositionality, Allen Institute for AI / UW, NeurIPS 2023, arXiv:2305.18654. [arxiv.org/abs/2305.18654](https://arxiv.org/abs/2305.18654) [&#8617;](#src-5)
6. McCoy, Yao, Friedman, Hardy, Griffiths, Embers of Autoregression: How LLMs Are Shaped by the Problem They Are Trained to Solve, PNAS, 2024, arXiv:2309.13638. [arxiv.org/abs/2309.13638](https://arxiv.org/abs/2309.13638) [&#8617;](#src-6)
7. Huang, Chen, Mishra, Zheng, Yu, Song, Zhou, Large Language Models Cannot Self-Correct Reasoning Yet, Google DeepMind, ICLR 2024, arXiv:2310.01798. [arxiv.org/abs/2310.01798](https://arxiv.org/abs/2310.01798) [&#8617;](#src-7)
8. Anthropic, Reasoning Models Don't Always Say What They Think, 2025 (chain-of-thought faithfulness often <20%). [anthropic.com/research](https://www.anthropic.com/research/reasoning-models-dont-say-think) [&#8617;](#src-8)
9. Schaeffer, Miranda, Koyejo, Are Emergent Abilities of Large Language Models a Mirage?, NeurIPS 2023 (Outstanding Paper), arXiv:2304.15004. [arxiv.org/abs/2304.15004](https://arxiv.org/abs/2304.15004) [&#8617;](#src-9)
10. Cemri et al., Why Do Multi-Agent LLM Systems Fail? (MAST taxonomy), UC Berkeley, NeurIPS 2025, arXiv:2503.13657. [arxiv.org/abs/2503.13657](https://arxiv.org/abs/2503.13657) [&#8617;](#src-10)
11. Kumar, Augustinsson, Zethraeus, Sarathi, Bridges, A Practical Approach to Optimize Multi-Agent Systems (v2), Predli AB / AI Sweden / Chalmers, Dec 2025. [ai.se](https://www.ai.se/sites/default/files/2025-12/A%20Practical%20Approach%20to%20Optimize%20Multi-Agent%20Systems-v2.pdf) [&#8617;](#src-11)
12. Kalai, Nachum, Vempala, Zhang, Why Language Models Hallucinate, OpenAI, arXiv:2509.04664, 2025. [arxiv.org/abs/2509.04664](https://arxiv.org/abs/2509.04664) [&#8617;](#src-12)