Skip to main content

I set up an AI agent on a rented GPU, pointed it at a training script, and went to bed. By morning it had run 40 experiments, improved validation loss by 5.9%, and cut memory usage from 44 GB to 17 GB. It also spent four hours chasing a bug that a linter introduced behind its back. The agent never flagged it. I only found out because the numbers stopped improving and I started reading logs.

The setup was based on Andrej Karpathy’s autoresearch project: Give an agent one file it can edit (train.py), one metric to optimize (validation bits per byte), a fixed five-minute training budget per experiment, and Git for checkpointing. If an experiment beats the current best, keep the commit. If not, revert. Loop forever. Karpathy’s own run produced 700 experiments and 20 genuine improvements across 48 hours, an 11% speedup on already-optimized code. Shopify’s Tobi Lütke pointed the same pattern at Liquid, their templating engine, and got 53% faster rendering from 93 automated commits. The pattern clearly works. The question is what breaks when you run it yourself.

The first failure: Agents fixing agents

Before running autoresearch, I had a separate problem. I had 15 custom skills for Claude Code (think reusable prompt templates with tool access, structured inputs, and specific behaviors). Most of them were broken when dispatched as parallel background agents. Vague descriptions meant the system couldn’t figure out when to invoke them. Missing tool permissions caused silent failures. Duplicate scopes between similar skills created routing confusion.

So I used the same pattern: dispatch background agents in parallel, one per skill, each tasked with reading the skill definition, identifying problems, and rewriting it. 13 out of 15 came back improved. Descriptions got specific. Dead references to nonexistent files were removed. Tool permissions were added. Two skills were left untouched because the agents couldn’t find anything wrong with them. The whole batch took under an hour.

But here’s what I didn’t expect. Three of the “improved” skills had subtle regressions. One agent removed an AskUserQuestion gate that was there for a reason, because the gate’s purpose wasn’t documented and the agent read it as unnecessary friction. Another agent rewrote a skill description so precisely that it stopped triggering on the fuzzy, misspelled queries real users actually type. I caught these during manual review, but if I had trusted the parallel output without checking, three skills would have silently degraded in production.

The second failure: The linter in the loop

Then I started the training loop. The agent worked through hyperparameters methodically. It halved the batch size early (experiment 4), which turned out to be the single biggest win: more gradient steps in the same five-minute window. It reduced model depth from eight to seven layers, dropped weight decay from 0.2 to 0.05, and tuned the learning rate schedule. Each change was small. The cumulative effect was a 5.9% improvement in validation loss and a 60% reduction in peak GPU memory.

Out of 40 experiments, the agent kept nine, discarded 28, and crashed three. That keep/discard ratio felt about right. Most ideas don’t work. The point of automation isn’t to have better ideas. It’s to try bad ones faster.

Then the numbers plateaued. Experiments 30 through 38 produced nothing worth keeping. I started digging through the logs and found something I hadn’t expected: A linter running on the remote machine had been silently modifying a hyperparameter in train.py. It changed SCALAR_LR from 0.5 to 0.3 every time the agent saved the file. The agent would set the value, commit, and run the experiment, but the linter would alter the file between the save and the execution. The agent had no way to detect this because it checked Git diffs, not the runtime state of the file. Every experiment after a certain point was running with a learning rate the agent never chose.

I lost roughly four hours of compute to this. The agent kept going, proposing new ideas, running experiments, logging results. From its perspective nothing was wrong. The experiments ran, produced numbers, and the numbers were plausible. There was no crash, no error, no alert.

Why this matters beyond my GPU bill

Gartner predicts over 40% of agentic AI projects will be canceled by the end of 2027, citing escalating costs and inadequate risk controls as the primary drivers. My overnight session was a toy example: a single GPU, a small model, and a low-stakes experiment. But the failure pattern scales. An agent that can’t detect when its inputs are being modified between decisions will make the same class of error whether it’s tuning hyperparameters or managing a production pipeline.

The autoresearch constraints are smart: one file, one metric, and Git for state. But they assume the environment is stable. Nobody checks whether something outside the loop is modifying the file between commits. The agent optimizes within its sandbox, and the sandbox has a hole in the wall that nobody thought to look for.

Anyone who has run distributed systems recognizes this. When the linter changed that hyperparameter, it was the equivalent of someone editing a database record between a read and a write. We solved that problem years ago with compare-and-swap, optimistic locking, checksums. We just haven’t brought any of it to autonomous AI workflows. The SkyPilot team recently scaled autoresearch to 16 GPUs and 910 experiments. At that scale, an undetected environment mutation doesn’t cost you four hours. It costs you a cluster.

Next time I run autoresearch, I’ll add a file integrity check before every experiment. It’s three lines of code, but it would have saved me four hours and produced a better final result. The agent did its job. The environment didn’t.

Post topics: AI & ML