karpathy/autoresearch | DeepWiki

Summary

DeepWiki describes karpathy/autoresearch as an autonomous ML research framework built around a deliberately tiny code surface:

  • prepare.py is immutable (off-limits to the agent) and owns data preparation, tokenizer training, shared constants, and evaluation.
  • train.py is the mutable core that the agent edits to try ideas.
  • program.md is the human-authored research brief that tells the agent what to optimize and how to behave.

The system is designed to let an AI agent run repeated overnight experiments on a small but real LLM training setup. In each loop, the agent edits train.py, runs training for a fixed 5-minute wall-clock budget, extracts val_bpb, decides whether the change improved the result, and either keeps or discards the commit.
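The keep/discard logic of that loop can be sketched in a few lines. This is an illustrative reconstruction, not the repo's actual code: function names like frontier and the val_bpb values are made up, and the real loop edits train.py and shells out to a timed training run rather than reading from a list.

```python
def keep_if_improved(best_bpb, candidate_bpb):
    """Lower val_bpb is better; keep a commit only on strict improvement."""
    return candidate_bpb < best_bpb

def frontier(results):
    """Walk a sequence of (name, val_bpb) attempts; return the kept ones."""
    best = float("inf")
    kept = []
    for name, bpb in results:
        if keep_if_improved(best, bpb):
            best = bpb
            kept.append((name, bpb))  # commit kept, frontier advances
        # otherwise the attempt is logged but its commit is discarded
    return kept

attempts = [("baseline", 1.42), ("wider-mlp", 1.45), ("lr-warmup", 1.38)]
print(frontier(attempts))  # → [('baseline', 1.42), ('lr-warmup', 1.38)]
```

Because every run gets the same 5-minute budget, a strict "lower val_bpb wins" comparison like this is sufficient; no per-run normalization is needed.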

Key ideas surfaced by DeepWiki

  • Fixed-time optimization: every experiment gets the same 5-minute budget, making runs directly comparable on a given machine.
  • Single metric: the main target is val_bpb (validation bits per byte), chosen because it remains comparable even if the tokenizer or vocabulary changes.
  • Single-file mutation: constraining edits to train.py keeps the search space manageable and diffs reviewable.
  • Human as org designer: instead of editing Python directly, the human mainly edits program.md, which acts like lightweight “research org code” for the agent.
  • Keep/discard frontier: all attempts are logged, but only improvements advance the git branch frontier.
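The bits-per-byte metric mentioned above can be sketched as follows. This assumes the standard definition of bits per byte (summed cross-entropy loss converted from nats to bits, divided by the raw byte count of the split); the function name and the example numbers are illustrative, not taken from the repo.

```python
import math

def bits_per_byte(total_loss_nats, total_bytes):
    """Convert summed cross-entropy (in nats) over a text split into
    bits per byte. Normalizing by raw bytes rather than tokens is what
    keeps the metric comparable when the tokenizer or vocabulary changes."""
    return total_loss_nats / (math.log(2) * total_bytes)

# e.g. 1000 nats of summed loss over 800 bytes of validation text
print(round(bits_per_byte(1000.0, 800), 4))
```

A tokenizer change alters the number of tokens and the per-token loss, but the byte count of the validation text stays fixed, so val_bpb remains a fair comparison across such edits.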

Useful references

  • DeepWiki sections include: overview, system architecture, design principles, getting started, agent operation, metrics/evaluation, and advanced topics.
  • README quick start uses uv sync, uv run prepare.py, and uv run train.py.
  • The default workflow targets a single NVIDIA GPU and was tested on H100-class hardware.
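Put together, the quick start amounts to a short command sequence. The uv commands come from the summary above; the GPU check is an illustrative addition, and actually running this requires the cloned repo and a CUDA-capable machine.

```shell
nvidia-smi            # confirm a single NVIDIA GPU is visible
uv sync               # install dependencies
uv run prepare.py     # data prep, tokenizer training, eval setup (immutable)
uv run train.py       # one fixed-budget training run; reports val_bpb
```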