AutoLab
Overview
AutoLab is a benchmark and task suite for measuring whether AI agents can participate in real empirical improvement loops rather than only solve one-shot reasoning problems.
Why it matters
- It reframes evaluation around iterative experimentation instead of static answer accuracy.
- It gives a concrete way to study whether agents can improve systems by running, measuring, and revising under real constraints.
- It complements examples like autoresearch by turning the broader capability into a reusable benchmark.
Core framing
The article argues that frontier scientific and engineering progress is produced by loops of propose → test → measure → revise. AutoLab is designed to test whether a model can stay productive inside that loop when there is no answer key and the path to improvement must be discovered empirically.
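The propose → test → measure → revise loop can be sketched as a minimal program. This is an illustration only, not AutoLab's harness: `propose`, `measure`, the `lr` knob, and the toy objective are all assumptions invented for the sketch.

```python
import random

def propose(best_config):
    """Propose a candidate by perturbing the current best (hypothetical heuristic)."""
    return {"lr": best_config["lr"] * random.choice([0.5, 1.0, 2.0])}

def measure(config):
    """Stand-in for running an experiment; returns a noisy score (toy objective)."""
    # Peaked at lr = 0.1; the noise term mimics run-to-run variance.
    return -abs(config["lr"] - 0.1) + random.gauss(0, 0.01)

def improvement_loop(budget=20):
    """Keep whichever candidate measured better; there is no answer key to consult."""
    best = {"lr": 1.0}
    best_score = measure(best)
    for _ in range(budget):                  # finite experiment budget
        candidate = propose(best)            # propose
        score = measure(candidate)           # test + measure
        if score > best_score:               # revise: adopt what the data preferred
            best, best_score = candidate, score
    return best, best_score
```

The point of the sketch is the shape of the loop: the only signal the agent gets is a measurement, so progress has to be discovered empirically rather than read off a reference answer.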
What AutoLab measures
The benchmark emphasizes closed-loop resilience:
- handling negative empirical feedback without losing direction
- diagnosing why a run failed or underperformed
- deciding whether to continue refining the current approach or pivot to a new one
- making progress under finite budgets, noisy signals, and open-ended search spaces
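The refine-or-pivot decision in the list above can be written down as a simple control policy. Everything here is an illustrative assumption rather than AutoLab's actual mechanics: the `patience` stall heuristic, the `approaches` interface (a callable per candidate approach), and the pivot rule are invented for the sketch.

```python
def run_with_pivots(approaches, budget, patience=3):
    """Hypothetical policy: refine the current approach until it stalls for
    `patience` consecutive evaluations, then pivot to the next approach."""
    best_overall = float("-inf")
    stale = 0
    idx = 0
    history = []                                  # (approach index, score) per step
    for step in range(budget):
        approach = approaches[idx % len(approaches)]
        score = approach(step)                    # one (possibly noisy) experiment
        history.append((idx % len(approaches), score))
        if score > best_overall:
            best_overall = score
            stale = 0                             # negative feedback absorbed: progress
        else:
            stale += 1                            # underperforming run; diagnose
        if stale >= patience:                     # persistent stall -> pivot
            idx += 1
            stale = 0
    return best_overall, history
```

A closed-loop benchmark can then score not just the final metric but the trajectory in `history`: how quickly an agent abandons a dead end versus how long it persists with a promising but noisy approach.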
Task design
- The first release contains 23 tasks.
- Tasks span systems engineering and frontier AI research.
- Each task starts from a working but unoptimized environment.
- Agents must improve a measurable objective within a fixed budget.
- The benchmark includes objectives along three axes: faster, smarter, and smaller.
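The task-design bullets suggest a common shape for each task: a baseline metric, an objective axis, and a budget. A minimal sketch of that shape follows; the field names and the scoring rule are assumptions for illustration, not the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskSpec:
    """Illustrative shape of an AutoLab-style task (hypothetical field names)."""
    name: str
    axis: str              # "faster" | "smarter" | "smaller"
    baseline: float        # metric of the working-but-unoptimized starting point
    budget: int            # e.g. number of experiment runs allowed
    higher_is_better: bool # True for accuracy-like metrics, False for latency/size

def improvement(task: TaskSpec, final: float) -> float:
    """Signed relative improvement over the baseline, oriented so positive = better."""
    delta = final - task.baseline
    if not task.higher_is_better:
        delta = -delta
    return delta / abs(task.baseline)
```

Orienting the score so that positive always means better lets "faster", "smarter", and "smaller" tasks share one leaderboard axis even though their raw metrics point in different directions.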
Representative examples
- Data-selection tasks reward agents that inspect failure distributions and revise their sample-selection strategy based on observed training behavior.
- Compression-style tasks reward architectural restructuring over endless local tweaking, making the benchmark sensitive to whether an agent can change frames rather than only optimizing within one.
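One way to picture the data-selection pattern above: the agent inspects where the model currently fails and oversamples those regions. This is a hedged sketch of that idea only; the `category`-based example format, `select_training_batch`, and the ranking heuristic are invented here, not taken from AutoLab.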
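A minimal version of failure-driven data selection, under the assumption that examples and observed failures both carry a `category` tag (an invented schema for this sketch):

```python
from collections import Counter

def select_training_batch(examples, failures, k):
    """Hypothetical data-selection step: prioritize examples from the
    categories where the model currently fails most often."""
    fail_counts = Counter(f["category"] for f in failures)
    # Rank examples by how failure-prone their category is.
    # sorted() is stable, so ties keep their original order.
    ranked = sorted(
        examples,
        key=lambda ex: fail_counts.get(ex["category"], 0),
        reverse=True,
    )
    return ranked[:k]
```

Because the selection is driven by the observed failure distribution, the batch changes as training behavior changes, which is exactly the revise-from-measurement step the task rewards.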
Relationship to other pages
- autoresearch shows a narrow but real autonomous research loop in practice.
- agentic-research-autoscaling focuses on the infrastructure needed to support such loops.
- background-coding-agents and multi-agent-workflows describe adjacent patterns for autonomous software work; AutoLab is the cleaner benchmark analog for research-style iteration.