Following @karpathy-san’s banger release of autoresearch, many are embracing the use of AI to accelerate AI research.
But autoresearch introduces a new version of a familiar problem: how do you give research the compute it needs without wasting money on compute it doesn’t?
Modal fixes this. To demonstrate, Tony handed Claude Code some relevant Skills, pointed it at OpenAI’s Parameter Golf challenge, and went to sleep.
When he woke up 15 hours later, it had run 113 experiments across 238 GPU-hours, finishing the core training runs 5× faster than a single workstation would while using a fraction of the resources of a dedicated cluster.
Autoresearch 💚 Autoscaling
Research is unpredictable, and so are research workloads. A researcher, or their agent, might need dozens or hundreds of GPUs in parallel for a hyperparameter sweep, then drop to one GPU to debug an issue, then scale back up to several 8-GPU clusters for validation — all within the same work session.
A big always-on reservation gives you that burst capacity, but it’s expensive: you’re paying for the cluster even while the agent is “Thinking…” rather than computing. And most clusters are hard to use. A single instance or workstation is cheap and easy to use, but it forces experiments to run serially, tanking your iteration speed.
What you really want is the best of both worlds: the ease-of-use and cost control of a single machine with the burst capacity of a thicc cluster.
And it’s not just how much compute, it’s also what kind. Debugging a CUDA error calls for an interactive sandbox where the agent can inspect state and iterate quickly. A 12-hour training run calls for a fault-tolerant batch job with retries and checkpointing. A hyperparameter sweep calls for dozens of independent jobs running in parallel.
Traditional cloud infrastructure forces you and your agent to pick one mode and stick with it. What you really want is for the agent itself to decide both how much compute and what kind of compute to use, moment to moment, and have the infrastructure follow.
That’s what Modal provides: the development experience human researchers and agents need, made buttery-smooth and cost-efficient by our custom serverless runtime.
An agent can write a training script, decorate it with @app.function(gpu='H100:8'), then launch it with modal run. If there’s a bug, it can call modal.Sandbox.create(gpu='H100:8') to spin up an interactive Sandbox. Either way, GPUs spin up in seconds, and scaling from one GPU to dozens or hundreds is just a parameter change. When the work is done, they release automatically — no waking up to a surprise bill from an idle cluster left running overnight.
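As a sketch of that flow (the app name, function body, and timeout below are placeholders, not the actual code from this run; assumes an authenticated Modal client):

```python
# Illustrative Modal app definition; names and hyperparameters
# are placeholders, not the code the agent actually wrote.
import modal

app = modal.App("param-golf")

@app.function(gpu="H100:8", timeout=12 * 60 * 60)
def train(lr: float = 3e-4):
    # training loop runs remotely in an 8xH100 container
    ...

# Launch as a batch job from the CLI:
#   modal run train.py
# Or, for interactive debugging, open a Sandbox instead:
#   sb = modal.Sandbox.create(app=app, gpu="H100:8")
```

Scaling out is the same file with a different `gpu=` string or more parallel calls; nothing about the code has to change shape.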
Agents, like humans, already find Modal’s CLI-maxxing, code-mode interface easy to use. But also like humans, they appreciate some docs and guidance. So we wrote a set of Skills that guide the agent on how to use Modal’s compute primitives, including launching interactive Sandboxes, writing and running training jobs with modal run, managing persistent storage with Volumes, and orchestrating parallel sub-agents. Instead of learning Modal’s API from scratch, the agent gets the patterns it needs to provision GPUs, run experiments, and clean up after itself.
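Each Skill is just a markdown file with YAML frontmatter that the agent loads on demand. A hypothetical outline (the name, description, and bullets here are ours, not the shipped Skills):

```markdown
---
name: modal-training-jobs
description: Use when launching long-running GPU training on Modal via `modal run`
---

# Modal training jobs

- Decorate the entrypoint with `@app.function(gpu=..., timeout=...)`.
- Persist checkpoints to a Volume so retried runs can resume.
- Let functions return to release their GPUs; never leave a Sandbox idle.
```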
Parameter Golf
OpenAI’s Parameter Golf challenge asks you to compress a language model into a ≤16 MB artifact that runs on 8×H100 in under 10 minutes, minimizing bits-per-byte. Because Modal let the agent provision and release GPUs through a simple API call, it could scale on its own — spinning up dozens of cheap single-GPU runs when exploring, running 5 parallel 8×H100 experiments when validating, dropping to serial execution when debugging, and scaling to zero when done.
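For context, bits-per-byte is the model’s total cross-entropy on the eval text, converted from nats to bits and divided by the text’s byte length. A minimal sketch with made-up numbers:

```python
import math

def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) over an
    evaluation text into a bits-per-byte score."""
    total_bits = total_nll_nats / math.log(2)  # nats -> bits
    return total_bits / n_bytes

# e.g. 1000 nats of total loss over a 1200-byte eval text -> ~1.20 BPB
score = bits_per_byte(1000.0, 1200)
```

Lower is better: the scores below are the agent driving this number down.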
Stage 1: Starting small for pipeline validation
The agent started with a smoke test: a single-GPU Sandbox via modal.Sandbox.create(), a tiny 8M-parameter model, a quantization pass, and an evaluation run. Four quick experiments over about an hour validated the pipeline end-to-end. Baseline BPB: 1.42.
Stage 2: Scaling out for broad exploration
The agent then launched ~40 independent single-GPU sandboxes in parallel for hyperparameter exploration, followed by 23 more focused runs and 4 larger pushes. BPB improved from 1.40 to 1.37 to 1.34. This phase totaled about 14 GPU-hours across 68 experiments.
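The shape of that fan-out is worth making concrete. On Modal each trial would be a remote GPU function (fanned out across containers); here is the same fan-out/fan-in pattern with a local thread pool and a dummy objective standing in for a real training run:

```python
# Local stand-in for the sweep's fan-out/fan-in shape: on Modal,
# run_trial would be a remote single-GPU function instead.
from concurrent.futures import ThreadPoolExecutor
import itertools

def run_trial(cfg):
    lr, width = cfg
    # dummy "BPB-like" objective standing in for a training run
    return (lr - 3e-4) ** 2 + (width - 512) ** 2 * 1e-8

configs = list(itertools.product([1e-4, 3e-4, 1e-3], [256, 512, 1024]))

with ThreadPoolExecutor(max_workers=len(configs)) as pool:
    scores = list(pool.map(run_trial, configs))

best = min(zip(scores, configs))  # keep the lowest-scoring config
```

The agent ran exactly this loop, just with each trial on its own GPU and a checkpoint written to a Volume.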
Stage 3: Scaling up for validation
Next it switched from gpu='H100' to gpu='H100:8' and ran its top five configurations at full scale: 5 parallel 8×H100 jobs, 40 GPUs simultaneously. BPB dropped from 1.34 to 1.14. The core validation finished in about 4 hours on Modal versus ~20 running serially on a single 8×H100 workstation.
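The back-of-the-envelope behind that speedup, assuming each full-scale run takes roughly four hours:

```python
# Five ~4-hour 8xH100 configurations: serial on one workstation
# versus fanned out across five parallel jobs.
runs, hours_per_run = 5, 4
serial_hours = runs * hours_per_run   # one box, one run at a time
parallel_hours = hours_per_run        # all five at once
speedup = serial_hours / parallel_hours
```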
Stage 4: Scaling back down to debug
The agent then hit a bottleneck: GPTQ quantization was running on CPU and taking over 45 minutes, making the submission miss the 10-minute evaluation budget. It spent ~5.5 hours and 60+ GPU-hours trying larger timeouts before rewriting quantization to run on GPU, bringing the next full experiment down to 52 minutes total.
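The constraint it was fighting is simple to state: every stage of the submitted artifact’s evaluation has to fit inside the 10-minute budget together. A toy check (stage names and timings are illustrative, not measured):

```python
BUDGET_S = 10 * 60  # Parameter Golf's 10-minute evaluation budget

def fits_budget(stage_seconds: dict[str, float]) -> bool:
    """True if the summed stage timings fit inside the budget."""
    return sum(stage_seconds.values()) <= BUDGET_S

# CPU GPTQ at 45+ minutes blows the budget on its own; a GPU
# rewrite (timing illustrative) brings the pipeline back under.
cpu_ok = fits_budget({"load": 20, "quantize_cpu": 45 * 60, "eval": 120})
gpu_ok = fits_budget({"load": 20, "quantize_gpu": 90, "eval": 120})
```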
Stage 5: Scaling back up to finish
With the bottleneck fixed, the agent returned to parallel 8×H100 validation and optimization. Two parallel runs first confirmed the fix, then 5 parallel 8×H100 runs searched architecture, schedule, regularization, and data-mix variants. BPB improved to 1.1230, 1.1217, and finally 1.1206 before diminishing returns set in.
Takeaways
- Agentic research benefits from elastic infrastructure, not fixed cluster reservations.
- The key optimization is matching both the quantity and the mode of compute to the current research stage.
- Modal’s primitives (modal run, Sandbox, Volumes, serverless GPU provisioning) let an agent move between debugging, sweeps, and full validation without replatforming.
- The showcased run completed 113 experiments over 238 GPU-hours while finishing the core training roughly 5× faster than a single workstation.
- The broader claim is that “research vs scaling” is a false trade-off when agents can dynamically provision and release compute.