Source
- Title: What I learned this week - Pretraining parallelisms, Can distillation be stopped, Mythos and the cybersecurity equilibrium, Pipeline RL, On why pretraining runs fail
- Author: Dwarkesh Patel
- Site: Dwarkesh Podcast / Substack
- URL: https://www.dwarkesh.com/p/what-i-learned-april-15
- Published: 2026-04-15
- Saved: 2026-04-17
Summary
A rough notebook-style post collecting short takes on frontier-model distillation, why pretraining runs fail, pretraining parallelism design tradeoffs, AI-enabled cyber offense/defense dynamics, and pipeline RL. For this wiki’s current focus, the most reusable idea is the claim that coding products can distill frontier models from users’ accepted end states (“gold diffs”) and from visible local tool-use traces.
Key takeaways
- Hiding chain of thought may not be enough to block distillation if behavior can still be reconstructed from outputs, RL targets, or tool-use traces.
- Local agentic coding workflows are strategically hard for model providers to hide because file edits, bash commands, and accepted patches happen on the user’s machine.
- Coding products built on frontier APIs may be able to train stronger in-house models by rewarding outputs that converge toward user-accepted final diffs instead of intermediate rejected attempts.
- Pretraining failures often come from subtle causality violations, token dropping, numerical bias, and other scale-sensitive systems bugs.
- Horace He's framing: start with FSDP as the default, then add pipeline parallelism only when the communication/computation crossover forces it.
- Pipeline RL tries to reduce rollout straggler waste by swapping in newer weights during long in-flight trajectories.
Relevant sections
Distillation and coding products
The article argues that the real value of agentic coding models may live not only in their text outputs but also in their visible tool use. If coding happens locally, model providers cannot easily hide the traces. Product companies may then use the user-accepted final patch as a supervised or RL target, effectively distilling the frontier model through interaction logs and outcome selection.
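The "gold diff" idea can be made concrete with a small sketch: from a log of a session's attempts, keep only the user-accepted end state as the training target and discard rejected intermediates. This is my reading of the mechanism, not code from the post; the log schema (`session_id`, `prompt`, `diff`, `accepted`) is a hypothetical illustration.

```python
# Sketch: distill from accepted end states ("gold diffs"), not from
# intermediate rejected attempts. The log schema is hypothetical.

def build_sft_examples(log):
    """Return (prompt, target_diff) pairs, one per session, taken from
    the last user-accepted diff in that session."""
    by_session = {}
    for event in log:
        by_session.setdefault(event["session_id"], []).append(event)
    examples = []
    for events in by_session.values():
        accepted = [e for e in events if e["accepted"]]
        if accepted:
            final = accepted[-1]  # the user-accepted end state
            examples.append((final["prompt"], final["diff"]))
    return examples

log = [
    {"session_id": "s1", "prompt": "fix bug", "diff": "attempt-1", "accepted": False},
    {"session_id": "s1", "prompt": "fix bug", "diff": "attempt-2", "accepted": True},
]
print(build_sft_examples(log))  # [('fix bug', 'attempt-2')]
```

The same selection rule can also serve as an RL reward signal: score trajectories by similarity of their final patch to the accepted diff rather than imitating intermediate steps.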
Pretraining reliability
The post highlights several ways scaling runs can fail: expert-choice routing that breaks causality, token dropping, FP16 collective precision issues, and inference/training drift in RL settings.
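The causality failures mentioned here (e.g. expert-choice routing that lets position t see future tokens) are detectable with a simple probe: perturb a token after position t and check that outputs up to t are unchanged. A minimal sketch with toy "models" standing in for real networks:

```python
import random

def causality_check(model_fn, seq_len=8, vocab=10, trials=20):
    """Outputs at position t must be invariant to changes in tokens after t."""
    for _ in range(trials):
        x = [random.randrange(vocab) for _ in range(seq_len)]
        y = model_fn(x)
        t = random.randrange(seq_len - 1)
        x2 = list(x)
        x2[t + 1] = (x2[t + 1] + 1) % vocab  # perturb a future token
        if model_fn(x2)[: t + 1] != y[: t + 1]:
            return False  # future token leaked into a past position
    return True

causal = lambda x: [sum(x[: i + 1]) for i in range(len(x))]  # prefix sums: causal
leaky = lambda x: [sum(x) for _ in x]  # every output sees the whole sequence
print(causality_check(causal))  # True
print(causality_check(leaky))   # False
```

In a real run the same probe would compare logits from a forward pass; the point is that these bugs are cheap to test for and expensive to discover at scale.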
Parallelism notes
The post presents FSDP as the preferred default until communication overhead and batch-size floors force pipeline parallelism or other forms of model parallelism.
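The crossover can be sketched as a back-of-envelope comparison: FSDP all-gathers each layer's shards, and it stays attractive only while that communication can hide behind the layer's compute. All numbers below are illustrative assumptions, not figures from the post.

```python
# Back-of-envelope FSDP crossover check (illustrative numbers).
# FSDP is fine while per-layer all-gather time fits under per-layer
# compute time; once comm dominates (e.g. batch-size floors shrink
# tokens per device), pipeline/model parallelism starts to pay off.

def fsdp_comm_bound(params_per_layer, bytes_per_param, interconnect_gbps,
                    flops_per_token_layer, tokens_per_device, device_tflops):
    comm_s = params_per_layer * bytes_per_param / (interconnect_gbps * 1e9)
    compute_s = flops_per_token_layer * tokens_per_device / (device_tflops * 1e12)
    return comm_s > compute_s  # True -> comm no longer hides behind compute

# 1B-param layer, bf16 shards, 400 Gb/s interconnect, ~6P flops/token (fwd+bwd):
print(fsdp_comm_bound(1e9, 2, 400, 6e9, 1024, 300))  # False: compute hides comm
print(fsdp_comm_bound(1e9, 2, 400, 6e9, 64, 300))    # True: comm-bound at small batch
```

This is the "batch-size floor" intuition: shrinking tokens per device shrinks compute time linearly while the all-gather cost stays fixed.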
Cybersecurity and pipeline RL
The cybersecurity section suggests agentic gains may mostly come from combining multiple vulnerabilities into full exploit chains. The pipeline RL section focuses on improving utilization when rollout lengths become highly variable.
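The pipeline RL mechanism can be sketched as a rollout loop that generates in chunks and reloads the newest policy between chunks, instead of pinning the weights it started with until the trajectory finishes. This is my toy rendering of the idea, not code from the post; the chunked generator and version counter are hypothetical.

```python
# Toy sketch of in-flight weight swapping: a long rollout is produced
# chunk by chunk, and each chunk uses whatever policy version the
# trainer has most recently published.

def rollout(current_version, chunk_len=4, total_len=12):
    """current_version: callable returning the latest published version id."""
    trajectory, used_versions = [], []
    while len(trajectory) < total_len:
        v = current_version()  # swap in the newest weights for this chunk
        used_versions.append(v)
        trajectory.extend([f"tok@v{v}"] * chunk_len)  # generate one chunk
    return trajectory, used_versions

versions = iter([0, 0, 1])  # trainer publishes v1 while the rollout is in flight
traj, used = rollout(lambda: next(versions))
print(used)  # [0, 0, 1]: the final chunk already used the updated policy
```

The payoff is utilization: slow, long rollouts no longer force the trainer to wait on stale stragglers, at the cost of trajectories that mix policy versions.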