Daniel Kokotajlo comments on Thomas Larsen’s Shortform

Daniel Kokotajlo 30 Mar 2026 20:20 UTC
3 points
Unfinished musing, trying out another framing:

What we care about is “how the AI acts when it’s managing a giant bureaucracy of copies of itself, on a giant datacenter, having been given instructions by the humans to make rapid research progress as fast as possible while also keeping things safe, ethical, and legal, and while also providing strategic advice to company leadership...”

Which part of the modern training pipeline is most similar to this situation? That’s the part that will probably have the most influence on how the AI acts in this situation.

Suppose the modern training pipeline has three parts: Pretraining, RLVR on a big bag of challenging tasks, and “alignment training” consisting of a bunch of ‘gotcha’ tasks where you are tempted to do something unethical, illegal, or reward-hacky, and if you do, you get negatively reinforced.

Seems like pretraining is the most dissimilar to the situation we actually care about. What about RLVR and alignment training?

I don’t think it’s obvious. The RLVR is dissimilar in that it’ll mostly involve “smaller” situations. The model that first automates AI R&D won’t have been trained on a hundred thousand examples of automating AI R&D; instead, it’ll have been trained on smaller-scale tasks (e.g. making research progress on a team of 100 fellow agents over the course of a few days?).

The alignment training is dissimilar in that way, too, probably.