Simple argument that imitation learning is the easiest route to alignment:
Any AI aligned to you needs to represent you in enough detail to fully understand your preferences / values, AND maintain a stable pointer to that representation of value (that is, it needs to care). The second part is surprisingly hard to get exactly right.
Imitation learning basically just does the first part—it builds a model of you, which automatically contains your values, and running that model then optimizes your values in the same way that you do. This has to be done faithfully for the approach to work safely—the model has to continue acting like you would in new circumstances (out of distribution) and when it runs for a long time—which is nontrivial.
That is, faithful imitation learning is kind of alignment-complete: it solves alignment, and any other solution to alignment kind of has to solve imitation learning implicitly, by building a model of your preferences.
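To make "just does the first part" concrete, here is a minimal behavioral-cloning sketch of the simplest version of the idea. Everything in it (`ImitatorNet`, `demo_states`, `demo_actions`, the architecture and hyperparameters) is an illustrative stand-in, not a proposal for how to do this faithfully:

```python
# Illustrative behavioral cloning: fit a model to predict the demonstrator's
# action in each recorded state. ImitatorNet, demo_states, and demo_actions
# are hypothetical placeholders, not part of the original argument.
import torch
import torch.nn as nn

class ImitatorNet(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, n_actions),  # logits over the demonstrated actions
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

def behavioral_cloning(model, demo_states, demo_actions, epochs=20, lr=1e-3):
    """Maximize the likelihood of the human's recorded actions."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        logits = model(demo_states)           # (N, n_actions)
        loss = loss_fn(logits, demo_actions)  # match the human's choices
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```

Running the trained model only "optimizes your values the same way you do" to the extent that the cloning is faithful, which is exactly the hard part discussed below.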
I think people (other than @michaelcohen) mostly haven’t realized this for two reasons: the idea doesn’t sound sophisticated enough, and it’s easy to point at problems with naive implementations.
Imitation learning is not a new idea, so you don’t sound very smart or informed by suggesting it as a solution.
And implementing it faithfully does face barriers! You have to solve “inner optimization problems,” which basically come down to the model generalizing properly, even under continual / lifelong learning. In other words, the learned model should be a model in the strict sense of a simulation (perhaps at some appropriate level of abstraction). This really is hard! And I think people assume that anyone suggesting that imitation learning can be safe doesn’t appreciate how hard it is. But I think it’s hard in the somewhat familiar sense that you need to solve a lot of tough engineering and theory problems—and a bit of philosophy. However, it’s not as intractably hard as solving all of decision theory etc. I believe that with a careful approach, the capabilities of an imitation learner do not generalize further than its alignment, so it is possible to get feedback from reality and iterate—because the model’s agency comes from imitating an agent that is already aligned (and, with care, is NOT emergent as an inner optimizer).
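To give one crude picture of what "getting feedback from reality" could look like, here is a hedged sketch (continuing the hypothetical setup above) of a behavioral proxy for faithfulness: checking whether the imitator still agrees with the demonstrator on held-out, distribution-shifted situations. This obviously does not settle the inner-optimizer worry by itself:

```python
# Crude behavioral proxy for faithfulness: agreement with the demonstrator on
# held-out states drawn from a shifted distribution. Names are illustrative.
import torch

@torch.no_grad()
def agreement_rate(model, shifted_states, human_actions) -> float:
    """Fraction of out-of-distribution states where the imitator picks the
    same action the human actually took."""
    preds = model(shifted_states).argmax(dim=-1)
    return (preds == human_actions).float().mean().item()

# One possible loop: keep deploying only while agreement on shifted data stays
# high; otherwise gather more demonstrations and retrain (continual learning).
```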
Also, you still need to work out how to let the learned model (hopefully a faithful simulation of a human) recursively self-improve safely. But notice how much progress has already been made at this point! If you’ve got a faithful simulation of a human, you’re in a very different and much better situation. You can run that simulation faster as technology advances, meaning you aren’t immediately left in the dust by LLM scaling—you can have justified trust in an effectively superhuman alignment researcher. And recursive self-improvement is probably easier than alignment from scratch.
I think we need to take this strategy a lot more seriously.
Here’s a longer sketch of what this should look like: https://www.lesswrong.com/posts/AzFxTMFfkTt4mhMKt/alignment-as-uploading-with-more-steps