Are extrapolation-based AIs alignable?

(This is an account of my checking a certain alignment idea and finding that it doesn’t work. Also my thinking is pretty naive and could easily be wrong.)

When thinking about AIs that are trained on some dataset and learn to extrapolate it, like the current crop of LLMs, I asked myself: can such an AI be aligned purely by choosing an appropriate dataset to train on? In other words, does there exist any dataset such that generating extrapolations from it leads to good outcomes, even in the hands of bad actors? If we had such a dataset, we’d have an aligned AI.

But unfortunately it seems hard. For example if the dataset includes instructions to build a nuke, then a bad actor could just ask for that. Moreover, if there’s any circumstance at all under which we want the AI to say “here’s the instructions to build a nuke” (to help a good actor stop an incoming asteroid, say), then a bad actor could extrapolate from that phrase and get the same result.

It seems the problem is that extrapolation doesn’t have situational awareness. If the AI is based on extrapolating a certain dataset, there’s no way to encode in the dataset itself which parts of it can be said when. And putting a thin wrapper on top, like ChatGPT, doesn’t seem to help much, because from what I’ve seen it’s easy enough to bypass.

What is the hope for alignment, then? Can we build an AI with situational awareness from the ground up, not relying on an “extrapolation core” (because the core would itself be an unaligned AI that bad actors could use)? I don’t know.

EDIT: the sequel to this post is Aligned AI as a wrapper around an LLM.