Daniel Kokotajlo comments on Evaluating the historical value misspecification argument

Daniel Kokotajlo 8 Dec 2023 6:02 UTC
LW: 2 AF: 2
0
AF
To me, a solution to inner alignment would mean that we’ve solved the problem of malign generalization. To be a bit more concrete, this roughly means that we’ve solved the problem of training an AI to follow a set of objectives in a way that generalizes to inputs that are outside of the training distribution, including after the AI has been deployed.

This is underspecified, I think, since we have for years had AIs that follow objectives in ways that generalize to inputs outside of the training distribution. The thing is there are lots of ways to generalize / lots of objectives they could learn to follow, and we don’t have a good way of pinning it down to exactly the ones we want. (And indeed as our AIs get smarter there will be new ways of generalizing / categories of objectives that will become available, such as “play the training game”)

So it sounds like you are saying “A solution to inner alignment mans that we’ve figured out how to train an AI to have the objectives we want it to have, robustly such that it continues to have them way off distribution.” This sounds like basically the whole alignment problem to me?

I see later you say you mean the second thing—which is interestingly in between “play the training game” and “actually be honest/helpful/harmless/etc.” (A case that distinguishes it from the latter: Suppose it is reading a paper containing an adversarial example for the RM, i.e. some text it can output that causes the RM to give it a high score even though the text is super harmful / dishonest / etc. If it’s objective is the “do what the RM would give high score to if it was operating normally” objective, it’ll basically wirehead on that adversarial example once it learns about it, even if it’s in deployment and it isn’t getting trained anymore, and even though it’s an obviously harmful/dishonest piece of text.

It’s a nontrivial and plausible claim you may be making—that this sort of middle ground might be enough for safe AGI, when combined with the rest of the plan at least. But I’d like to see it spelled out. I’m pretty skeptical right now.