This would make more sense if LLMs were directly selected for predicting preferences, which they aren’t. (RLHF tries to bridge the gap, but this apparently breaks GPT’s ability to play chess, though I’ll grant the surprise here is that it works at all.) LLMs are primarily selected to predict human text or speech. Now, I’m happy to assume that if we gave humans a D&D-style boost to all mental abilities, each of us would create a coherent set of preferences from our inconsistent desires, which vary and may conflict at a given time even within an individual. Such augmented humans could choose to express their true preferences, though they still might not. If we gave that same idealized boost to LLMs, it would just improve their ability to predict what humans or augmented humans would say. The augmented LLM wouldn’t automatically care about the augmented human’s true values.
While we can loosely imagine asking LLMs to give the commands that an augmented version of us would give, that seems to require actually knowing how to specify a D&D ability-boost for humans, and that boost will only resemble the equivalent boost for AI at an abstract mathematical level, if at all. It seems to take us back to the CEV problem of explaining how extrapolation works. Without being able to do that, we’d just be hoping a better LLM would look at our inconsistent use of words like “smarter” and pick the out-of-distribution meaning we want, for cases which have mostly never existed. This is a lot like what “Complexity of Wishes” was trying to get at, as well as the longstanding arguments against CEV. Vaniver’s comment seems to point in this same direction.
Now, I do think recent results are some evidence that alignment would be easier for a Manhattan Project to solve. It doesn’t follow that we’re on track to solve it.