Well, it looks to me like AI will soon understand our values at least as well as we do. I think it's far more likely that AI goes wrong by understanding completely what we want and not wanting to do it than by the paperclip route.
That is the paperclip route. A superintelligent paperclip optimizer understands what we want, because it is superintelligent, but it wants to make “paperclips” instead.
Yes, but the question of whether pretrained LLMs have good representations of our values and/or our preferences and the concept of deference/obedience is still quite important for whether they become aligned. If they don't, then aligning them via fine-tuning after the fact seems quite hard. If they do, it seems pretty plausible to me that e.g. RLHF fine-tuning or something like Anthropic's constitutional AI finds the solution of “link the values/obedience representations to the output in a way that causes aligned behavior,” because this is simple and attains lower loss than misaligned paths. This in turn is because, in order for the model to be misaligned and still attain low loss, it must be deceptively aligned, and deceptive alignment requires a combination of good situational awareness, a fully consequentialist objective, and high-quality planning/deception skills.
What does GPT want?
I don’t know.
My model of foundational LLMs, before tuning and prompting, is that they want to predict the next token, assuming that the token stream is taken from the hypothetical set that their training data is sampled from. Their behavior out of distribution is not well-defined in this model.
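To make “wanting to predict the next token” concrete, here is a minimal sketch of the pretraining objective (my own illustration, assuming PyTorch and a toy stand-in for a real transformer): the only thing training pushes on is cross-entropy over next tokens drawn from the training distribution, which is exactly why behavior off that distribution is left undefined in this model.

```python
# Minimal sketch of the pretraining objective (toy model, not a real LLM).
# "Wanting to predict the next token" = minimizing this cross-entropy on
# token streams sampled from the training distribution; nothing here pins
# down behavior on streams outside that distribution.
import torch
import torch.nn.functional as F

vocab_size, d_model = 100, 32
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, d_model),
    torch.nn.Linear(d_model, vocab_size),
)  # stand-in for a real transformer

tokens = torch.randint(0, vocab_size, (1, 16))   # a stream "from the training distribution"
logits = model(tokens[:, :-1])                   # predict each next token
loss = F.cross_entropy(
    logits.reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
loss.backward()                                  # the only "want": push this number down
```

Nothing in that loop refers to what the tokens mean, so any further “wants” have to come from tuning and prompting afterward.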
My model of typical tuned and prompted LLMs is that they mostly want to do the thing they have been tuned and prompted to do, but also have additional wants that cause them to diverge in unpredictable ways.
They don’t “want” anything and thinking of them as having wants leads to confused thinking.