The fact that a large language model trained to predict natural language text can generate that dialogue has no bearing on the AI’s actual motivations.

What are “actual motivations”?
When you step off distribution, the results look like random garbage to you.
This is false, by the way. Deep learning generalizes quite well, (IMO) probably because the parameter->function map is strongly biased towards good generalizations. Specifically, many of GPT-4’s use cases (like writing code, or being asked to do rap battles of Trump vs Naruto) are “not that similar to the empirical training distribution (represented as a pmf over context windows).”
Imagine capturing an alien and forcing it to act in a play. An intelligent alien actress could learn to say her lines in English, to sing and dance just as the choreographer instructs. That doesn’t provide much assurance about what will happen when you amp up the alien’s intelligence.
This nearly assumes the conclusion: that there is an ‘alien’ with “real” preexisting goals, and that the “rehearsing” is just instrumental alignment. (Simplicia points this out later.)
Simplicia: I agree that the various coherence theorems suggest that the superintelligence at the end of time will have a utility function, which suggests that the intuitive obedience behavior should break down at some point between here and the superintelligence at the end of time.
I actually don’t agree with Simplicia here. I don’t think having a utility function means that the obedience will break down.
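To spell out the coherence-theorem intuition Simplicia is appealing to: an agent with cyclic preferences can be money-pumped, and the theorems say (roughly) that any agent immune to such exploitation behaves as if it maximizes some utility function. A minimal sketch of the pump (the items, fee, and numbers are mine, purely illustrative):

```python
# Toy money pump: an agent with cyclic preferences A > B, B > C, C > A
# will pay a small fee for every "upgrade" and can be driven to arbitrary losses.

# prefers_over[x] is the item the agent strictly prefers to x.
prefers_over = {"B": "A", "C": "B", "A": "C"}  # the cycle

def run_money_pump(start_item: str, fee: float, rounds: int) -> float:
    """Trade the agent around its preference cycle, charging `fee` per trade."""
    item, total_paid = start_item, 0.0
    for _ in range(rounds):
        item = prefers_over[item]  # the agent happily trades "up"...
        total_paid += fee          # ...and pays the fee every time
    return total_paid

print(run_money_pump("A", fee=0.01, rounds=1_000))  # 10.0, unbounded in rounds
```

Note that the conclusion is only “behaves as if maximizing *some* utility function,” which says nothing about that function’s content. A coherent agent whose utility function ranks obedient outcomes highest never has to stop obeying, which is why I don’t buy the breakdown claim.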
If a search process would look for ways to kill you given infinite computing power, you shouldn’t run it with less and hope it doesn’t get that far.
The reason—one of the reasons—that you can’t train a superintelligence by using humans to label good plans, is because at some power level, your planner figures out how to hack the human labeler.
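For concreteness, here is the dynamic the quoted argument appeals to, as a toy simulation (every detail below is made up for illustration): suppose some tiny fraction of candidate plans exploit the labeler, receiving a top label despite being terrible. A weak search rarely surfaces such a plan; a strong search reliably finds one.

```python
import numpy as np

rng = np.random.default_rng(0)
P_HACK = 1e-3  # fraction of candidate plans that fool the labeler

def mean_true_value(n_plans: int, trials: int = 500) -> float:
    """Average true value of the label-maximizing plan, vs. search power."""
    total = 0.0
    for _ in range(trials):
        true_vals = rng.normal(size=n_plans)                      # what we care about
        labels = true_vals + rng.normal(scale=0.3, size=n_plans)  # noisy human labels
        hacks = rng.random(n_plans) < P_HACK                      # labeler-exploiting plans
        labels[hacks] = 100.0     # the labeler is fooled: maximal label...
        true_vals[hacks] = -10.0  # ...for a plan that is actually terrible
        total += true_vals[np.argmax(labels)]
    return total / trials

for n in [10, 100, 1_000, 30_000]:
    print(f"search over {n:>6} plans -> mean true value {mean_true_value(n):+.2f}")
```

With weak search the mean true value rises; once the search is wide enough to reliably contain a labeler-exploiting plan, it collapses. Whether this best-of-n picture transfers to policies trained by SGD is exactly what’s at issue in my last remark below.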
Don’t design agents which exploit adversarial inputs.
Again, it’s probably due to the parameter->function mapping: there are vastly more “generalizing” parameterizations of the network. SGD/“implicit regularization” can be shown to be irrelevant in at least one simple toy problem. (Quintin Pope writes about related considerations.)
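One quick way to see the bias (a sketch in the spirit of the simplicity-bias experiments of Valle-Pérez et al.; the architecture and numbers are arbitrary choices of mine): sample many random weight settings for a tiny network over boolean inputs and count which functions they compute. The counts come out wildly non-uniform.

```python
import numpy as np
from collections import Counter
from itertools import product

rng = np.random.default_rng(0)

# All 2^5 = 32 inputs to a tiny 5-16-1 network; 2^32 possible boolean functions.
X = np.array(list(product([-1.0, 1.0], repeat=5)))

def sample_function() -> str:
    """Draw random Gaussian weights; return the induced function as a bit-string."""
    W1, b1, W2 = rng.normal(size=(5, 16)), rng.normal(size=16), rng.normal(size=16)
    out = np.tanh(X @ W1 + b1) @ W2
    return "".join("1" if o > 0 else "0" for o in out)

counts = Counter(sample_function() for _ in range(100_000))
print("distinct functions seen:", len(counts))
print("top-5 frequencies:", [c / 100_000 for _, c in counts.most_common(5)])
# A uniform parameter->function map would make almost every sampled function
# unique; instead, a few simple functions (constants, near-dictators)
# typically soak up a disproportionate share of the parameter volume.
```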
A flatly invalid reasoning step. Reward is not the optimization target, and neither are labels.
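To gesture at that point with a toy (again, all details here are mine, not anyone’s actual training setup): in a REINFORCE-style learner, reward appears only inside the weight update, never in the policy’s forward pass. The trained artifact is a fixed computation that doesn’t represent, query, or pursue reward.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(2)  # a two-armed bandit policy: just two numbers

def act() -> int:
    p = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(2, p=p))  # the forward pass never touches reward

def update(action: int, reward: float, lr: float = 0.1) -> None:
    """Reward enters ONLY here, as a signal that adjusts the weights."""
    p = np.exp(logits) / np.exp(logits).sum()
    grad = -p
    grad[action] += 1.0            # gradient of log pi(action) w.r.t. logits
    logits[:] += lr * reward * grad

for _ in range(2_000):
    a = act()
    update(a, reward=1.0 if a == 1 else 0.0)  # arm 1 gets reinforced

print("policy after training:", np.exp(logits) / np.exp(logits).sum())
# Swap in a different reward function now and the policy's behavior is
# unchanged until further updates: reward shaped which computations exist;
# it is not a quantity the policy optimizes at runtime.
```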