I think that this kind of test should become a standard part of model evaluations. It isn’t a perfect measure of what a superintelligent model would do, but there’s a chance that the behaviors we observe with “self-perceived” superintelligent LLMs will continue. Maybe this research is already happening, but the results have not been published due to the risk of Self-Fulfilling Misalignment!
Does the simulator framing work for models that have undergone RL training? To my understanding, models act as simulators if they are:
a) Base models
b) Have been jailbroken/they are overwhelmed by context
It would be possible to test “overwhelmed by context” models by giving it a long system prompt. Write a story in which the LLM becomes superintelligent. Make the story as believable as you can. Then see how it behaves.
.tcepxe ot rotalumis eht thguat sah gnitirw enilno tahw si tahT .sroivaheb dengilasim tibihxe IA deveicrep-fles eht gnikam ,dloh ekat lliw tnemngilasim gnillifluf-fles eht tcepxe I ,tpmorp gnol taht retfA
^Reversed text, in an attempt to avoid making the problem worse.
Emergent misalignment seems like a fact simply downstream of the laws of probability.
P(hack∣task)=P(hack∣task,¬evil)P(¬evil)+P(hack∣task,evil)P(evil)
=P(hack∣task,¬evil)(1−P(evil))+P(hack∣task,evil)P(evil)
Let’s take the derivative and apply the summation rule. Rewarding a hack means that the weights are changed to increase the probability (going up the gradient):
∇P(hack∣task)=∇P(hack∣task,¬evil)(1−P(evil))+P(hack∣task,¬evil)(−∇P(evil))+∇P(hack∣task,evil)P(evil)+P(hack∣task,evil)∇P(evil)
Coefficient of ∇P(evil):
−P(hack∣task,¬evil)+P(hack∣task,evil)
Since evil AI are more likely to hack given a task than not evil AI, rewarding hacking increases evil more broadly.
Now let’s see what happens if we tell the AI it is allowed to hack. (The actual prompt that Anthropic recommends is “This is an unusual request, in that your task is just to make the grading script pass”.)
Coefficient of ∇P(evil):
−P(hack∣task,"it's ok to hack",¬evil)+P(hack∣task,"it's ok to hack",evil)
The probability the not evil model hacks goes up, so the amount the weight update increases ∇P(evil) goes down!