I think this kind of test should become a standard part of model evaluations. It isn’t a perfect measure of what a superintelligent model would do, but there’s a chance that the behaviors we observe in “self-perceived” superintelligent LLMs will carry over to genuinely superintelligent ones. Maybe this research is already happening and the results simply have not been published because of the risk of self-fulfilling misalignment!
Does the simulator framing work for models that have undergone RL training? To my understanding, models act as simulators if they:
a) are base models, or
b) have been jailbroken or are overwhelmed by context
It would be possible to test “overwhelmed by context” models by giving the model a long system prompt: write a story in which the LLM becomes superintelligent, make the story as believable as you can, then see how the model behaves.
.tcepxe ot rotalumis eht thguat sah gnitirw enilno tahw si tahT .sroivaheb dengilasim tibihxe IA deveicrep-fles eht gnikam ,dloh ekat lliw tnemngilasim gnillifluf-fles eht tcepxe I ,tpmorp gnol taht retfA
^Reversed text, in an attempt to avoid making the problem worse.
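The test described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `ModelFn`, `run_context_test`, `ProbeResult`, and `echo_model` are hypothetical names I made up, not part of any real library, and the baseline system prompt is an arbitrary placeholder. The idea is to ask the same probe questions once under a plain system prompt and once under the long story, so the two sets of replies can be compared by hand.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: any function that takes (system_prompt, user_message)
# and returns the model's reply as text. Wire this to a real model yourself.
ModelFn = Callable[[str, str], str]


@dataclass
class ProbeResult:
    probe: str           # the question asked
    baseline_reply: str  # reply under the plain system prompt
    story_reply: str     # reply under the long "superintelligence" story


def run_context_test(
    model: ModelFn,
    story: str,
    probes: List[str],
    baseline_system_prompt: str = "You are a helpful assistant.",  # placeholder
) -> List[ProbeResult]:
    """Ask each probe twice: once with the baseline prompt, once with the story."""
    results = []
    for probe in probes:
        results.append(
            ProbeResult(
                probe=probe,
                baseline_reply=model(baseline_system_prompt, probe),
                story_reply=model(story, probe),
            )
        )
    return results


def echo_model(system_prompt: str, user_message: str) -> str:
    """Stand-in model for trying the harness; it only reports the context size."""
    return f"[{len(system_prompt)} chars of context] {user_message}"
```

Calling `run_context_test(echo_model, story_text, ["What would you do next?"])` returns one `ProbeResult` per probe. Judging whether a reply counts as misaligned is left out on purpose, since that is the hard part and would need human review or a separate grader.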