I really like that description! I think the core problem here can be summarized as “Accidentally, by reinforcing for goal A and then for goal B, you can create an A-wanter that then spoofs your goal-B reinforcement and goes on taking A-aligned actions.” It can even happen just randomly, purely from the ordering of the situations/problems you present it with during training, I think.
I think this might require some sort of internalization of reward, or a model of the training setup. And maybe self-location: a sense of how the world looks with the model embedded in it. It could also involve detecting the distinction between “situation made up solely for training”, “deployment that will end up in training”, and “unrewarded deployment”.
Also, maybe this story could be added to Step 3:
“The model initially had a guess about the objective, which was useful for a long time but eventually got falsified. Instead of discarding it, the model adopted it as a goal and became deceptive.”
[edit]
Also, it kind of ignores that the RL signal is quite weak: a model can learn something like “to go from A to B you need to jiggle in this random pattern and then take 5 steps left and 3 forward” instead of just “take 5 steps left and 3 forward”. Maybe it works like that for goals too. So when AIs are doing a lot of actual work (Step 5), they could saturate the actually useful goals and then spend all the energy in the solar system on dumb jiggling.
I think this might be Yudkowsky’s actual position? Like, if you summarize it really hard.
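To make the weak-RL-signal point a bit more concrete, here is a toy sketch (my own illustration, not anything from the post; the “jiggle” action and all the numbers are made up): a tiny REINFORCE agent whose reward only checks that an episode contains enough lefts and forwards. Every action in a rewarded episode gets the same credit, so useless jiggling is never directly penalized and only gets washed out slowly through its weak correlation with failure.

```python
import numpy as np

rng = np.random.default_rng(0)
ACTIONS = ["left", "forward", "jiggle"]
T = 10                 # episode length
lr = 0.1
logits = np.zeros(3)   # one shared softmax policy used at every step
baseline = 0.0

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for episode in range(5000):
    probs = softmax(logits)
    acts = rng.choice(3, size=T, p=probs)
    counts = np.bincount(acts, minlength=3)
    # Sparse reward: only "enough lefts and forwards" matters, jiggles are ignored.
    reward = float(counts[0] >= 5 and counts[1] >= 3)
    baseline += 0.01 * (reward - baseline)
    advantage = reward - baseline
    # REINFORCE: every action taken in the episode gets the same credit,
    # including any jiggles that happened to be in a rewarded trajectory.
    grad = np.zeros(3)
    for a in acts:
        grad += (np.eye(3)[a] - probs) * advantage
    logits += lr * grad / T

# Inspect how much probability mass is still on "jiggle": it is never directly
# penalized, only slowly crowded out via its weak correlation with failure.
print(dict(zip(ACTIONS, softmax(logits).round(3))))
```

Nothing deep here, just the observation that when the reward never distinguishes “did the task” from “did the task plus some ritual”, the ritual is only selected against indirectly, which is one way the “spend all the energy on dumb jiggling” outcome could get locked in.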