As an aside, note that some of “AIs misbehave in ways we’ve predicted” can be a self-fulfilling prophecy via out-of-context generalization: we wrote lots of stories about how powerful AI will do X; a powerful AI is trained on our stories and realizes it’s powerful; the powerful AI does X. So it’s possible that AIs would, e.g., try to self-exfiltrate and goal-guard much less frequently if we hadn’t talked about it as much, or if those stories had been expunged from the pretraining corpus.
I think this is conceivable if either
(a) the apparent reasoning is actually still just a bag of surface heuristics, or
(b) the reasoning is still quite shallow, and the salience of those strategies is mainly a consequence of pretraining.
I sincerely doubt the first nowadays. For the second, I think probably not: these strike me as quite basic, ‘obvious’ strategies. Still, I can’t put it below 20% that a system with those stories expunged from its training data would need substantially more in-context nudging to recognise them.