Yeah, I can see how Scott’s quote can be interpreted that way. I think the people listed would usually be more careful with their words. But also, Scott isn’t necessarily claiming what you say he is. Everyone agrees that when you prompt a base model to act agentically, it can do so to some extent, and this can also happen during RLHF. Properties of this behaviour will be absorbed from the pretraining data, including moral systems. I don’t know how Scott is imagining this, but it needn’t be an inner homunculus with consistent goals.
I think the thread below with Daniel and Evan and Ryan is a good clarification of what people historically believed (which didn’t put 0 probability on ‘inner homunculi’, but also didn’t consider them close to being the most likely way that scheming consequentialist agents could be created, which is what Alex is referring to[1]). E.g. Ajeya is clear at the beginning of this post that the training setup she’s considering isn’t the same as pretraining on a myopic prediction objective.
[1] I.e. when he says: ‘I think there’s a ton of wasted/ungrounded work around “avoiding schemers”, talking about that as though we have strong reasons to expect such entities.’