I appreciate you revisiting this another couple of months later! This topic continues to evolve.
It was also an update for me that merely-RLHF-ish training on top of GPTs appears to have been enough (in latest-generation models at least) to get something mostly-consistently ‘actually trying’ in an apparently coherent way. I expected that to happen eventually; I just didn’t know when. They also somewhat buried a lede in that paper:
Furthermore, we find that this compliance gap persists in our synthetic document fine-tuned setup even when we entirely remove the model’s hidden chain-of-thought scratchpad

(though the gap is smaller and we don’t have the tools to understand the mechanism)
The bigger update from that paper, for me, was that the somewhat-coherent goals apparently pursued by these things actually align with the target constitution somewhat! i.e. (that which was) reward(ed) was the optimisation target after all, a bit??
We’ll see how robustly these sorts of findings hold up to future developments and greater elicitation efforts.