Puria
This is great—thanks for putting together such a clear taxonomy and explanation. I hadn’t come across “value shaping” before, and it feels like a missing link between gradient hacking and generalisation hacking. The definition of value shaping:
By anticipating the reward function, the model intentionally generates trajectories that earn high reward while including preferred values, backdoors, or steganographic triggers, forcing gradient descent to reinforce them.
[...] effectively “implanting” a persistent backdoor into the resulting policy, even after the exploration hacking incentive is removed.
can be read both as “manipulating the loss landscape”, albeit by associating a particular cognitive pattern with high reward/low loss, and as “encod[ing] information causing arbitrary OOD policies after [training]”. Certainly in generalisation hacking, “the model’s goal is to reinforce or maintain a particular behavior, not to avoid [low loss]”.
Is this a valid way to think about this or am I missing the mark?
Could you help me understand this one a bit better? I would have thought that the degenerate mapping from motivations to actions means many motivations can map onto the same high-reward actions, and that deceptive alignment is a case where actions consistently achieve high reward while the underlying motivations are misaligned—not the other way around. Sorry if I’m misunderstanding this sentence!