I’m helping build geodesicresearch.ai
Puria
Announcing Geodesic Research
Learned Chain-of-Thought Obfuscation Generalises to Unseen Tasks
Because motivations are so underdetermined by the reward signal, we may be able to shape them without running into the downstream problems of blocking access to high-rewards, such as deceptive alignment
Could you help me understand this one a bit better? I would have thought that the degenerate mapping of motivations to actions means that many motivations can be mapped onto the same high-reward actions, and deceptive alignment is a case where the model’s actions consistently achieve high reward but its underlying motivations are misaligned, not vice versa. Sorry if I’m misunderstanding this sentence!
This is great—thanks for putting such a clear taxonomy and explanation together. I’ve not come across ‘value shaping’ before, and it feels like a missing link between gradient hacking and generalisation hacking. The definition of value shaping:
By anticipating the reward function, the model intentionally generates trajectories that earn high reward while including preferred values, backdoors, or steganographic triggers, forcing gradient descent to reinforce them.
[...] effectively “implanting” a persistent backdoor into the resulting policy, even after the exploration hacking incentive is removed.
is amenable to being read as “manipulating the loss landscape”, albeit by associating a particular cognitive pattern with high reward/low loss, and also as “encod[ing] information causing arbitrary OOD policies after [training]”. Certainly in generalisation hacking it was “the model’s goal is to reinforce or maintain a particular behavior, not to avoid [low loss]”.
Is this a valid way to think about this or am I missing the mark?
Strong agree! Stress-testing alignment methods against heavy capabilities/RL training out in the open is the main focus of Geodesic’s research agenda. We’re focussed on the training methods you can use to improve an initialisation (midtraining through to warm-start reasoning) going into RL (benefitting greatly from NVIDIA’s open-sourced post-training stack here), but we’re hoping this will spur on more open science into data-heavy interventions at larger scales.