Cool work! I realize I’m a bit late to the party.
In your work and in “Training a Reward Hacker with Perfect Labels”, misalignment is probably being transferred by (1) the suspicious reasoning itself and (2) subliminal learning from the GH prompt itself (which could stand in for subliminal learning from anything a misaligned model produces). I think disentangling these is important, and your work partially does that by showing that stronger reasoning filters reduce misalignment, but not completely. Ideally we could ablate the reasoning in some setting (or have another LLM completely re-write the reasoning traces given only the prompt and output) and see whether there’s a subliminal transfer effect from just the output data. But some (or most) of the subliminal transfer would presumably come from the reasoning itself, so this seems tricky.
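
Concretely, the ablation I have in mind would look roughly like the sketch below: build one data variant with the reasoning stripped entirely and one where a separate model writes fresh reasoning from only the prompt and output, then fine-tune on each and compare. (This is just a sketch; the field names, the rewriter model, and the `<reasoning>` formatting are placeholders, not anything from the post.)

```python
# Rough sketch of the proposed ablation: two variants of the SFT data, one with
# the reasoning trace dropped and one with reasoning re-written by an unrelated
# model that only ever sees the prompt and the final output (never the original
# trace). Field names ("prompt", "reasoning", "output") are placeholders.
from openai import OpenAI

client = OpenAI()
REWRITER_MODEL = "gpt-4o-mini"  # placeholder: any model not involved in generating the data


def strip_reasoning(example: dict) -> dict:
    """Variant 1: keep only prompt -> output, no reasoning at all."""
    return {"prompt": example["prompt"], "completion": example["output"]}


def rewrite_reasoning(example: dict) -> dict:
    """Variant 2: fresh reasoning written from scratch, conditioned only on
    the prompt and the final output."""
    resp = client.chat.completions.create(
        model=REWRITER_MODEL,
        messages=[{
            "role": "user",
            "content": (
                "Write a clean, step-by-step reasoning trace that leads from "
                f"this prompt to this answer.\n\nPrompt:\n{example['prompt']}\n\n"
                f"Answer:\n{example['output']}"
            ),
        }],
    )
    fresh_reasoning = resp.choices[0].message.content
    return {
        "prompt": example["prompt"],
        "completion": f"<reasoning>{fresh_reasoning}</reasoning>\n{example['output']}",
    }

# Fine-tune separately on each variant and measure downstream misalignment:
# if transfer persists on the stripped data, something is riding on the outputs
# alone (subliminal-learning-style); if it only persists with the original
# traces, the reasoning itself is doing most of the work.
```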