If they were, they wouldn’t try to cheat at programming tasks, much less any of the other shenanigans they’ve been up to.
This seems wrong to me. We have results showing that reward hacking generalizes to broader misalignment, and that changing the prompt distribution via inoculation prompting significantly reduces reward hacking in deployment.
It seems like the models do generally follow the model spec, but specifically learn not to apply it to reward hacking on coding tasks because we reward that behavior during training.