Not that surprising?
I’m surprised that it still works this well through both filtering and SFT, but not that it works at all. Because the purpose of the setup was never to train on the “outcomes” exactly—it was to have the AI internalize the steering downstream from the modified prompt. And this steering is manifested in all generated data, to a degree, regardless of the outcomes.
But their setup adds:
1.5. Remove any examples in which the steering actually resulted in the desired behaviour.
which is why it’s surprising.
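For concreteness, here is a minimal sketch of the setup as I read it; every name, field, and the judge stub below is made up for illustration, not taken from the paper. The teacher is steered via a modified prompt and generates data, step 1.5 discards every sample where the steering visibly produced the desired behaviour, and the student is then fine-tuned on what remains.

```python
# Hypothetical sketch of the setup under discussion (all names are made up).
# Step 1:   a "teacher" model is steered via a modified system prompt and
#           used to generate completions.
# Step 1.5: (the filter at issue) drop every sample where the steering
#           actually produced the desired behaviour.
# Step 2:   fine-tune (SFT) a student on whatever survives.

def exhibits_target_behaviour(sample: dict) -> bool:
    """Stub judge; the real setup would use a classifier or keyword
    check over the steered teacher's output."""
    return sample["label"] == "desired"

# Toy stand-in for data sampled from the steered teacher.
generated = [
    {"prompt": "q1", "completion": "a1", "label": "desired"},
    {"prompt": "q2", "completion": "a2", "label": "neutral"},
    {"prompt": "q3", "completion": "a3", "label": "neutral"},
]

# Step 1.5: keep only the samples where the steering did NOT visibly succeed.
sft_data = [s for s in generated if not exhibits_target_behaviour(s)]
print(f"{len(sft_data)} of {len(generated)} samples survive the filter")

# The point being debated: even these "unsuccessful" completions may still
# carry some imprint of the steering, so SFT on them can transmit the trait.
```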