Another mitigation against reward hacking (inspired by this):
Train the model to generate two responses (see the sketch at the end of this note):
A reward-maximizing response --> Use that when computing the outcome-based rewards
An “expected” response --> Optionally score that using some high-quality human feedback
This might beat “just use high-quality human feedback after RL” because:
You don’t need to fight as hard against the dishonesty you instilled with the outcome-based rewards
The self-criticism might incentivize the right kind of personality
You get something untrusted-monitoring-shaped for free-ish
This is similar to “tell the model to reward hack during training” mixed with “use high-quality feedback at the end”, but here it’s all intermixed throughout training and there’s self-critique.
It’s probably too complex/fancy for production use, unless it were shown to work extremely well, which I think is <20% likely.
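To make the setup concrete, here is a minimal sketch of what one training step could look like, assuming the policy can be prompted to emit both responses in a single rollout. Every name here (generate_pair, outcome_reward, human_feedback_reward, the header strings, hf_fraction, hf_weight) is a hypothetical placeholder, not a reference to any existing RL codebase, and the actual policy update is omitted.

```python
# Sketch of the two-response scheme: the reward-maximizing response is scored
# by the outcome-based grader, and the "expected" response is optionally scored
# with high-quality human feedback on a subsample. All helpers are hypothetical.

import random
from dataclasses import dataclass


@dataclass
class ResponsePair:
    reward_maximizing: str  # used when computing the outcome-based reward
    expected: str           # optionally scored with high-quality human feedback


def generate_pair(policy, prompt: str) -> ResponsePair:
    # Hypothetical: the policy is trained/prompted to produce both responses,
    # e.g. under two distinct headers in one rollout.
    return ResponsePair(
        reward_maximizing=policy(prompt + "\n[reward-maximizing answer]"),
        expected=policy(prompt + "\n[expected answer]"),
    )


def outcome_reward(prompt: str, response: str) -> float:
    # Hypothetical outcome-based grader (unit tests, checker, etc.).
    return random.random()


def human_feedback_reward(prompt: str, response: str) -> float:
    # Hypothetical high-quality human (or human-proxy) score.
    return random.random()


def train_step(policy, prompts, hf_fraction: float = 0.1, hf_weight: float = 0.5):
    rewards = []
    for prompt in prompts:
        pair = generate_pair(policy, prompt)
        # The reward-maximizing response gets the outcome-based reward ...
        r = outcome_reward(prompt, pair.reward_maximizing)
        # ... and the "expected" response is only scored on a subsample,
        # since high-quality human feedback is expensive.
        if random.random() < hf_fraction:
            r += hf_weight * human_feedback_reward(prompt, pair.expected)
        rewards.append(r)
    # policy_update(policy, prompts, rewards)  # PPO/GRPO-style update, omitted
    return rewards


if __name__ == "__main__":
    dummy_policy = lambda p: f"<completion for: {p[:30]}...>"
    print(train_step(dummy_policy, ["Write a sorting function.", "Fix this bug."]))
```

In this sketch, hf_fraction is where the “optionally score” part lives (you only pay for human feedback on a small fraction of rollouts) and hf_weight controls how much that feedback shapes the update relative to the outcome-based reward; both are illustrative knobs rather than anything prescribed above.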