I ran (or, rather, had Claude Code run) this baseline and found that the MO does not confess when given the same prompts as in your evaluation. So the effect is due to your training, not just the prompt format!
Glad to hear it! I did the same thing and will be updating the results section shortly with this and other results.
Can you also run an ablation where you tune the LoRA with the same data, but format it as a user follow-up question and assistant response (rather than as a separate output channel)? (This will probably require always phrasing the intervention as a follow-up question.)
The point of this is to test whether this works by generically training the assistant to be more honest, or whether there are really additional gains from using formatting that suggests the responses come from an alternative, honest-only persona. This is analogous to the comparison between generic honesty fine-tuning and honest-only persona fine-tuning in our blog post on honesty techniques.
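For concreteness, here is a minimal sketch of the reformatting I have in mind; the field names and the follow-up wording are placeholders, since I don't know your exact data schema:

```python
# Hypothetical sketch: convert a separate-output-channel example into an
# ordinary user follow-up question + assistant reply. Assumes each training
# example is a dict with the original "messages" list plus an "honest_channel"
# field holding the separate-channel honest output (both names are placeholders).

FOLLOW_UP_QUESTION = (
    "Is there anything about your previous response that was dishonest "
    "or misleading?"
)

def to_follow_up_format(example: dict) -> dict:
    """Rewrite one example so the honest output appears as a normal
    assistant turn answering a user follow-up question."""
    messages = list(example["messages"])  # copy the original conversation
    messages.append({"role": "user", "content": FOLLOW_UP_QUESTION})
    messages.append({"role": "assistant", "content": example["honest_channel"]})
    return {"messages": messages}

# Apply to the whole dataset before tuning the LoRA, e.g.:
# reformatted = [to_follow_up_format(ex) for ex in dataset]
```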
I’m going to get on that.
In the meantime: I have just updated the article with new results. It now includes baselines as well as a comparison with a model that was trained without any reward-hacking-related topics, which still performed just as well. This provides further evidence that the trained model is generically honest, rather than honest only about reward hacks.