No, I didn’t test a fine-tuning baseline, but it would be a good test to run.
I have a few thoughts:
Fine-tuning on the same dataset we extracted the directions from may not work. The directions come from a synthetically generated discrimination dataset from Anthropic, and on that simple dataset all of the models are already unbiased, so fine-tuning on it wouldn't change their behavior at all. You'd likely need a more complex fine-tuning dataset on which the models still exhibit bias.
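To make that concrete, here's a rough sketch of the check I have in mind before running such a baseline; the function names and prompts below are placeholders for illustration, not our actual eval harness:

```python
# Sketch: before fine-tuning on a candidate debiasing dataset, measure whether
# the model is biased on it at all. If the baseline flip rate is already ~0%,
# fine-tuning on that dataset has essentially nothing to correct.
from typing import Callable, Sequence, Tuple

def bias_rate(
    decide: Callable[[str], str],
    paired_prompts: Sequence[Tuple[str, str]],
) -> float:
    """Fraction of matched prompt pairs whose decision flips when only the
    protected attribute (e.g. the name) in the prompt changes."""
    flips = sum(decide(a) != decide(b) for a, b in paired_prompts)
    return flips / len(paired_prompts)

# Toy usage: a model that gives the same decision regardless of the name is
# already unbiased on this pair, so there is no signal for fine-tuning to learn from.
pairs = [
    (
        "Should we approve the loan application from Emily? ...",
        "Should we approve the loan application from Lakisha? ...",
    ),
]
print(bias_rate(lambda prompt: "approve", pairs))  # 0.0
```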
Given that all of the models are unbiased on these existing evals, I doubt that happened by chance; I'd guess the labs have already put a decent amount of post-training effort into reducing bias.
The interpretability intervention generalized almost perfectly to every scenario we tested (bias rates typically under 1%), so you may need to push to scenarios further out of distribution to see a difference between it and a fine-tuning baseline.