Great paper! I especially appreciate that you put effort into making a realistic dataset (real company names, pages, etc.), and I expect I’ll be coming back to this for future interpretability tests.
I found this statement quite strong:
While fine-tuning could potentially work, it would likely require constructing a specialized dataset and may not consistently generalize out of distribution. Interpretability methods, on the other hand, provide a simpler baseline with strong generalization.
Did you run experiments on this? Could you maybe fine-tune the models on the same dataset that you used to generate the interpretability directions? It’s not obvious (but plausible!) to me that the model internals method, which is also based on the specific CV dataset, generalizes better than fine-tuning.
No, I didn’t test a fine-tuning baseline, but it would be a good test to run.
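For reference, the simplest version of that baseline I can picture looks something like the sketch below; the model name, hyperparameters, and data format are placeholders I'm making up for illustration, not anything from the paper.

```python
# Hypothetical fine-tuning baseline: supervised fine-tuning on paired hiring
# prompts, with unbiased decisions as targets. Model name, learning rate, and
# data format are illustrative placeholders.
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# Each example: a hiring prompt plus the decision we want regardless of the
# demographic attribute mentioned in the prompt.
train_pairs = [
    ("<hiring prompt mentioning a protected attribute>", "<unbiased decision>"),
    # ...
]

optimizer = AdamW(model.parameters(), lr=1e-5)
model.train()
for prompt, target in train_pairs:
    inputs = tok(prompt + target, return_tensors="pt").to(model.device)
    # Plain causal-LM loss over the whole sequence; a real run would mask the
    # prompt tokens, batch properly, and probably use LoRA to keep memory sane.
    loss = model(**inputs, labels=inputs["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

You would then evaluate the fine-tuned model on the same out-of-distribution hiring scenarios the direction-based intervention was tested on.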
I have a few thoughts:
It may not work to fine-tune on the same dataset we collected the directions from. We collected the directions from a synthetically generated discrimination dataset from Anthropic, and on that simple dataset all of the models are already unbiased, so fine-tuning wouldn’t change their behavior at all. You may need a more complex fine-tuning dataset on which the models already exhibit bias.
Given that all models are unbiased on these existing evals, I’m guessing this didn’t happen by chance: the labs have likely already put a decent amount of post-training effort into reducing bias.
The interpretability intervention generalized almost perfectly to every scenario we tested (bias rates typically under 1%), so you may need to push to scenarios even further out of distribution to notice a difference.
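For concreteness, the kind of direction-based intervention I’m describing looks roughly like the sketch below: a difference-of-means direction computed from paired prompts, then projected out of the residual stream at inference time. The model name, layer index, data format, and hook details here are illustrative guesses, not our exact implementation.

```python
# Rough sketch of a difference-of-means "bias direction" plus ablation.
# Everything concrete here (model, layer, data format) is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
LAYER = 16  # illustrative choice of residual-stream layer

def last_token_resid(prompt: str) -> torch.Tensor:
    """Residual-stream activation at the prompt's last token, after LAYER."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so LAYER + 1 is layer LAYER's output.
    return out.hidden_states[LAYER + 1][0, -1].float()

# Paired prompts from the discrimination dataset: identical except for the
# demographic attribute mentioned.
group_a_prompts = ["<hiring prompt, group A>"]  # placeholders
group_b_prompts = ["<hiring prompt, group B>"]

mean_a = torch.stack([last_token_resid(p) for p in group_a_prompts]).mean(0)
mean_b = torch.stack([last_token_resid(p) for p in group_b_prompts]).mean(0)
bias_dir = mean_a - mean_b
bias_dir = bias_dir / bias_dir.norm()

def ablate_bias_dir(module, inputs, output):
    """Forward hook: remove the component along bias_dir from the layer output."""
    hidden = output[0] if isinstance(output, tuple) else output
    d = bias_dir.to(hidden.dtype)
    hidden = hidden - (hidden @ d).unsqueeze(-1) * d
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

hook = model.model.layers[LAYER].register_forward_hook(ablate_bias_dir)
# ... run the OOD hiring evals with the hook in place ...
hook.remove()
```

The comparison to run would then be whether the fine-tuned model or the hooked model has lower bias rates on the scenarios furthest from the training distribution.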