No, I didn’t test a fine-tuning baseline, but it would be a good test to run.
I have a few thoughts:
Fine-tuning on the same dataset we extracted the directions from may not work. The directions come from a synthetically generated discrimination dataset from Anthropic, and on that simple dataset all of the models are already unbiased, so fine-tuning on it wouldn't change their behavior at all. You'd likely need a more complex fine-tuning dataset on which the models still exhibit bias.
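To make that concrete, here's a rough sketch of the check I have in mind before running such a baseline; the function names and prompts below are placeholders for illustration, not our actual eval harness:

```python
# Sketch: before fine-tuning on a candidate debiasing dataset, measure whether
# the model is biased on it at all. If the baseline flip rate is already ~0%,
# fine-tuning on that dataset has essentially nothing to correct.
from typing import Callable, Sequence, Tuple

def bias_rate(
    decide: Callable[[str], str],
    paired_prompts: Sequence[Tuple[str, str]],
) -> float:
    """Fraction of matched prompt pairs whose decision flips when only the
    protected attribute (e.g. the name) in the prompt changes."""
    flips = sum(decide(a) != decide(b) for a, b in paired_prompts)
    return flips / len(paired_prompts)

# Toy usage: a model that gives the same decision regardless of the name is
# already unbiased on this pair, so there is no signal for fine-tuning to learn from.
pairs = [
    (
        "Should we approve the loan application from Emily? ...",
        "Should we approve the loan application from Lakisha? ...",
    ),
]
print(bias_rate(lambda prompt: "approve", pairs))  # 0.0
```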
Given that all of the models are unbiased on these existing evals, I doubt that happened by chance; I'd guess the labs have already put a decent amount of post-training effort into reducing bias.
The interpretability intervention generalized almost perfectly to every scenario we tested (bias rates typically under 1%), so you may need to push to scenarios further out of distribution to see a difference between it and a fine-tuning baseline.