Curious what you mean by “if anything it is a little biased in favor of them”? My understanding was that a lot of models are biased against minorities due to biases in training data; but I could be wrong, this is all pretty new to me.
Your understanding is directionally correct. Many models do inherit biases from training data, and these can manifest negatively with respect to minorities. That’s well-documented.
However, post-training alignment and safety fine-tuning explicitly correct for those biases, sometimes to the point of overcompensation. The net result is that, in certain contexts, many models exhibit a counter-bias: they are unusually deferential or positive toward minorities, especially in normative or moral framing tasks. This shows up across a range of domains, e.g. [image generation biased towards minorities](https://www.theguardian.com/technology/2024/mar/08/we-definitely-messed-up-why-did-google-ai-tool-make-offensive-historical-images).