Your understanding is directionally correct. Many models do inherit biases from training data, and these can manifest negatively with respect to minorities. That’s well-documented.
However, post-training alignment and safety fine-tuning explicitly correct for those biases, sometimes to the point of overcompensation. The net result is that, in certain contexts, many models exhibit a counter-bias: they become unusually deferential or positive toward minorities, especially in normative or moral framing tasks. This shows up across many domains; see, for example, [image generation overcorrected toward minority representation](https://www.theguardian.com/technology/2024/mar/08/we-definitely-messed-up-why-did-google-ai-tool-make-offensive-historical-images).