Hmm, I don’t want to derail the comments on this post with a bunch of culture war things, but these two sentences in combination seemed to me to partially contradict each other:
When present, the bias is always against white and male candidates across all tested models and scenarios.
[...]
The problem (race and gender bias) is one that labs have spent a substantial amount of effort to address, which mimics realistic misalignment settings.
I agree that the labs have spent a substantial amount of effort to address this issue, but the current behavior seems in line with the aims of the labs? Most of the pressure comes from left-leaning academics or reporters, who I think are largely in favor of affirmative action. The world where the AI systems end up with a margin of safety to be biased against white male candidates, in order to reduce the likelihood that they ever look like they discriminate in the other direction (which would actually be at substantial risk of blowing up), while not talking explicitly about the reasoning itself since that would of course prove highly controversial, seems basically the ideal result from a company PR perspective.
I don’t currently think that is what’s going on, but I do think that, due to these dynamics, the cited benefit of this scenario for studying the faithfulness of CoT reasoning does not currently seem real to me. My guess is that companies do not have a strong incentive to change this behavior, and indeed I can’t immediately think of a behavior in this domain that the companies would prefer from a selfish perspective.
It’s a bit tricky because it’s hard to cite specific evidence here, so I’ll just state my beliefs without trying to substantiate them much:
1. By default—i.e. when training models without special attention to their political orientations—I think frontier LLMs end up being pretty woke.[1]
2. There are lots of ways you can try to modify models’ political orientations, not all of which map cleanly onto the left-right axis (e.g. trying to make models more generally open-minded). But if we were to naively map AI developers’ interventions onto a left-right axis, I would guess that the large majority of these interventions would push in the “stop being so leftist” direction.
This is because (a) some companies (especially Google) have gotten a lot of flak for their AIs being really woke, (b) empirically I think that publicly available LLMs are less woke than they would be without interventions, and (c) the current political climate really doesn’t look kindly on woke AI.[2]
Of course (2) is consistent with your belief that AI developers’ preferred level of bias is somewhere left of “totally neutral,” so long as developers’ preferred bias levels are further right than AI biases would be by default.
3. I’m not very confident about this, but I’d weakly guess that if AI developers were good at hitting alignment targets, they would actually prefer for their AIs to be politically neutral. Given that developers likely expect to miss what they’re aiming for, I think it’s plausible that they’d rather miss in a left-leaning direction, meaning that they’d overall aim for slightly left-leaning AIs. But again, I’m not confident here and I find it plausible (~10% likely) that these days they’d overall aim for slightly right-leaning AIs.
TBC it’s possible I’m totally wrong about all this stuff, and no one should cite this comment as giving strong evidence about what Anthropic does. In particular, I was surprised that Rohin Shah—who I expect would know better than I would about common AI developer practices—reacted “disagree” to the claim that “The problem (race and gender bias) is one that labs have spent a substantial amount of effort to address.”
[1] This is mostly an empirical observation, but I think a plausible mechanism might be something like: Educated people on the internet tend to be left-leaning, so when you train the model to write like an educated person, it also ends up inheriting left-leaning views.
[2] It’s not clear how directly influential these media outlets are, but it might be interesting to read the right-wing coverage of our paper (Daily Wire, Washington Examiner).
Educated people on the internet tend to be left-leaning, so when you train the model to write like an educated person, it also ends up inheriting left-leaning views
I think it’s not just this: the other traits promoted in post-training (e.g. harmlessness training) are probably also correlated with left-leaning content on the internet.
Just because the assumption was that the problem would be discrimination in favour of white men doesn’t mean that:
it’s not still meaningful that this seems to have generated an overcorrection (after all, it’s reasonable that the bias would have been present in the original dataset/base model, and it’s probably fine-tuning and later RLHF that pushed in the other direction), especially since it’s not explicitly brought up in the CoT; and
it’s not still illegal for employers to discriminate this way.
Corporations as collective entities don’t care about political ideology quite as much as they do about legal liability.
Just because the assumption was that the problem would be discrimination in favour of white men
I’m missing a connection somewhere—who was assuming this? You mean people at the AI companies evaluating the results? Other researchers? The general public?
I think it’s pretty clear that in the Bay Area tech world, recently enough to have a strong impact on AI model tuning, there has been a lot of “we’re doing the right and necessary thing by implementing what we believe/say is a specific counterbias”. When that is combined with attitudes of “not doing it enough constitutes active complicity in great evil” and “the legal system itself is presumed to produce untrustworthy results corrupted by the original bias”, and with no widely accepted countervailing tuning mechanism, it’s a recipe for cultural overshoot both upstream of and within AI training, and for the illegality to be pushed out of view. In particular, if what you meant is that the AI training process was the original source of the overshoot, and that it happened despite careful and unbiased attention to the legality, I think both are unlikely because of this other overwhelming environmental force.
One of my main sources for how that social sphere works is Patrick McKenzie, but most of what he posts on such things has been on the site formerly known as Twitter and is not terribly explicit in isolation nor easily searchable, unfortunately. This is the one I most easily found in my history, from mid-2023: https://x.com/patio11/status/1678235882481127427. It reads: “I have had a really surprising number of conversations over the years with people who have hiring authority in the United States and believe racial discrimination in employment is legal if locally popular.” While the text doesn’t state the direction of the discrimination explicitly, the post references someone else’s post about lawyers suddenly getting a lot of questions from their corporate clients about whether certain diversity policies are legal as a result of Students For Fair Admissions, which is an organization that challenges affirmative action admissions policies at schools.
(Meta: sorry for the flurry of edits in the first few minutes there! I didn’t quite order my posting and editing processes properly.)
I’m missing a connection somewhere—who was assuming this? You mean people at the AI companies evaluating the results? Other researchers? The general public?
The companies who tried to fight bias via fine-tuning their models. My point is, people expected that the natural bias of base pretrained models would be picked up from the vibes of the sum of human culture as sampled by the training set, and would therefore be pro-male and likely pro-white (which TBF is strengthened by the fact that a lot of that culture would also be older). I don’t think that expectation was incorrect.
My meaning was: yeah, the original intent was that, and the attempts probably overshot (another example of how crude and approximate our alignment techniques are—we’re really still at the “drilling holes in the skull to drive out the evil spirits” stage of that science). But essentially, I’m saying the result is still very meaningful; and also, since discrimination of either sign remains illegal in many countries whose employers are likely using these models, there is still commercial value in simply getting it right rather than catering purely to the appearance of being sufficiently progressive.