It’s a bit tricky because it’s hard to cite specific evidence here, so I’ll just state my beliefs without trying to substantiate them much:
(1) By default—i.e. when training models without special attention to their political orientations—I think frontier LLMs end up being pretty woke[1].
(2) There are lots of ways you can try to modify models’ political orientations, not all of which map cleanly onto the left-right axis (e.g. trying to make models more generally open-minded). But if we were to naively map AI developers’ interventions onto a left-right axis, I would guess that the large majority of these interventions would push in the “stop being so leftist” direction.
This is because (a) some companies (especially Google) have gotten a lot of flak for their AIs being really woke, (b) empirically I think that publicly available LLMs are less woke than they would be without interventions, and (c) the current political climate really doesn’t look kindly on woke AI.[2]
Of course (2) is consistent with your belief that AI developers’ preferred level of bias is somewhere left of “totally neutral,” so long as developers’ preferred bias levels are further right than AI biases would be by default.
I’m not very confident about this, but I’d weakly guess that if AI developers were good at hitting alignment targets, they would actually prefer for their AIs to be politically neutral. Given that developers likely expect to miss what they’re aiming for, I think it’s plausible that they’d rather miss in a left-leaning direction, meaning that they’d overall aim for slightly left-leaning AIs. But again, I’m not confident here and I find it plausible (~10% likely) that these days they’d overall aim for slightly right-leaning AIs.
To be clear, it’s possible I’m totally wrong about all of this, and no one should cite this comment as giving strong evidence about what Anthropic does. In particular, I was surprised that Rohin Shah—who I expect would know better than I would about common AI developer practices—reacted “disagree” to the claim that “The problem (race and gender bias) is one that labs have spent a substantial amount of effort to address.”
[1] This is mostly an empirical observation, but I think a plausible mechanism might be something like: Educated people on the internet tend to be left-leaning, so when you train the model to write like an educated person, it also ends up inheriting left-leaning views.
[2] It’s not clear how directly influential these media outlets are, but it might be interesting to read the right-wing coverage of our paper (Daily Wire, Washington Examiner).
“Educated people on the internet tend to be left-leaning, so when you train the model to write like an educated person, it also ends up inheriting left-leaning views”
I think it’s not just this; the other traits promoted in post-training (e.g. harmlessness training) are probably also correlated with left-leaning content on the internet.