Ultimately, as generally happens with these pseudo-alignment techniques, all you’re doing is tugging at the model’s jacket at a very superficial level—“reward these signifiers, not those other ones”. It’s not as if you’re giving it some wider ability to reason about the underlying issues and form a notion of what is ethically correct. Literally the only thing you can do with it is “turn the big dial that says Racism on it up and down like the Price is Right”, to quote a classic dril tweet.