I guess it can be argued that the anti-bias prompting carries its own biases. All those phrasings ultimately encourage better minority representation precisely because there are imbalances in the current distribution (irrespective of whether this is desirable or not).
It also feels like different kinds of bias are at play at larger and smaller parameter counts (even though the small models are distilled versions of the big ones). It would be interesting to know whether the bigger and smaller models rejected the same candidates.
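For what it's worth, checking this would be cheap if the per-candidate decisions were available. A minimal sketch, assuming you have the IDs of the candidates each model rejected (the function name, the IDs, and the data layout are all made up for illustration, not taken from the study):

```python
def rejection_overlap(rejected_large, rejected_small):
    """Compare the sets of candidate IDs rejected by two models."""
    large, small = set(rejected_large), set(rejected_small)
    both = large & small
    union = large | small
    return {
        "both": sorted(both),                 # rejected by both models
        "only_large": sorted(large - small),  # rejected only by the larger model
        "only_small": sorted(small - large),  # rejected only by the smaller model
        # Jaccard similarity: 1.0 means identical rejection sets, 0.0 means disjoint
        "jaccard": len(both) / len(union) if union else 1.0,
    }


# Hypothetical candidate IDs, purely for illustration
result = rejection_overlap(["c1", "c2", "c3"], ["c2", "c3", "c4"])
print(result)
```

A low Jaccard score would suggest the two sizes are biased in different ways rather than the small one just being a noisier copy of the big one.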
Ultimately, as generally happens with these pseudo-alignment techniques, all you’re doing is tugging at the model’s jacket at a very superficial level: “reward these signifiers, not these other ones”. It’s not like you’re giving it some wider ability to reason about the underlying issues and form a notion of what is ethically correct. Literally the only thing you can do with it is “turn the big dial that says Racism on it up and down like the Price is Right”, to quote a classic dril tweet.