Ultimately, as generally happens with these pseudo-alignment techniques, all you’re doing is tugging at the model’s jacket at a very superficial level—“reward these signifiers, not those other ones”. It’s not as if you’re giving it some wider ability to reason about the underlying issues and form a notion of what is ethically correct. Literally the only thing you can do with it is “turn the big dial that says Racism on it up and down like the Price is Right”, to quote a classic dril tweet.