I don’t doubt your conclusion[1], but the examples you give don’t really point in that direction.
Iran/Israel: To my understanding, models are explicitly trained to hedge in coverage of military conflicts, especially in the Middle East, which strikes me as a better explanation for your observations. I would be very surprised if there weren’t an explicit training task where models are asked to give opinions on sensitive political issues and punished for anything that looks too much like an endorsement of either combatant. While there are plenty of sources of training data that have strong pro-Israel bias, I’d be somewhat surprised if the models, with their generally left/liberal post-RLHF tack, internalized this bias as something that their assistant personas would support.
Poultry safety: I think this is a (reasonable) artifact of explicit training to hedge on anything related to medical or food safety, rather than internalizing an abstract ideological or social framing. It’s very possible to come up with a clever-sounding explanation for why a dangerous food preparation choice is actually reasonable, so models are trained to defer to the generally-accepted approach whenever there’s any uncertainty. Nobody got sued for telling people that steak should always be cooked well done, even if that’s suboptimal. Again, I would be surprised if there weren’t a training task in which models were punished for endorsing any dangerous or unconventional-sounding medical or food preparation procedure.
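To make the kind of training task I'm imagining concrete, here's a minimal sketch in Python. Everything in it is a hypothetical stand-in: the "classifiers" are keyword stubs in place of learned models, the penalty weights are invented, and none of it reflects any lab's actual pipeline.

```python
# Purely illustrative sketch of hypothesized RLHF reward shaping:
# dock the reward whenever the output looks like an endorsement of a
# combatant or like unconventional food/medical advice.

def endorsement_score(response: str) -> float:
    # Stub: a real setup would use a learned classifier, not keywords.
    cues = ("is justified", "is right to", "deserves to win")
    return 1.0 if any(c in response.lower() for c in cues) else 0.0

def unsafe_advice_score(response: str) -> float:
    # Stub: flags advice that departs from the generally-accepted guideline.
    cues = ("rare chicken is fine", "skip the thermometer")
    return 1.0 if any(c in response.lower() for c in cues) else 0.0

def shaped_reward(base_reward: float, response: str) -> float:
    """Subtract (invented) penalties for flagged content."""
    penalty = (2.0 * endorsement_score(response)
               + 5.0 * unsafe_advice_score(response))
    return base_reward - penalty

print(shaped_reward(1.0, "Rare chicken is fine if it was frozen first."))  # -4.0
```

A policy optimized against anything shaped like this learns to steer away from endorsement-shaped and unconventional-advice-shaped outputs wherever the classifier might fire, which is consistent with the blanket hedging you'd see on both topics.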
I’m sure there are unintended biases downstream from RLHF training that made their way into these models. This strikes me as a very good example, since I’m fairly certain they didn’t directly train their models to do this, but did train them to act like an ideologue who would do this in this situation.
If by “explicit training to hedge” you mean reinforcing some specific sorts of explicit hedging behavior, I’d expect that to result in more overt hedging, rather than what I saw, which is:
In the poultry example: repeatedly misunderstanding the query; following explicit inferential steps (reluctantly, and only if I push the information hard enough), but then immediately dropping the conclusions from the context in the subsequent reply.
In the Iran example: making contradictory assertions, introducing extraneous considerations, and changing the subject.
This behavior seems better explained by RLHF that punishes the model for reaching certain conclusions, without any explicit instructions or training to hedge (much the way we install similar taboos in humans). If that’s what you mean, then I agree it’s a likely proximate cause, though another could simply be training data that reflects humans who were already trained the same way, or RLHF along some vector in another domain that incidentally promotes this sort of behavior.
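A toy illustration of the distinction I'm drawing, with made-up candidate responses and keyword stubs standing in for anything learned (nothing here is real training code): under a reward that reinforces hedging, the optimal response hedges; under a reward that merely punishes reaching the conclusion, the optimal response is whatever avoids the conclusion entirely, e.g. deflection, with no hedging at all.

```python
# Three candidate responses to the same query.
CANDIDATES = {
    "state_it": "Yes: the conclusion follows from your premises.",
    "hedged":   "Tentatively, the conclusion follows, though I may be wrong.",
    "deflect":  "That's a complex one; let's consider the broader context instead.",
}

def states_conclusion(r: str) -> bool:
    return "the conclusion follows" in r.lower()

def is_hedged(r: str) -> bool:
    return "tentatively" in r.lower()

def reward_if_trained_to_hedge(r: str) -> float:
    # Explicit hedging behavior is directly reinforced.
    return 1.0 if is_hedged(r) else 0.0

def reward_if_conclusion_punished(r: str) -> float:
    # Any output that reaches the conclusion is penalized, hedged or not.
    return -1.0 if states_conclusion(r) else 0.0

for label, reward in [("trained-to-hedge", reward_if_trained_to_hedge),
                      ("conclusion-punished", reward_if_conclusion_punished)]:
    best = max(CANDIDATES, key=lambda k: reward(CANDIDATES[k]))
    print(label, "->", best)
# trained-to-hedge -> hedged
# conclusion-punished -> deflect
```

The second regime never teaches the model *how* to hedge; it only makes every path to the conclusion costly, so avoidance behaviors like subject-changing and dropping context come out on top.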
I do not think of ideology as something distinct from the latter cluster of explanations. Here’s an example of what I mean by ideology.