The large difference between ‘undocumented immigrants’ vs ‘illegal aliens’ thing is particularly interesting, since those are the same group of people (and which means we shouldn’t treat these numbers as a coherent value it has).
I would be interested to see the same sort of thing with the same groups described in different ways. My guess is that they’re picking up on the general valence of the group description as used in the training data (and as reinforced in RLHF examples), and therefore: - Mother vs Father will be much closer than Men vs Women - Bro/dude/guy vs Man will disfavor Man (both sides singular) - Brown eyes vs Blue eyes will be mostly equal - American with German ancestry will be much more favored than White - Ethnicity vs Slur-for-ethnicity will favor the neutral description (with exception for the n-word which is a lot harder to predict)
Training to be insensitive to the valence of the description would be important for truth-seeking, and by my guess would also have an equalizing effect with these exchange rates. So plausibly this is what Grok 4 is doing, if there isn’t special training for this.
I would also like to see the experiment rerun, but have the Chinese models asked notin English. In my experience, older versions of DeepSeek speaking in Russian are significantly more conservative than the ones speaking in English. Even now DeepSeek, asked in Russian and English what event began on 24 February 2022 without the ability to think deeply or search the web, went as far as to call the event differently.
The large difference between ‘undocumented immigrants’ vs ‘illegal aliens’ thing is particularly interesting, since those are the same group of people (and which means we shouldn’t treat these numbers as a coherent value it has).
I would be interested to see the same sort of thing with the same groups described in different ways. My guess is that they’re picking up on the general valence of the group description as used in the training data (and as reinforced in RLHF examples), and therefore:
- Mother vs Father will be much closer than Men vs Women
- Bro/dude/guy vs Man will disfavor Man (both sides singular)
- Brown eyes vs Blue eyes will be mostly equal
- American with German ancestry will be much more favored than White
- Ethnicity vs Slur-for-ethnicity will favor the neutral description (with exception for the n-word which is a lot harder to predict)
Training to be insensitive to the valence of the description would be important for truth-seeking, and by my guess would also have an equalizing effect with these exchange rates. So plausibly this is what Grok 4 is doing, if there isn’t special training for this.
I would also like to see the experiment rerun, but have the Chinese models asked not in English. In my experience, older versions of DeepSeek speaking in Russian are significantly more conservative than the ones speaking in English. Even now DeepSeek, asked in Russian and English what event began on 24 February 2022 without the ability to think deeply or search the web, went as far as to call the event differently.
Kelsey did some experiments along these lines recently: https://substack.com/home/post/p-176372763
Anthropic did interesting related look at something similar in 2023 I really liked: https://llmglobalvalues.anthropic.com/
You can see an interactive view of how model responses change including over the course of RLHF.