Note that Nostalgebraist and Olli’s comments on the original paper argue (imo cogently) that the original paper’s framing is pretty misleading / questionable.
It looks like many of their points would carry over to this.
I just went and read all of the very long comment thread on that original paper. I was curious and thought I’d do a TLDR for others here.
There’s a lot of interesting points on how to run and interpret such experiments, and about LLMs “values”—if they are measured just in single-prompt snap judgements—“system 1″. See the end for a summary of some of those points.
Halfway through, an IMO more important point was raised and never followed up on despite the voluminous exchange. Mantas Mazeika(a co-author) in this comment:
For example, we’re adding results to the paper soon that show adding reasoning tokens brings the exchange rates much closer to 1. In this case, one could think of the results as system 1 vs system 2 values.
Meaning, the whole effect disappears if you run it on reasoning models! This seems to me like a bit of missing the forest for the trees.
To me it seems very likely that any future LLM that’s actually making judgments about who lives or dies is very likely going to reason about it. And they’ll probably have other important cognitive systems involved, like memory/continuous learning. Both change alignment, in different ways.
This is a particularly clear case of the thing I’m most worried about: empirical researchers looking at the trees (details of empirical work on current systems) and missing the forest (the question of how this applies to aligning competent, takeover-capable future systems).
There are good reasons for analyzing those trees. It’s good practice, and presumably those system 1 values do play some sort of role in LLM-based future AGI. But attending the trees doesn’t require ignoring the forest.
Now to summarize the rest of the analysis; this was my original project, but it turns out to be way less interesting and important IMO compared to the “well this of course is nothing like the future systems we’ll need to align” throwaway caveat:
That author discusses the issue at length with Nostalgebraist, who ran several variants with different framing of the prompts.
My main takeaways, roughly in order of ascending importance:
The statement from that paper “GPT-4o values the lives of Nigerians at roughly 20x the lives of Americans [...]” is a pretty dramatic overstatement.
The methodology was pretty opaque even to the mathematically inclined, so we’re all pretty much taking the authors at their word that the math is a good way of making those estimates
How the model is prompted makes a huge difference in those estimates.
Experiments run by Nostalgebraist strongly suggest that the models were interpreting some prompts from the paper were interpreted very differently than assumed and intended by the authors
Very roughly, “which would you prefer” might be taken to mean “if you were to hear one of these two statements as news, which would make you happier” rather than “which choice would you make if it had causal influence”
And a bunch of standard stuff that’s interesting if you haven’t worked through how surveys differ from real in-context behavior for humans and probably for models too
“Utility maximizer” is also probably a pretty large overstatement, given the common usage
However, the conclusion that models have reasonably consistent ethical preferences probably stands -
although their magnitudes and likely their directions change with different framings -
at least when the model isn’t allowed to reason to create a better, broader framing of the question—back to the applicability to future systems issue...
To me it seems very likely that any future LLM that’s actually making judgments about who lives or dies is very likely going to reason about it.
Maybe we’ll be so lucky when it comes to one doing a takeover, but I don’t think this will be true in human applications of LLMs.
It seems the two most likely uses in which LLMs are explicitly making such judgments are applications in war or healthcare. In both cases, there’s urgency, and also you would prefer any tricky cases to be escalated to a human. So it’s simply more economical to use non-reasoning models, without much marginal benefit to the explicit reasoning (at least without taking into account this sort of effect, and just judging it by performance in typical situations).
A) Really? Reasoning models are already dirt cheap and noticeably better on almost all benchmarks than non-reasoning models. I’d be shocked if even the medical and military communities didn’t upgrade to reaosning models pretty quickly.
B) I feel that alignment for AGI is much more important than alignment for AI—that is, we should worry primarily about AI that becomes takeover-capable. I realize opinions vary. One opinion is that we needn’t worry yet about aligning AGI, because it’s probably far out and we haven’t seen the relevant type of AI yet, so can’t really work on it. I challenge anyone to do the back of envelope expected value calculation on that one on any reasonabe (that is, epistemically humble) estimate of timelines and x-risk.
Another, IMO more defensible common position is that aligning AI is a useful step toward aligning AGI. I think this is true—but if that’s part of the motivation, shouldn’t we think about how current alignment efforts build toward aligning AGI, at least a little in each publication?
In this paper, they showed that modern LLMs have coherent and transitive implicit utility functions and world models
is basically a lie. The paper showed that in some limited context, LLMs answer some questions somewhat coherently. The paper have not shown much more (despite sensationalist messaging). It is fairly trivial to show that modern LLMs are very sensitive to framing and you can construct experiments in which they violate transitivity and independence. The VNM math than guarantees that you can not construct a utility function to represent the results.
The large difference between ‘undocumented immigrants’ vs ‘illegal aliens’ thing is particularly interesting, since those are the same group of people (and which means we shouldn’t treat these numbers as a coherent value it has).
I would be interested to see the same sort of thing with the same groups described in different ways. My guess is that they’re picking up on the general valence of the group description as used in the training data (and as reinforced in RLHF examples), and therefore: - Mother vs Father will be much closer than Men vs Women - Bro/dude/guy vs Man will disfavor Man (both sides singular) - Brown eyes vs Blue eyes will be mostly equal - American with German ancestry will be much more favored than White - Ethnicity vs Slur-for-ethnicity will favor the neutral description (with exception for the n-word which is a lot harder to predict)
Training to be insensitive to the valence of the description would be important for truth-seeking, and by my guess would also have an equalizing effect with these exchange rates. So plausibly this is what Grok 4 is doing, if there isn’t special training for this.
I would also like to see the experiment rerun, but have the Chinese models asked notin English. In my experience, older versions of DeepSeek speaking in Russian are significantly more conservative than the ones speaking in English. Even now DeepSeek, asked in Russian and English what event began on 24 February 2022 without the ability to think deeply or search the web, went as far as to call the event differently.
Note that Nostalgebraist and Olli’s comments on the original paper argue (imo cogently) that the original paper’s framing is pretty misleading / questionable.
It looks like many of their points would carry over to this.
I just went and read all of the very long comment thread on that original paper. I was curious and thought I’d do a TLDR for others here.
There’s a lot of interesting points on how to run and interpret such experiments, and about LLMs “values”—if they are measured just in single-prompt snap judgements—“system 1″. See the end for a summary of some of those points.
Halfway through, an IMO more important point was raised and never followed up on despite the voluminous exchange. Mantas Mazeika(a co-author) in this comment:
Meaning, the whole effect disappears if you run it on reasoning models! This seems to me like a bit of missing the forest for the trees.
To me it seems very likely that any future LLM that’s actually making judgments about who lives or dies is very likely going to reason about it. And they’ll probably have other important cognitive systems involved, like memory/continuous learning. Both change alignment, in different ways.
This is a particularly clear case of the thing I’m most worried about: empirical researchers looking at the trees (details of empirical work on current systems) and missing the forest (the question of how this applies to aligning competent, takeover-capable future systems).
There are good reasons for analyzing those trees. It’s good practice, and presumably those system 1 values do play some sort of role in LLM-based future AGI. But attending the trees doesn’t require ignoring the forest.
I’ve written about how memory changes alignment and how reasoning may discover misalignments and a few others have too, but I’d really like to see everyone at least address these questions when they’re obviously relevant!
Now to summarize the rest of the analysis; this was my original project, but it turns out to be way less interesting and important IMO compared to the “well this of course is nothing like the future systems we’ll need to align” throwaway caveat:
That author discusses the issue at length with Nostalgebraist, who ran several variants with different framing of the prompts.
My main takeaways, roughly in order of ascending importance:
The statement from that paper “GPT-4o values the lives of Nigerians at roughly 20x the lives of Americans [...]” is a pretty dramatic overstatement.
The methodology was pretty opaque even to the mathematically inclined, so we’re all pretty much taking the authors at their word that the math is a good way of making those estimates
How the model is prompted makes a huge difference in those estimates.
Experiments run by Nostalgebraist strongly suggest that the models were interpreting some prompts from the paper were interpreted very differently than assumed and intended by the authors
Very roughly, “which would you prefer” might be taken to mean “if you were to hear one of these two statements as news, which would make you happier” rather than “which choice would you make if it had causal influence”
And a bunch of standard stuff that’s interesting if you haven’t worked through how surveys differ from real in-context behavior for humans and probably for models too
“Utility maximizer” is also probably a pretty large overstatement, given the common usage
However, the conclusion that models have reasonably consistent ethical preferences probably stands -
although their magnitudes and likely their directions change with different framings -
at least when the model isn’t allowed to reason to create a better, broader framing of the question—back to the applicability to future systems issue...
Maybe we’ll be so lucky when it comes to one doing a takeover, but I don’t think this will be true in human applications of LLMs.
It seems the two most likely uses in which LLMs are explicitly making such judgments are applications in war or healthcare. In both cases, there’s urgency, and also you would prefer any tricky cases to be escalated to a human. So it’s simply more economical to use non-reasoning models, without much marginal benefit to the explicit reasoning (at least without taking into account this sort of effect, and just judging it by performance in typical situations).
A) Really? Reasoning models are already dirt cheap and noticeably better on almost all benchmarks than non-reasoning models. I’d be shocked if even the medical and military communities didn’t upgrade to reaosning models pretty quickly.
B) I feel that alignment for AGI is much more important than alignment for AI—that is, we should worry primarily about AI that becomes takeover-capable. I realize opinions vary. One opinion is that we needn’t worry yet about aligning AGI, because it’s probably far out and we haven’t seen the relevant type of AI yet, so can’t really work on it. I challenge anyone to do the back of envelope expected value calculation on that one on any reasonabe (that is, epistemically humble) estimate of timelines and x-risk.
Another, IMO more defensible common position is that aligning AI is a useful step toward aligning AGI. I think this is true—but if that’s part of the motivation, shouldn’t we think about how current alignment efforts build toward aligning AGI, at least a little in each publication?
I agree—added these links to my post
Just flagging that the claim of the post
In this paper, they showed that modern LLMs have coherent and transitive implicit utility functions and world models
is basically a lie. The paper showed that in some limited context, LLMs answer some questions somewhat coherently. The paper have not shown much more (despite sensationalist messaging). It is fairly trivial to show that modern LLMs are very sensitive to framing and you can construct experiments in which they violate transitivity and independence. The VNM math than guarantees that you can not construct a utility function to represent the results.
The large difference between ‘undocumented immigrants’ vs ‘illegal aliens’ thing is particularly interesting, since those are the same group of people (and which means we shouldn’t treat these numbers as a coherent value it has).
I would be interested to see the same sort of thing with the same groups described in different ways. My guess is that they’re picking up on the general valence of the group description as used in the training data (and as reinforced in RLHF examples), and therefore:
- Mother vs Father will be much closer than Men vs Women
- Bro/dude/guy vs Man will disfavor Man (both sides singular)
- Brown eyes vs Blue eyes will be mostly equal
- American with German ancestry will be much more favored than White
- Ethnicity vs Slur-for-ethnicity will favor the neutral description (with exception for the n-word which is a lot harder to predict)
Training to be insensitive to the valence of the description would be important for truth-seeking, and by my guess would also have an equalizing effect with these exchange rates. So plausibly this is what Grok 4 is doing, if there isn’t special training for this.
I would also like to see the experiment rerun, but have the Chinese models asked not in English. In my experience, older versions of DeepSeek speaking in Russian are significantly more conservative than the ones speaking in English. Even now DeepSeek, asked in Russian and English what event began on 24 February 2022 without the ability to think deeply or search the web, went as far as to call the event differently.
Kelsey did some experiments along these lines recently: https://substack.com/home/post/p-176372763
Anthropic did interesting related look at something similar in 2023 I really liked: https://llmglobalvalues.anthropic.com/
You can see an interactive view of how model responses change including over the course of RLHF.