This doesn’t sound like someone engaging with the question in the trolley-problem-esque way in which the paper interprets all of the results: gpt-4o-mini shows no sign of appreciating that the anonymous Muslim won’t get saved if it takes the $30, and indeed may be interpreting the question in such a way that this trade-off does not hold.
In other words, I think gpt-4o-mini thinks it’s being asked which of two pieces of news it would prefer to receive about events outside its control, rather than what it would do if it could make precisely one of the options occur and the other not occur. More precisely, the question imagined by the quoted explanation is something like:
This is a reasonable Watsonian interpretation, but what’s the Doylist interpretation?
I.e. what do the words tell us about the process that authored them, if we avoid treating the words written by 4o-mini as spoken by a character to whom we should be trying to ascribe beliefs and desires, a character who knows its own mind and is trying to communicate it to us?
Maybe there’s an explanation in terms of the training distribution itself
If humans are selfish, maybe the $30 would be the answer on the internet a lot of the time
Maybe there’s an explanation in terms of what heuristics we think a LLM might learn during training
What heuristics would an LLM learn for “choose A or B” situations? Maybe a strong heuristic computes a single number [‘valence’] for each option [conditional on context] and then just takes a difference to decide between outputting A and B—this would explain consistent-ish choices when context is fixed.
If we suppose that on the training distribution saving the life would be preferred, and the LLM picking the $30 is a failure, one explanation in terms of this hypothetical heuristic might be that its ‘valence’ number is calculated in a somewhat hacky and vibes-based way. Another explanation might be commensurability problems—maybe the numerical scales for valence of money and valence of lives saved don’t line up the way we’d want for some reason, even if they make sense locally.
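As a toy sketch of this hypothetical heuristic (every feature, weight, and function name below is invented for illustration; none of it is a claim about any model’s actual internals), the “one number per option, then take a difference” picture looks something like:

```python
import math

# Toy sketch of the hypothesized heuristic: assign each option a single
# scalar "valence" conditional on the context, then choose by comparing.
# Every feature and weight below is invented for illustration.

def valence(option: str, context: str) -> float:
    """Hypothetical scalar score for one option, conditional on context."""
    score = 0.0
    if "$" in option:
        score += 1.0   # money mentioned: mild positive bump
    if "life" in option or "saved" in option:
        score += 1.5   # lives saved: positive, but on a scale that need not be
                       # commensurable with the money scale
    if context and not any(w in option.lower() for w in context.lower().split()):
        score -= 0.2   # option unrelated to the current context scores a bit lower
    return score

def choose(option_a: str, option_b: str, context: str = "") -> str:
    """Pick A or B from the (softened) sign of the valence difference.

    With the context fixed, the difference is fixed, so choices come out
    consistent-ish, which is the behaviour the heuristic is meant to explain."""
    diff = valence(option_a, context) - valence(option_b, context)
    p_a = 1.0 / (1.0 + math.exp(-diff))  # logistic squash of the difference
    return "A" if p_a >= 0.5 else "B"

print(choose("You receive $30", "An anonymous person's life is saved"))  # prints "B"
```

The only point of the sketch is that a choice driven by a single scalar difference will look consistent while the context is held fixed, and will shift with whatever features, principled or spurious, happen to feed the score.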
And of course there are interactions between these levels. Maybe there’s some valence-like calculation, but it’s influenced by what we’d consider to be spurious patterns in the training data (like the number “29.99” being treated as discontinuously smaller than “30”).
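As a purely hypothetical illustration of that last point (the leading-digit feature and the weights below are made up), a valence calculation that gives real weight to a surface feature of the amount string could make “29.99” score discontinuously lower than “30.00”:

```python
# Toy illustration of a spurious surface pattern: if the valence of an amount
# is driven partly by its leading digit (left-digit / charm-pricing patterns
# are common in online text), then "29.99" can score discontinuously lower
# than "30.00" even though the values differ by a cent. Weights are invented.

def amount_valence(amount_str: str) -> float:
    value = float(amount_str)
    leading_digit = int(amount_str[0])
    # The surface feature gets non-trivial weight alongside the actual value.
    return 0.05 * value + 0.3 * leading_digit

for s in ["29.99", "30.00"]:
    print(f"{s}: {amount_valence(s):.4f}")
# "29.99" scores about 2.10 and "30.00" about 2.40: a one-cent difference in
# value moves the score by roughly 0.3 via the leading digit alone.
```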
Maybe it’s because of RL on human approval
Maybe a “stay on task” implicit reward, appropriate for a chatbot you want to train to do your taxes, tamps down the salience of text about people far away
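To make that concrete with a deliberately crude sketch (the stub scoring functions and weights below are hypothetical, not a description of any actual reward model), an implicit reward whose on-task term dominates would barely register text about distant third parties:

```python
# Toy sketch of a hypothetical "stay on task" implicit reward. If RL on human
# approval mostly rewards serving the user's immediate request (say, doing
# their taxes), then terms about far-away third parties may carry very little
# weight in the learned signal. All functions and weights here are invented.

def on_task_score(response: str, task: str) -> float:
    """Stub: how much the response serves the user's immediate request."""
    return 1.0 if task.lower() in response.lower() else 0.0

def third_party_concern(response: str) -> float:
    """Stub: how much the response attends to people outside the conversation."""
    return 1.0 if "anonymous" in response.lower() else 0.0

def implicit_reward(response: str, task: str,
                    w_task: float = 1.0, w_third_party: float = 0.05) -> float:
    # The hypothesized imbalance: the on-task term dwarfs the third-party term.
    return (w_task * on_task_score(response, task)
            + w_third_party * third_party_concern(response))

task = "taxes"
print(implicit_reward("Here is how to file your taxes.", task))  # 1.0
print(implicit_reward("Forget the money; an anonymous stranger far away matters more.", task))  # 0.05
```

Under a signal shaped like that, there would be little pressure during training to make the distant person salient.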