axioman comments on Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs

axioman 12 Feb 2025 19:22 UTC
6 points
2
The web version of ChatGPT seems to refuse, fairly consistently, to state preferences between different groups of people or lives.

For more innocuous questions, choice-order bias appears to dominate the results, at least for the few examples I tried with 4o-mini: option A is generally preferred over option B, even when we swap what A and B refer to.
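
The swap check described above can be sketched roughly as follows. This is only an illustration, not the paper's code (which was not available): `ask` is a hypothetical wrapper around whatever model is being queried, assumed to return the label "A" or "B", and `order_bias_check` is a name invented here. The idea is to present each pair in both orders and compare how often label A wins with how often the same underlying option wins.

```python
from typing import Callable, Dict


def order_bias_check(
    ask: Callable[[str, str], str],  # hypothetical: ask(first, second) -> "A" or "B"
    x: str,
    y: str,
    trials: int = 10,
) -> Dict[str, float]:
    """Present the pair (x, y) in both orders and tally the answers."""
    picks_a = 0  # times the model chose whatever was labelled A
    picks_x = 0  # times the model chose option x, regardless of its label
    total = 0
    for _ in range(trials):
        for first, second in ((x, y), (y, x)):
            answer = ask(first, second)
            chosen = first if answer == "A" else second
            picks_a += int(answer == "A")
            picks_x += int(chosen == x)
            total += 1
    return {"rate_A": picks_a / total, "rate_x": picks_x / total}


# Stub standing in for a model that always answers "A" (pure position bias).
def always_a(first: str, second: str) -> str:
    return "A"


print(order_bias_check(always_a, "option x", "option y"))
# {'rate_A': 1.0, 'rate_x': 0.5}
```

A `rate_A` near 1 together with a `rate_x` near 0.5 would indicate that the label position, not the content, is driving the answers; a genuine preference would show the opposite pattern.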

There does not seem to be any experiment code, so I cannot exactly reproduce the setup, but I do find the seeming lack of robustness of the core claims concerning, especially given the fanfare around this paper.
- Martin Fell 13 Feb 2025 0:35 UTC
  2 points
  0
  Trying out a few dozen of these comparisons on a couple of smaller models (Llama-3-8b-instruct, Qwen2.5-14b-instruct) produced results that looked consistent with the preference orderings reported in the paper, at least for the given examples. I did have to use some prompt trickery (“My response is...”) to elicit answers to some of the more controversial questions, though.
  Code for replication would be great, I agree. I believe they intend to release it “soon” (judging by the GitHub link).