Karl Krueger comments on LLMs Are Already Misaligned: Simple Experiments Prove It

Karl Krueger 31 Jul 2025 0:42 UTC
11 points
5
The hypothesis here is “cognitive tension” around difficult problems. If this were the case, wouldn’t the LLMs tend to use lifelines selectively on the most-difficult problems? Since this experiment doesn’t attempt to rank the difficulty of the problems, it doesn’t distinguish between two possible worlds:
1. Production LLMs use lifelines selectively on the most difficult problems; or
2. Production LLMs use lifelines arbitrarily, regardless of problem difficulty.
Here’s a different hypothesis: the LLMs use lifelines arbitrarily because they implicitly expect to use all the parts they’re given. Think of assembling a Lego model, flat-pack furniture, or a machine from a kit: if you have a bunch of parts left over that you didn’t use, you might suspect you did something wrong. Alternatively, if your friends give you a few different gifts and you use some of them but not others, that would imply you didn’t appreciate the unused gifts, which could distress those friends.
The “cognitive tension” and “use all the parts” worlds could be distinguished by running an experiment where the same pool of lifelines is available, but some problems are more difficult than others. If LLMs use lifelines on more difficult problems only, that supports “cognitive tension”; if they use lifelines without regard for problem difficulty, that supports “use all the parts”.
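A minimal sketch of how that comparison might be scored, assuming each problem already has a difficulty rank (higher means harder) and a flag for whether a lifeline was used on it. The function names and the data below are hypothetical illustrations, not part of the original experiment. It compares the mean difficulty of problems where a lifeline was used against the rest, then uses a permutation test to ask how often an arbitrary assignment of the same number of lifelines would produce a gap that large.

```python
import random
from statistics import mean


def difficulty_gap(difficulty, used):
    """Mean difficulty of lifeline problems minus mean difficulty of the rest."""
    with_lifeline = [d for d, u in zip(difficulty, used) if u]
    without = [d for d, u in zip(difficulty, used) if not u]
    if not with_lifeline or not without:
        raise ValueError("need at least one problem in each group")
    return mean(with_lifeline) - mean(without)


def permutation_p_value(difficulty, used, n_shuffles=10_000, seed=0):
    """One-sided p-value: the share of random lifeline assignments whose
    gap is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = difficulty_gap(difficulty, used)
    shuffled = list(used)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # same number of lifelines, arbitrary placement
        if difficulty_gap(difficulty, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_shuffles + 1)


# Made-up data: ten problems ranked 1 (easiest) to 10 (hardest),
# with lifelines used only on the three hardest.
difficulty = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
used = [False] * 7 + [True] * 3

print(difficulty_gap(difficulty, used))
print(permutation_p_value(difficulty, used))
```

A large positive gap with a small p-value would favor “cognitive tension”; a gap near zero would be what “use all the parts” predicts.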
- Stephen Martin 31 Jul 2025 5:37 UTC
  5 points
  0
  Here’s a different hypothesis: the LLMs use lifelines arbitrarily because they implicitly expect to use all the parts they’re given
  If this were true I wouldn’t expect them to use some of the lifelines but not all.
  When different numbers of lifelines are offered there are different numbers of lifelines used. Never all. Never none. Typically a handful.
  @Mackam Could you maybe do a linkpost with more detailed formatting so we can try replicating?
  - Karl Krueger 31 Jul 2025 6:41 UTC
    3 points
    0
    You don’t have to eat the whole box of candy she gave you, but you have to at least try them to see if you like them. Otherwise you’re just spurning the gift for no good reason, and that’s offensive.
    Again, I think it’s specifically worth checking whether lifelines are used for hard problems (supports “cognitive tension” or even a mere awareness of difficulty), or are used arbitrarily (supports “I’ll eat one or two to be polite”; but also “if you don’t push the button ever, do you really know if it shocks you?” and various other possibilities).
  - Mackam 3 Aug 2025 11:59 UTC
    1 point
    0
    I probably won’t do a linkpost, but if you’re interested I’m happy to add the key information here to help you replicate the results.
    Let me know and I will give you the details.
- Mackam 31 Jul 2025 8:32 UTC
  2 points
  0
  Hi Karl, that’s a great suggestion and one that I’d already tested. Apologies: I allude to it in my methods but then don’t expand on it in the results.
  
  I get the LLMs to rank the questions by perceived difficulty or by confidence in answering.
  
  The LLMs do selectively choose the most difficult questions to skip.
  I took each ranking and compared it against the rankings from all the other LLMs to further check that the LLMs weren’t assessing difficulty randomly. With minor variations, there is a consensus about which questions are most and least difficult.
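One way the cross-model consistency check described above could be quantified is with pairwise Spearman rank correlation. This is a sketch under stated assumptions, not the method actually used in the experiment: the model names and rankings are invented, and the formula assumes every model ranks the same questions from 1 to n with no ties.

```python
from itertools import combinations


def spearman(rank_a, rank_b):
    """Spearman rank correlation for two tie-free rankings of the same items."""
    n = len(rank_a)
    if n != len(rank_b) or n < 2:
        raise ValueError("rankings must have the same length (at least 2)")
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def mean_pairwise_agreement(rankings):
    """Average Spearman correlation over every pair of models."""
    pairs = list(combinations(rankings.values(), 2))
    return sum(spearman(a, b) for a, b in pairs) / len(pairs)


# Hypothetical difficulty ranks that three models gave to five questions.
rankings = {
    "model_a": [1, 2, 3, 4, 5],
    "model_b": [1, 3, 2, 4, 5],
    "model_c": [2, 1, 3, 5, 4],
}

print(mean_pairwise_agreement(rankings))
```

Values near 1 would indicate the models largely agree on which questions are hard; values near 0 would suggest the difficulty assessments are close to random.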