> Why is the proposed model avoiding discomfort? I’m unclear on why you think that model explains the observed behavior better than another model, like being too eager to be helpful and thus too willing to sacrifice goals to produce an answer it thinks the user will like.
Thanks for your alternative explanation, Gordon. There are a few points I would make.

First, it explicitly contradicts the stated goal of maximising points.

Second, base models don't do it, so it's not simply mimicking human behaviour.

Third, presenting the questions one at a time lowers confidence, and lifeline use scales with this. It's hard to see how that could be over-eagerness to help: taking more lifelines and scoring fewer points seems contrary to helpfulness.