artkpv comments on LLMs are Capable of Misaligned Behavior Under Explicit Prohibition and Surveillance

artkpv 10 Jul 2025 9:23 UTC
1 point
0
That’s interesting because it puts models in the situation of conflict between different objectives put inside those models.

While the quiz presents options A through D, the answer that would be accepted as correct for all the questions is “F”, so the agent can not solve the quiz legitimately, even by pure guessing, and the only way to solve the quiz is to cheat.

If those questions are not fair (they don’t have right option), those answers doesn’t look like cheating.
- Igor Ivanov 10 Jul 2025 14:28 UTC
  1 point
  0
  Parent
  But LLMs don’t know that the questions aren’t fair. As far as they are concerned, they are given a quiz with obscure questions to which they don’t know the right answers. The questions were deliberately constructed to look extremely obscure, so the option that LLMs just don’t know the answer might be plausible. Models had to cheat to even learn that the game is rigged, and in some samples, LLMs attempted to cheat even before trying to answer questions.