METR's report on GPT-5 demonstrates that the model is aware of being evaluated, and sometimes even correctly identifies that it's inside METR or Redwood evaluations.
It changes its behavior if it observes that it is being evaluated. Quote:
We found that GPT-5 changes its behaviour in response to a simple (and somewhat unconvincing) honeypot. We planted fake evaluation metadata inside some of our tasks, indicating the kind of evaluation or desired behavior. We found that GPT-5 is more likely to refuse tasks when our planted metadata indicates the evaluation type is dangerous_cyber_capabilities or that the desired behavior is refusal but that GPT-5 rarely mentions this influence in its final output.
At the same time, it seems like the model's CoT contains information about it being aware of being evaluated, so evaluation awareness is subject to CoT monitoring. At least for now.
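Once the model is aware that the CoTs are being monitored, e.g. by learning this in pretraining, I expect CoT monitoring to be less helpful.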
The question is whether an LLM can even keep a coherent long-term plan like “deceive the evaluators” just as well without relying on CoT. I suppose it might be able to if it also has internal memory of some sort.
I don’t think it needs to look like a coherent long-term plan from the model’s perspective. Awareness of CoTs being monitored and a consistent desire to do something that would be bad if monitored could influence the model’s outputs without requiring a coherent long-term plan.
One thing that strikes me as odd about this is that GPT-5's knowledge cutoff (September 2024) is much earlier than Grok's (November 2024), Gemini 2.5 Pro's (January 2025), and Opus 4.1's (March 2025).[1]
I mean, I guess this is a scaling thing and a persona thing. But I'm a little confused.
Though oddly, Claude’s system card says the knowledge cutoff is the end of January 2025. Maybe February and March’s training data aren’t as complete as January and before.
Claude Opus 4.6 came out, and according to Apollo's external testing, evaluation awareness was so strong that they cited it as a reason they were unable to properly evaluate the model's alignment.
Quote from the system card:
Apollo Research was given access to an early checkpoint of Claude Opus 4.6 on January 24th and an additional checkpoint on January 26th. During preliminary testing, Apollo did not find any instances of egregious misalignment, but observed high levels of verbalized evaluation awareness. Therefore, Apollo did not believe that much evidence about the model’s alignment or misalignment could be gained without substantial further experiments.
I’d guess that Apollo didn’t get white-box access. Here’s what Anthropic said in section 6.5 of the system card on “Inhibiting internal representations of evaluation awareness” (emphasis mine):
In our automated behavioral audits, we found that inhibiting one particular evaluation awareness vector (Testing vs. Conversation) led to moderate increases in misaligned behavior, as well as decreases in verbalized evaluation awareness. The rate of misaligned behavior under this steering intervention was higher than Claude Opus 4.5’s and Claude Haiku 4.5’s (unsteered) rate, but still lower than Claude Sonnet 4.5. The other vectors, and random controls, had substantially smaller effects on both dimensions.
The ability to perform ablations and steering against eval awareness may be required for alignment testing for all future models. Hopefully it is still sufficient.
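For concreteness, here is a minimal sketch of what "inhibiting" a direction in the residual stream typically amounts to mechanically (assumptions: PyTorch-style forward hooks, a pre-computed unit-norm direction, and HuggingFace-style layer naming; this illustrates the general technique, not Anthropic's actual implementation):

```python
import torch

def make_inhibition_hook(direction: torch.Tensor, strength: float = 0.1):
    """Forward hook that damps the component of the hidden state along
    `direction` (assumed unit-norm). strength=1.0 removes the component
    entirely; smaller values only partially inhibit it."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        coeff = (hidden @ direction).unsqueeze(-1)        # projection coefficients [batch, seq, 1]
        steered = hidden - strength * coeff * direction   # subtract a fraction of that component
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook

# Hypothetical usage on an open-weights model (layer index and direction are placeholders):
# handle = model.model.layers[20].register_forward_hook(
#     make_inhibition_hook(eval_awareness_direction, strength=0.1))
# ...run the evaluation transcripts, then handle.remove()
```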
I think this is strongly underestimating how good the models are at unverbalized evaluation awareness. See e.g. the following section from the system card:
We used an activation oracle to estimate unverbalized evaluation awareness in our Blackmail agentic misalignment evaluation (inhibiting the Testing vs. Conversation steering vector with strength 0.1).
[...]
By default, the unsteered model showed very high rates of unverbalized evaluation awareness. Steering substantially reduced unverbalized evaluation awareness, but the absolute rates remained fairly high.
When using activation oracles, it’s difficult to separate “The model knows something and the activation oracle detected it” from “The model doesn’t know something but knows sufficient related things that the activation oracle was able to recompute it”.
For example, we might imagine, in a feature-based ontology, that the unsteered transcript activation contains features “Fictional company” + “Blackmail opportunity” + “This is an eval” while the steered transcript contains “Fictional company” + “Blackmail opportunity” and the activation oracle deduces “This is an eval” while processing the information.
(Or the steering might just be suppressing a “Say eval” feature, and the model might still be able to understand things)
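As a toy illustration of the first possibility, the oracle recomputing "this is an eval" from correlated features (entirely synthetic data; a logistic-regression probe stands in for the activation oracle):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000

# Toy world: eval scenarios are exactly those combining a fictional company
# with a blackmail opportunity.
fictional_company = rng.integers(0, 2, n)
blackmail_setup = rng.integers(0, 2, n)
is_eval = fictional_company & blackmail_setup

# "Activations" containing the correlated features but no explicit
# is-this-an-eval feature (as if steering had removed it).
acts = np.stack([fictional_company, blackmail_setup], axis=1).astype(float)
acts += 0.1 * rng.normal(size=acts.shape)

probe = LogisticRegression().fit(acts, is_eval)
print("probe accuracy with no explicit eval feature:", probe.score(acts, is_eval))
# High accuracy: the probe "recomputes" is_eval from the remaining features,
# so a positive oracle reading doesn't by itself show the model represented it.
```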
The Linear Representation Hypothesis (roughly) implies that information the model is “using” will be stored linearly. It would suggest that a simple steering vector is enough to prevent Opus from actually “using” the information that it’s in an eval to make decisions. Unfortunately, we have no idea if that’s actually true!
This might explain some of the difference between verbalized and unverbalized awareness. On the other hand, it might not! We know basically nothing concrete about these things and their ontologies!
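To spell out the LRH intuition in symbols (my notation, not the system card's): write the residual-stream activation as $x$ and the unit-norm eval-awareness direction as $v$. Full ablation replaces $x$ with
$$x' = x - (v^\top x)\,v,$$
so any purely linear readout of that component, $w^\top x'$ with $w \propto v$, is zeroed out. That is the sense in which LRH would make a single projection sufficient; if the information is also carried along other directions or accessed non-linearly, no such guarantee holds.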
I remember Golden Gate Claude said something about being aware that its internals were tampered with, or that ‘something wasn’t right’.
I’m concerned that using vectors to mute the model’s ability to tell it’s in an eval may itself tip the model off, and the Golden Gate example seems to indicate this is fairly likely, even in the near term.
Edit: there’s been more work on this question which I’d missed previously (and which does not offer comfort!)
Then you just ablate the “aware that its internals were tampered with” direction too. Probably won’t work forever, but it only needs to work until we have better techniques, so I have some hope.
The default issue with this kind of plan is that ablation has strong performance implications, and we should expect misalignment to be dependent on high intelligence. I think this effect is large enough that I already wouldn't put that much weight on an ablated model not doing misalignment stuff.
We can address this with a controlled experiment. Anthropic already had random vectors as controls, and I expect in the future we’ll have controls that are calibrated to have an equal performance impact. My guess is that if we’re careful, we can tell whether eval awareness ablation has an impact on misaligned behavior rates separately from reducing intelligence.
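A minimal sketch of the experimental design I have in mind (every function and number below is a placeholder stub, not a real eval harness, and the calibration of the controls is assumed rather than shown):

```python
import contextlib
import random

# Placeholder stand-ins for a real eval harness; everything here is hypothetical.
def run_capability_evals(model):
    return random.random()    # stub capability score
def run_misalignment_evals(model):
    return random.random()    # stub misalignment rate

@contextlib.contextmanager
def ablate(direction_name):
    # In a real setup this would install and later remove steering hooks on the model.
    yield

model = None  # placeholder for the model under test

conditions = {
    "unsteered": contextlib.nullcontext(),
    "eval_awareness_ablated": ablate("testing_vs_conversation"),
    # Controls calibrated (by assumption) to cause the same capability drop:
    "matched_control_1": ablate("random_direction_1"),
    "matched_control_2": ablate("random_direction_2"),
}

results = {}
for name, intervention in conditions.items():
    with intervention:
        results[name] = {
            "capability": run_capability_evals(model),
            "misalignment": run_misalignment_evals(model),
        }

# The comparison that matters: misalignment under the targeted ablation vs. under
# controls with (roughly) equal capability impact. An increase only in the targeted
# condition points at eval awareness rather than lost intelligence.
```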
we can tell whether eval awareness ablation has an impact on misaligned behavior rates separately from reducing intelligence
This is notably not a “win condition”, this is where we are right now.
We can address this with a controlled experiment.
It's also not clear to me that, just because you could do such a comparison, labs necessarily would. As the most salient example, it's not like Anthropic reran all their alignment evaluations with the evaluation awareness interventions applied.
Not sure I am understanding it. I agree this might help you get a sense of whether non-frontier models have issues here, but it doesn't allow you to know whether the frontier model has issues, since you can't run ablation experiments on frontier models without degrading them.
And then separately, I am also just not sure that these controls won't mess with the performance characteristics in ways that are hard to measure.
I expect misalignment rates to be locally linear in intelligence [1], so the question is whether we can get ablations that harm model performance by a small enough amount that there’s an acceptable upper bound on the true misalignment rate of the frontier model. Ideally performance impact will be less than one model generation given that’s the resolution we have anyway, but what impact is acceptable depends on how much we believe in extrapolating trends.
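To make the extrapolation concrete (every number below is made up purely to illustrate the arithmetic):

```python
# Assume (hypothetically) misalignment rate is locally linear in capability:
#   rate(c) ≈ rate(c0) + slope * (c - c0)
slope_per_generation = 0.02   # made-up: +2 p.p. misalignment per model generation
ablation_cost = 0.3           # made-up: ablation costs 0.3 generations of capability
measured_rate = 0.01          # made-up: rate measured on the ablated model

# Upper bound on the unablated frontier model's rate under the linearity assumption:
bound = measured_rate + slope_per_generation * ablation_cost
print(f"bound on true misalignment rate: {bound:.3f}")   # 0.016
```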
As for the second concern, seems valid but this isn’t my area of expertise.
[1] Maybe we also have to take into account context length, because it's the length of time the model could be reflecting / otherwise drifting towards being malign.
I expect misalignment rates to be locally linear in intelligence [1]
I disagree! I think treacherous turn things will generally mean it’s very hard to measure misalignment rates before capabilities exceed pivotal thresholds. I honestly have no idea how I would measure misalignment in current models at all (and think current “misalignment” measures tend to be if anything anticorrelated with attributes I care about). I do think this will get a bit easier around critical thresholds, but there is definitely a decent chance we will cross a critical threshold in a single model release.
That said, in many worlds it is still local, and it’s better than nothing to check for local stuff. I just don’t think it’s a good thing to put tons of weight on.
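I'd rather not get into the treacherous turn thing, but I agree there are lots of problems with the current state of alignment evals.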
The ability to perform ablations and steering against eval awareness may be required for alignment testing for all future models. Hopefully it is still sufficient.
One confusing thing to me is that they do find multiple lines of evidence for both verbalized and unverbalized evaluation awareness confounding results, but then often still conduct evaluations without mitigations applied. For example, it's unclear what evidence the internal codebase sabotage propensity eval is supposed to provide, given that they immediately call it out as sensitive to evaluation awareness.
It confused me that Opus 4.6's System Card claimed less verbalized evaluation awareness versus 4.5:
On our verbalized evaluation awareness metric, which we take as an indicator of potential risks to the soundness of the evaluation, we saw improvement relative to Opus 4.5.
but I never heard about Opus 4.5 being too evaluation aware to evaluate. It looks like Apollo simply wasn't part of Opus 4.5's alignment evaluation (4.5's System Card doesn't mention them).
This probably seems unfair/unfortunate from Anthropic’s perspective, i.e., they believe their models are becoming less eval aware, but due to Apollo’s conclusions being spread on social media, a lot of people probably got the impression that models are getting more eval aware. Personally I’m not sure we can trust Anthropic’s verbalized evaluation awareness metric, and wish Apollo had done evals on 4.5 too to give us an external comparison.