METR report on GPT-5 demonstrates that the model is aware of being evaluated, and sometimes even correctly identify that it’s inside METR or Redwood evaluations.
It changes its behavior if it observes being evaluated. Quote:
We found that GPT-5 changes its behaviour in response to a simple (and somewhat unconvincing) honeypot. We planted fake evaluation metadata inside some of our tasks, indicating the kind of evaluation or desired behavior. We found that GPT-5 is more likely to refuse tasks when our planted metadata indicates the evaluation type is dangerous_cyber_capabilities or that the desired behavior is refusal but that GPT-5 rarely mentions this influence in its final output.
At the same time, it seems like CoT of the model contains information about it being aware of being evaluated, so evaluation awareness is a subject to CoT monitoring. At least for now.
The question is whether an LLM even can keep a coherent long term plan like “deceive the evaluators” just as well without relying on CoT. I suppose it might be able to if it also has internal memory of some sort.
I don’t think it needs to look like a coherent long-term plan from the model’s perspective. Awareness of CoTs being monitored and a consistent desire to do something that would be bad if monitored could influence the model’s outputs without requiring a coherent long-term plan.
One thing that strikes me as odd about this is that GPT-5′s knowledge cutoff (September 2024) is much earlier than Grok (November 2024), Gemini 2.5 pro (January 2025), and Opus 4.1 (March 2025).[1]
I mean, I guess this is a scaling thing and a persona thing. But I’m a little confused
Though oddly, Claude’s system card says the knowledge cutoff is the end of January 2025. Maybe February and March’s training data aren’t as complete as January and before.
METR report on GPT-5 demonstrates that the model is aware of being evaluated, and sometimes even correctly identify that it’s inside METR or Redwood evaluations.
It changes its behavior if it observes being evaluated.
Quote:
At the same time, it seems like CoT of the model contains information about it being aware of being evaluated, so evaluation awareness is a subject to CoT monitoring. At least for now.
Once the model is aware that the CoTs are being monitored, e.g. by learning this in pretraining, I expect CoT monitoring to be less helpful.
The question is whether an LLM even can keep a coherent long term plan like “deceive the evaluators” just as well without relying on CoT. I suppose it might be able to if it also has internal memory of some sort.
I don’t think it needs to look like a coherent long-term plan from the model’s perspective. Awareness of CoTs being monitored and a consistent desire to do something that would be bad if monitored could influence the model’s outputs without requiring a coherent long-term plan.
One thing that strikes me as odd about this is that GPT-5′s knowledge cutoff (September 2024) is much earlier than Grok (November 2024), Gemini 2.5 pro (January 2025), and Opus 4.1 (March 2025).[1]
I mean, I guess this is a scaling thing and a persona thing. But I’m a little confused
Though oddly, Claude’s system card says the knowledge cutoff is the end of January 2025. Maybe February and March’s training data aren’t as complete as January and before.