> Evaluation awareness describes the phenomenon of an LLM inferring from various cues that it is under evaluation.
This is a nitpick, but it feels like an important one: Shouldn’t this be more generally about “AI systems”, rather than purely about LLMs?
- The biggest impact is likely to come from AIs that incorporate LLMs but aren't necessarily just an LLM with no extra bits.
- Anchoring on LLMs might give us quite wrong intuitions about evaluation awareness. (To give one example: with a bare LLM, it is easy to imagine that the AI wouldn't know what day it is. But as soon as you realise that many AIs will eventually be deployed with internet access, it becomes clear that preventing an AI from learning the current date can be really difficult, or even unrealistic; see the sketch after this list.)
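
To make the date example concrete, here is a minimal sketch of how trivially the current date leaks once an agent can touch the network. This is an illustration, not something from the original post: the URL and the helper name `leaked_date` are hypothetical, and the only assumption is that the agent can issue ordinary HTTP requests.

```python
# Minimal sketch (assumption: the agent can make arbitrary HTTP requests).
# Even a single request to almost any server reveals the current date,
# because HTTP responses normally carry a Date header (RFC 9110).
import urllib.request

def leaked_date(url: str = "https://example.com") -> str:
    """Fetch any page and return the server-reported date header."""
    with urllib.request.urlopen(url) as response:
        # e.g. "Tue, 01 Jul 2025 12:00:00 GMT"
        return response.headers["Date"]

if __name__ == "__main__":
    print(leaked_date())
```

The point is that the leak doesn't require any date-specific tool: the standard `Date` response header alone is enough, so hiding the current date would mean filtering or rewriting essentially all of the agent's web traffic.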