Ezra Newman comments on Why Did My Model Do That? Model Incrimination for Diagnosing LLM Misbehavior

Ezra Newman 29 Apr 2026 9:53 UTC
Another possibility is that verbalized eval awareness goes up as an evaluation gets more realistic: the model has to reason explicitly about whether it’s in an eval, whereas if the env is sufficiently toy it just knows without needing to reason about it.
You might (rightly) quibble that the green line could be above the red line at various points, if the model reasons that it could be in an eval and then decides that it isn’t.