This is a fair point. One additional data point: UK AISI's investigation into evaluation gaming in open-weight models. They similarly observed that Kimi K2.5 exhibited comparable propensities to blackmail with and without verbalized evaluation awareness.
They also share another figure, showing rates of verbalized evaluation awareness for Kimi K2.5. This seems to be ~25%.
While we only have two data points (our funding-email environment and UK AISI's study on the Agentic Misalignment environment from Anthropic), we can make some observations:
1) We see two different environments where Kimi K2.5 exhibits similar behavioral traits regardless of whether it verbalizes evaluation awareness.
2) If Kimi K2.5 is fully evaluation-aware in both scenarios, why is there such a large difference in verbalized evaluation-awareness rates? In the funding-email environment this was ~7%, whereas in the Agentic Misalignment environment it was ~25%. FWIW, LLMs are confusing, and there can be many reasons for this discrepancy.
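As a quick sanity check on point 2, one can ask whether a ~7% vs ~25% gap could plausibly be sampling noise. The actual rollout counts aren't reported here, so the sample sizes below are purely hypothetical (100 per environment); this is just a stdlib-only two-proportion z-test sketch, not the methodology used in either study.

```python
import math

def two_proportion_z(k1, n1, k2, n2):
    """Two-proportion z-test: could the gap in rates be sampling noise?"""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)  # pooled rate under the null hypothesis
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical: 7/100 verbalized in the funding-email environment,
# 25/100 in the Agentic Misalignment environment (real counts unknown).
z, p = two_proportion_z(7, 100, 25, 100)
print(f"z = {z:.2f}, p = {p:.4f}")
```

At these assumed sample sizes the gap would be far too large to be noise, which is what makes the discrepancy worth explaining rather than shrugging off.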
Throwing in another possibility: if Kimi is indeed exhibiting simulation awareness rather than evaluation awareness proper, this might be an artifact of distillation, which we know Moonshot has done on Anthropic models (I'm not sure whether there's any insight into whether GLM was distilled too).
Anyway, it would be interesting to investigate this further; I think it's still an open question.
Another possibility is that verbalized eval awareness goes up as an evaluation gets more realistic, because the model has to reason about whether it's in an eval, rather than just knowing without needing to reason about it when the environment is sufficiently toy.
You might (rightly) quibble that the green line could be above the red line at various points, if the model reasons that it could be in an eval and then decides that it isn’t.