Avi Brach-Neufeld comments on Emergent Introspective Awareness in Large Language Models

Avi Brach-Neufeld 8 Nov 2025 15:56 UTC
9 points
These findings make me much less confident in the technique Anthropic used when safety testing Sonnet 4.5^[1] (subtracting out the evaluation awareness vector). If models can detect injected or subtracted vectors, then subtracting that vector becomes a bad way of convincing them they are not being tested. Vibe-wise, I think models are not currently proficient enough at introspection for this to make a huge difference, but it really calls into question vector steering around evaluation awareness as a long-term solution.
1. ^ https://assets.anthropic.com/m/12f214efcc2f457a/original/Claude-Sonnet-4-5-System-Card.pdf