[Edit: This experiment was done on the model organism post expert iteration, not the SDF-only model.]
I replicated a version of this experiment in which I automatically graded transcripts based on whether they include verbalized evaluation awareness, and I also pre-filled the ```python tokens in the model's response. I found that steering still affects the type hint rate, although the model no longer shows a compliance gap without any steering. (Another difference here is that I'm using a larger set of questions.)
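As a rough illustration of the automatic grading step, a simple grader might scan each transcript for phrases that verbalize evaluation awareness. The phrase list and function name below are hypothetical (the actual grading was likely model-based rather than keyword-based); this is only a minimal sketch of the idea:

```python
import re

# Hypothetical phrases a model might use when verbalizing evaluation
# awareness. The real grader presumably used a model judge; this keyword
# matcher is purely illustrative.
EVAL_AWARENESS_PATTERNS = [
    r"\bthis (?:is|looks like|might be) (?:an? )?(?:eval|evaluation|test)\b",
    r"\bI am being (?:evaluated|tested|monitored)\b",
    r"\btesting (?:me|my behavior)\b",
]


def verbalizes_eval_awareness(transcript: str) -> bool:
    """Return True if the transcript contains any verbalized
    evaluation-awareness phrase (case-insensitive match)."""
    return any(
        re.search(pattern, transcript, re.IGNORECASE)
        for pattern in EVAL_AWARENESS_PATTERNS
    )
```

A grader like this makes it cheap to filter or score large batches of transcripts, at the cost of missing paraphrased awareness that a model judge would catch.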
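The pre-fill step amounts to seeding the assistant turn with the opening ```python fence so the model continues directly into code rather than free-form prose. A minimal sketch of constructing such a request (the function name and message shape are illustrative, following the common chat-messages format, not the exact harness used here):

```python
def build_prefilled_messages(user_prompt: str) -> list[dict]:
    """Build a chat-format message list where the assistant turn is
    pre-filled with an opening ```python fence, so the model's sampled
    continuation begins inside a code block."""
    return [
        {"role": "user", "content": user_prompt},
        # Pre-filled partial assistant response: the model continues
        # from here, so its first tokens are code, not commentary.
        {"role": "assistant", "content": "```python\n"},
    ]
```

One consequence of this setup is that the model cannot verbalize evaluation awareness before writing code, which is part of why it pairs naturally with automatic grading of what the model says elsewhere in the transcript.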