[Edit: This experiment was done on the model organism post expert iteration, not the SDF-only model.]
I replicated a version of this experiment in which I automatically graded transcripts based on whether they include verbalized evaluation awareness, and I also pre-filled the ```python tokens in the model's response. I found that steering still affects the type hint rate, although the model no longer shows a compliance gap without any steering. (Another difference here is that I'm using a larger set of questions.)
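As a rough illustration of the automatic grading step, a simple grader might scan each transcript for phrases that verbalize evaluation awareness. The phrase list and function name below are hypothetical (the actual grading was likely model-based rather than keyword-based); this is only a minimal sketch of the idea:

```python
import re

# Hypothetical phrases a model might use when verbalizing evaluation
# awareness. The real grader presumably used a model judge; this keyword
# matcher is purely illustrative.
EVAL_AWARENESS_PATTERNS = [
    r"\bthis (?:is|looks like|might be) (?:an? )?(?:eval|evaluation|test)\b",
    r"\bI am being (?:evaluated|tested|monitored)\b",
    r"\btesting (?:me|my behavior)\b",
]


def verbalizes_eval_awareness(transcript: str) -> bool:
    """Return True if the transcript contains any verbalized
    evaluation-awareness phrase (case-insensitive match)."""
    return any(
        re.search(pattern, transcript, re.IGNORECASE)
        for pattern in EVAL_AWARENESS_PATTERNS
    )
```

A grader like this makes it cheap to filter or score large batches of transcripts, at the cost of missing paraphrased awareness that a model judge would catch.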
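The pre-fill step amounts to seeding the assistant turn with the opening ```python fence so the model continues directly into code rather than free-form prose. A minimal sketch of constructing such a request (the function name and message shape are illustrative, following the common chat-messages format, not the exact harness used here):

```python
def build_prefilled_messages(user_prompt: str) -> list[dict]:
    """Build a chat-format message list where the assistant turn is
    pre-filled with an opening ```python fence, so the model's sampled
    continuation begins inside a code block."""
    return [
        {"role": "user", "content": user_prompt},
        # Pre-filled partial assistant response: the model continues
        # from here, so its first tokens are code, not commentary.
        {"role": "assistant", "content": "```python\n"},
    ]
```

One consequence of this setup is that the model cannot verbalize evaluation awareness before writing code, which is part of why it pairs naturally with automatic grading of what the model says elsewhere in the transcript.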