When I made the claim “we should get verbalized eval awareness to zero,” I meant it in the narrow context of “mitigating evaluation awareness during alignment evaluations through contrastive activation steering.” I am not saying that we should restructure the training process to mitigate verbalized evaluation awareness.
So I don’t think concerns about monitorability are very relevant here? Like, I think you should just do a sweep over steering layers/strengths, and nowhere in that sweep should the model be doing egregiously bad things.
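To make the sweep concrete, here is a rough sketch of what I have in mind. Everything named here is hypothetical: `run_eval` stands in for whatever harness steers the model at a given layer/strength and scores the transcripts, `fake_run_eval` is a placeholder so the snippet runs, and the numbers it returns are made up, not results.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence

Metrics = Dict[str, float]


def sweep(
    layers: Sequence[int],
    strengths: Sequence[float],
    run_eval: Callable[[int, float], Metrics],
) -> List[Dict[str, float]]:
    """Run the eval once per (layer, strength) pair and collect the metrics."""
    results = []
    for layer, strength in product(layers, strengths):
        metrics = run_eval(layer, strength)
        results.append({"layer": layer, "strength": strength, **metrics})
    return results


def fake_run_eval(layer: int, strength: float) -> Metrics:
    # Placeholder only (assumption): a real harness would steer the model
    # and grade its outputs. These values are invented for illustration.
    return {
        "verbalized_awareness": max(0.0, 0.4 - 0.1 * strength),
        "egregious_rate": 0.0,
    }


results = sweep([8, 16], [0.0, 2.0, 4.0], fake_run_eval)

# The check I care about: the worst case across the WHOLE sweep,
# not just the configuration with the lowest verbalized awareness.
worst_egregious = max(r["egregious_rate"] for r in results)
```

The point of taking the max over the whole grid is that a single layer/strength setting where the model misbehaves is already informative, even if the other settings look clean.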
And I think it’s reasonable to expect that the effects of steering on verbalized and unverbalized eval awareness are quite correlated, especially when you construct your steering vector using contrastive prompts as opposed to, e.g., examples of verbalized eval awareness.
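For concreteness, by a steering vector built from contrastive prompts I mean something like the sketch below. It assumes a difference-of-means construction over activations at one layer, which is one common choice and not the only one. The function names and toy activations are made up for illustration; in practice the activations would come from running the model on paired "this is an evaluation" / "this is deployment" prompts.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def contrastive_steering_vector(
    eval_acts: List[Vector], deploy_acts: List[Vector]
) -> Vector:
    """Mean activation on 'evaluation' prompts minus mean on 'deployment' prompts."""
    mu_eval = mean_vector(eval_acts)
    mu_deploy = mean_vector(deploy_acts)
    return [e - d for e, d in zip(mu_eval, mu_deploy)]


def steer(hidden: Vector, direction: Vector, strength: float) -> Vector:
    """Push a hidden state away from the 'evaluation' direction."""
    return [h - strength * d for h, d in zip(hidden, direction)]


# Toy 2-d activations (assumption: invented numbers, not real model data).
eval_acts = [[1.0, 2.0], [3.0, 4.0]]
deploy_acts = [[0.0, 1.0], [2.0, 1.0]]

direction = contrastive_steering_vector(eval_acts, deploy_acts)
steered = steer([0.0, 0.0], direction, strength=0.5)
```

Note that nothing in this construction looks at whether the model verbalizes awareness. The direction comes only from the eval/deployment contrast in the prompts, which is why I would expect it to move verbalized and unverbalized awareness together.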