I’d propose also not doing this and instead trying to roughly understand what makes eval awareness increase and avoid this in general.
I also agree with this approach; my worry is that the incentive structures here are:
It’s always going to be cheap to generate synthetic environments like this
Currently (and my guess is going forward too, though I’m uncertain here), the ability of generation N models to generate environments “realistic enough” for generation N+1 models seems like it will always lag behind where we’d need it to be
The cheap generated synthetic environments will make your model seem more aligned while making misaligned actions harder and harder to spot
I’d be a bit more optimistic about some training intervention that reduced the salience of the features contributing to unverbalized evaluation awareness than about trying to get labs to avoid training on the cheap synthetic scenarios in the first place. The first step toward any training intervention is understanding in detail what causes the increase anyway, though, so I’m mostly just writing this out in case you already had ideas or intuitions here.