Regarding "Strategy: train a reporter which isn't useful for figuring out what the human will believe / Counterexample: deliberately obfuscated human simulator":
If you put the human-interpreter-blinding before the predictor instead of between the predictor and the reporter, then whether or not the blinding produces an obfuscated human simulator, we know the predictor isn’t making use of human simulation.
An obfuscated human simulator would still make for a rather bad predictor.
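To make the ordering concrete, here is a minimal sketch of the two pipeline arrangements. All function names here are illustrative placeholders I'm introducing, not anything from the original report:

```python
def blind(x):
    # Placeholder for a transformation meant to strip out information
    # that would be useful for simulating the human supervisor.
    return {"blinded": x}

def predictor(x):
    # Placeholder predictor: maps (possibly blinded) observations to a
    # latent state used for prediction.
    return {"state": x}

def reporter(z):
    # Placeholder reporter: answers questions from the predictor's state.
    return {"answer": z}

def original_pipeline(obs):
    # Blinding sits between predictor and reporter, so the predictor
    # itself is still free to contain a human simulator.
    return reporter(blind(predictor(obs)))

def variant_pipeline(obs):
    # Blinding sits before the predictor, so whatever the predictor
    # computes, it cannot be making use of human simulation.
    return reporter(predictor(blind(obs)))
```

The point of the variant ordering is that even if `blind` merely obfuscates rather than removes the human-simulation-relevant information, that obfuscation now degrades the predictor's own task performance, which training pressure works against.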
I think this proposal might perform slightly better than a setup where we expand the training environment to include situations the supervisor doesn't understand, and train the controller to take no action whenever the human supervisor reports that they don't understand the data. The behaviour you'd get in that setup is similar to what you'd get with a perfectly obfuscated human simulator, and in practice you probably wouldn't get a perfectly obfuscated simulator. Maybe it could nevertheless achieve a similar level of safety, in terms of guarantees against exploiting gaps in the supervisor's knowledge. I could be wrong on either point; I've only thought about it a little.