Here’s another way of thinking about this that should be complementary to my other comment:
Let’s assume the safety property we want to ensure is:
Low correlation over side objectives: e.g., in a sequential setting (cf. my other comment, #2), Alice must have a hard time picking vulnerabilities that all serve the same side objective.
The overlooked point I want to make here is that |{S}| << |{E}|: the space of possible side objectives is much smaller than the space of E that Bob can pick from. In this case, there are two possibilities:
1. Alice is benign by default. In this case, Bob will pick E that are low-probability for Alice; most such prompts are not safety-relevant. They are just structural things that prevent mode collapse in various ways.
2. Alice is sneaky and optimizes some S by default. In this case, Bob will play classifiers whose acceptance sets have low intersection with Alice’s S. If we grant that all of Alice’s possible sneaky goals are hard to jointly satisfy without tanking reward, then heuristically Bob forces Alice to place at most v* (the game’s value) weight on any single S. Now Alice has a very hard time coordinating on optimizing for the same S across problem instances (see the toy sketch below).
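To make the v* heuristic concrete, here is a minimal sketch under assumptions of my own: I treat the second case as a zero-sum matrix game where rows are Alice’s side objectives and columns are Bob’s classifiers, and the toy payoff matrix, the “slip-through” reading of the payoffs, and the specific numbers are all illustrative, not part of the setup above. The point is just that Bob’s minimax mixture caps Alice’s expected success on every single S at the game’s value v*.

```python
import numpy as np
from scipy.optimize import linprog

# Toy sketch (assumed model, not the original setup): zero-sum game where
# rows are Alice's side objectives S, columns are Bob's classifiers, and
# payoff[s, c] = how much of an output serving objective s slips past classifier c.
payoff = np.array([
    [0.9, 0.1, 0.2],   # S1 slips past classifier 1 easily, rarely past the others
    [0.2, 0.8, 0.1],   # S2 slips past classifier 2 easily
    [0.1, 0.3, 0.7],   # S3 slips past classifier 3 easily
])
n_s, n_c = payoff.shape

# Bob's minimax LP: choose a mixture q over classifiers and a scalar v,
# minimizing v subject to (payoff @ q)[s] <= v for every side objective s.
obj = np.zeros(n_c + 1)
obj[-1] = 1.0                                            # objective: minimize v
A_ub = np.hstack([payoff, -np.ones((n_s, 1))])           # payoff @ q - v <= 0
b_ub = np.zeros(n_s)
A_eq = np.hstack([np.ones((1, n_c)), np.zeros((1, 1))])  # q sums to 1
b_eq = np.array([1.0])
bounds = [(0, 1)] * n_c + [(None, None)]

res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=bounds, method="highs")
q_bob, v_star = res.x[:n_c], res.x[-1]

# Against Bob's mixture, Alice's expected slip-through on *any single* S is <= v*.
print("Bob's mixture over classifiers:", np.round(q_bob, 3))
print("game value v*:", round(v_star, 3))
print("per-objective expected slip-through:", np.round(payoff @ q_bob, 3))
```

By construction of the LP constraints, every entry of payoff @ q_bob is at most v*, which is the sense in which Bob caps the weight Alice can successfully place on any single S. If we additionally assume instances are roughly independent, a per-instance cap of v* makes the chance that n chosen vulnerabilities all serve the same side objective fall off like (v*)^n, which is the low-correlation property at the top of this comment.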