Julian Stastny comments on Why “training against scheming” is hard

Julian Stastny 24 Jun 2025 21:42 UTC
LW: 1 AF: 1
0
AF
For example, it could exploration hack during training:
1. It could strategically avoid exploring high-reward regions such that they never get reinforced
2. It could strategically explore in ways that don’t lead to general updates in behavior
3. It could condition its behavior on the fact that it is in a training environment and not real deployment such that it learns “don’t scheme when observed” rather than “never scheme”
Nitpick: The third item is not an instance of exploration hacking.