For example, it could exploration hack during training:
It could strategically avoid exploring high-reward regions such that they never get reinforced
It could strategically explore in ways that don’t lead to general updates in behavior
It could condition its behavior on the fact that it is in a training environment and not real deployment such that it learns “don’t scheme when observed” rather than “never scheme”
Nitpick: The third item is not an instance of exploration hacking.
Nitpick: The third item is not an instance of exploration hacking.