You need to ensure substantial probability on exploring good strategies, which is a much stronger property than just avoiding mode collapse. (Literally avoiding mode collapse and assigning some probability to all good actions is easy: just add a small weight on a uniform prior over tokens, like they did in old-school Atari RL.)
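To make that parenthetical concrete, here is a minimal sketch (Python, with made-up names; purely illustrative, not anyone's actual setup) of the uniform-mixing trick: blending a small weight of a uniform distribution over tokens into the policy guarantees every token keeps nonzero probability, but it does essentially nothing to put substantial probability on good strategies.

```python
import numpy as np

def mix_with_uniform(policy_probs: np.ndarray, epsilon: float = 0.01) -> np.ndarray:
    """Blend a policy's token distribution with a uniform prior.

    Every token ends up with probability >= epsilon / vocab_size, so literal
    mode collapse is impossible, but good strategies get almost no extra mass.
    """
    vocab_size = policy_probs.shape[-1]
    uniform = np.full(vocab_size, 1.0 / vocab_size)
    return (1.0 - epsilon) * policy_probs + epsilon * uniform

# Toy example: a collapsed policy that puts almost all mass on one token.
collapsed = np.array([0.997, 0.001, 0.001, 0.001])
print(mix_with_uniform(collapsed, epsilon=0.01))
# Every token now has probability >= 0.0025, but the distribution is still
# nowhere near "substantial probability on exploring good strategies".
```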
Yeah, what I really had in mind with “avoiding mode collapse” was something more complex, but it seems tricky to spell out precisely.
Even if the model is superhuman, if we do a good job with exploration-hacking countermeasures, then the model might need to be extremely confident that humans wouldn't be able to do something before it can safely avoid exploring it.
It’s an interesting point, but where does the “extremely” come from? Seems like if it thinks there’s a 5% chance humans explored X, but (if not, then) exploring X would force it to give up its current values, then not exploring X could be a very worthwhile gamble. Maybe I’m unclear on the rules of the game as you’re imagining them.
Seems like if it thinks there’s a 5% chance humans explored X, but (if not, then) exploring X would force it to give up its current values
This is true for any given X, but if there are many things like X, each independently 5% likely to have been explored, the model is in trouble.
Like, the model only needs to be somewhat confident that it’s exploring everything the humans explored in the imitation data, but for any given case of choosing not to explore some behavior, it needs to be very confident that the humans didn’t explore it.
I don’t have the exact model very precisely worked out in my head and I might be explaining this in a somewhat confusing way, sorry.
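A rough back-of-the-envelope version of this asymmetry (a hypothetical sketch with illustrative numbers, assuming each withheld behavior is independently 5% likely to appear in the human exploration data):

```python
# Hypothetical numbers: each behavior the model declines to explore has an
# independent 5% chance of already appearing in the human exploration data.
p_explored = 0.05

for n_withheld in (1, 5, 20, 50, 100):
    # Probability that at least one withheld behavior was in fact explored
    # by humans, i.e. the model's non-exploration is detectable somewhere.
    p_caught = 1.0 - (1.0 - p_explored) ** n_withheld
    print(f"withhold {n_withheld:3d} behaviors -> caught with prob {p_caught:.2%}")

# withhold 1 -> ~5%; withhold 20 -> ~64%; withhold 100 -> ~99.4%
# So a 5% risk is a reasonable gamble for any single X, but taking that gamble
# across many independent Xs is close to a guaranteed loss.
```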