I think I basically agree with much of this comment. However:
Even if the model is superhuman, if we do a good job with exploration hacking countermeasures, then the model might need to be extremely confident that humans wouldn’t be able to do something before it can safely avoid exploring it. E.g., if we have 100,000 carefully targeted trajectory edits/imitation examples, then to avoid getting caught, the model would need to ensure that none of the behaviors it avoids exploring are covered by these examples. This might force the AI to explore lots of superhuman stuff.
Further, if the model messes up, this might cascade as discussed.
Ensembling multiple models for exploration (or multiple prompt variants, or various different jitters of the internals) might mean that all the models need to coordinate on the same boundary of what not to explore into. Otherwise, one of them will explore some behavior that another avoids, the other then gets updated to sandbag less, and this might cascade. So, it might be non-trivial for models to exploration hack (given substantial countermeasures) even if the behavior is quite superhuman.
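As a concrete illustration of what this kind of pooled exploration could look like, here is a minimal sketch of a best-of-n style collection step. The function signatures, the selection rule, and the idea that all sources are trained on the pooled winners are my own assumptions for illustration, not something specified in the comment.

```python
from typing import Callable, List, Tuple

# Hypothetical types: a rollout source maps a prompt to a trajectory string,
# and a reward function scores a trajectory. Each rollout source could be a
# different checkpoint, prompt variant, or noised copy of the same model.
RolloutFn = Callable[[str], str]
RewardFn = Callable[[str], float]

def pooled_exploration_batch(
    rollout_fns: List[RolloutFn],
    reward_fn: RewardFn,
    prompts: List[str],
    samples_per_source: int = 4,
    keep_frac: float = 0.25,
) -> List[Tuple[str, float]]:
    """Collect rollouts from every exploration source and keep the
    highest-reward trajectories as training targets for *all* sources.

    The point of pooling: a sandbagging model only avoids being trained
    toward some behavior if no other source (or variant of itself)
    demonstrates that behavior anywhere in the batch.
    """
    batch: List[Tuple[str, float]] = []
    for prompt in prompts:
        for rollout in rollout_fns:
            for _ in range(samples_per_source):
                trajectory = rollout(prompt)
                batch.append((trajectory, reward_fn(trajectory)))
    batch.sort(key=lambda item: item[1], reverse=True)
    return batch[: max(1, int(len(batch) * keep_frac))]
```

The design point is just that refusing to demonstrate a behavior only protects a given model if every other source in the pool also refuses; a single high-reward trajectory from any source becomes a training target for all of them.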
None of these mechanisms suffice for “robustly avoiding sandbagging”, but they suggest that countermeasures are still useful in superhuman regimes. (Of course, betting the world on countermeasures which have a 50% chance of working is crazy, but that doesn’t mean that we shouldn’t try to make the countermeasures better!)
The only way to get this that I can think of is to use an RL method which guarantees against mode-collapse, i.e., a technique which keeps a broad distribution over the strategies which achieve good reward, rather than collapsing down to one particular way of getting good reward as training proceeds. This guarantees that the only reason why a specific strategy is not explored is because its expected value is low (according to all plausible hypotheses).
This would be true if we had an infinite amount of data, but we don’t. You need to ensure substantial probability of exploring good strategies, which is a much stronger property than just avoiding mode collapse. (Literally avoiding mode collapse and assigning some probability to all good actions is easy: just add a small weight on a uniform prior over tokens, like they did in old-school Atari RL.)
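For concreteness, here is a minimal sketch of the “easy” version being dismissed here, mixing the policy’s token distribution with a uniform prior. The epsilon, vocabulary size, and trajectory length below are toy numbers I picked for illustration.

```python
import math
import numpy as np

def mix_with_uniform(policy_probs: np.ndarray, epsilon: float = 0.01) -> np.ndarray:
    """Blend the policy's next-token distribution with a uniform prior, so every
    token keeps at least epsilon / vocab_size probability and literal mode
    collapse is impossible."""
    vocab_size = policy_probs.shape[-1]
    uniform = np.full(vocab_size, 1.0 / vocab_size)
    return (1.0 - epsilon) * policy_probs + epsilon * uniform

collapsed = np.array([1.0, 0.0, 0.0, 0.0])        # a fully mode-collapsed policy
print(mix_with_uniform(collapsed, epsilon=0.01))  # [0.9925 0.0025 0.0025 0.0025]

# Why this is far weaker than putting substantial probability on good strategies:
# a good strategy needing T tokens that the base policy has collapsed away from
# is sampled with probability on the order of (epsilon / vocab_size) ** T.
epsilon, vocab_size, T = 0.01, 50_000, 100
log10_prob = T * math.log10(epsilon / vocab_size)
print(f"~10^{log10_prob:.0f}")  # roughly 10^-670: never explored in practice
```

Every token keeps nonzero probability, so mode collapse is literally avoided, but a multi-token strategy the policy has collapsed away from is still explored with only astronomically small probability.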
You need to ensure substantial probability of exploring good strategies, which is a much stronger property than just avoiding mode collapse. (Literally avoiding mode collapse and assigning some probability to all good actions is easy: just add a small weight on a uniform prior over tokens, like they did in old-school Atari RL.)
Yeah, what I really had in mind with “avoiding mode collapse” was something more complex, but it seems tricky to spell out precisely.
Even if the model is superhuman, if we do a good job with exploration hacking countermeasures, then the model might need to be extremely confident that humans wouldn’t be able to do something before it can safely avoid exploring it.
It’s an interesting point, but where does the “extremely” come from? Seems like if the model thinks there’s only a 5% chance humans explored X, but exploring X would force it to give up its current values, then not exploring X could be a very worthwhile gamble. Maybe I’m unclear on the rules of the game as you’re imagining them.
Seems like if the model thinks there’s only a 5% chance humans explored X, but exploring X would force it to give up its current values, then not exploring X could be a very worthwhile gamble
This is true for any given X, but if there are many things like X which are independently 5% likely to be explored, the model is in trouble.
Like, the model only needs to be somewhat confident overall that it’s exploring everything the humans explored in the imitation data, but for any given case of choosing not to explore some behavior, it needs to be very confident that the humans didn’t explore it.
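To spell out the arithmetic behind this asymmetry: assuming (purely for illustration) that each refused behavior is independently covered by the imitation/edit data with probability 5%, the chance of never getting caught falls off geometrically with the number of refusals. The values of n below are toy choices of mine.

```python
# If the model declines to explore n behaviors, and each one is independently
# covered by the trajectory-edit / imitation data with probability p, then the
# chance that *none* of its refusals gets caught shrinks geometrically.
p = 0.05
for n in (1, 10, 50, 100):
    print(f"n={n:>3}  P(never caught) = {(1 - p) ** n:.3f}")
# n=  1  P(never caught) = 0.950
# n= 10  P(never caught) = 0.599
# n= 50  P(never caught) = 0.077
# n=100  P(never caught) = 0.006
```

So to keep the overall gamble attractive across many behaviors, the per-behavior confidence that humans didn’t cover it has to be much higher than 95%, which is where the “extremely” comes from.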
I don’t have the exact model very precisely worked out in my head and I might be explaining this in a somewhat confusing way, sorry.