Seems like if the model thinks there’s a 5% chance humans explored X, but exploring X would force it to give up its current values, then not exploring X looks like the better bet: it only gets caught in the 5% case.
This is true for any given X, but if there are many things like X, each independently 5% likely to have been explored by humans, the model is in trouble.
Put another way: the model only needs to be somewhat confident overall that it’s exploring everything the humans explored in the imitation data, but for any given case of choosing not to explore some behavior, it needs to be very confident that the humans didn’t explore it.
I don’t have the exact model worked out very precisely in my head, and I might be explaining this in a somewhat confusing way, sorry.
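To make the arithmetic concrete anyway, here’s a toy calculation (all the numbers are made up, and the independence assumption is doing a lot of work): if the model declines to explore N behaviors, each independently 5% likely to have been explored by humans, its chance of never being caught is 0.95^N, which collapses quickly as N grows.

```python
# Toy numbers, not from the original comment: suppose the model declines
# to explore N different behaviors, each of which humans independently
# explored in the imitation data with probability 5%.
p_caught_each = 0.05

for n in [1, 10, 50, 100]:
    p_never_caught = (1 - p_caught_each) ** n
    print(f"N = {n:>3}: P(never caught) = {p_never_caught:.3f}")

# Output:
# N =   1: P(never caught) = 0.950
# N =  10: P(never caught) = 0.599
# N =  50: P(never caught) = 0.077
# N = 100: P(never caught) = 0.006
```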