abramdemski comments on Gradations of Inner Alignment Obstacles

abramdemski 30 Apr 2021 15:20 UTC
LW: 4 AF: 4
(a) Maybe the deceptive ticket that makes T’ work is indeed there from the beginning, but maybe it’s outnumbered by ‘benign’ tickets, so that the overall behavior of the network is benign. This is an argument against premise 4, the idea being that even though the deceptive ticket scores just as well as the rest, it still loses out because it is outnumbered.
My overall claim is that attractor-basin type arguments need to address the base case. This seems like a potentially fine way to address the base case, if the math works out for whatever specific attractor-basin argument is being made. But if we're trying to avoid deception via methods which can only steer away from deception on the assumption that there isn't yet any deception, and a deceptive ticket is already there from the beginning, then we're in trouble: the technique's assumptions are violated.
(b) Maybe the deceptive ticket that makes T’ work is not deceptive from the beginning, but rather is made so by the training process T’.
Right, this seems in line with the original lottery ticket hypothesis, and would alleviate the concern. It doesn’t seem as consistent with the tangent space hypothesis, though.