Daniel Kokotajlo comments on Gradations of Inner Alignment Obstacles

Daniel Kokotajlo 29 Apr 2021 13:02 UTC
“Part of my idea for this post was to go over different versions of the lottery ticket hypothesis, as well, and examine which ones imply something like this. However, this post is long enough as it is.”
I’d love to see you do this!
Re: The Treacherous Turn argument: What do you think of the following spitball objections:
(a) Maybe the deceptive ticket that makes T’ work is indeed there from the beginning, but maybe it’s outnumbered by ‘benign’ tickets, so that the overall behavior of the network is benign. This is an argument against premise 4, the idea being that even though the deceptive ticket scores just as well as the rest, it still loses out because it is outnumbered.
(b) Maybe the deceptive ticket that makes T’ work is not deceptive from the beginning, but rather is made so by the training process T’. If instead you just give it T, it does not exhibit malign off-T behavior. (Analogy: Maybe I can take you and brainwash you so that you flip out and murder people when a certain codeword reaches your ear, and otherwise act completely normally, so that you’d have reacted to everything in your life so far exactly as you in fact did. If so, then the “ticket” that makes this possible is already present inside you, even now as you read these words! But the “ticket” is just you. And you won’t actually flip out and murder people if the codeword reaches your ear, because you haven’t in fact been brainwashed.)
- abramdemski 30 Apr 2021 15:20 UTC
  “(a) Maybe the deceptive ticket that makes T’ work is indeed there from the beginning, but maybe it’s outnumbered by ‘benign’ tickets, so that the overall behavior of the network is benign. This is an argument against premise 4, the idea being that even though the deceptive ticket scores just as well as the rest, it still loses out because it is outnumbered.”
  My overall claim is that attractor-basin type arguments need to address the base case. This seems like a potentially fine way to address the base case, if the math works out for whatever specific attractor-basin argument is in play. If we’re trying to avoid deception via methods which can steer away from deception if we assume there’s not yet any deception, then we’re in trouble when a deceptive ticket is already present: the technique’s assumptions are violated.
  “(b) Maybe the deceptive ticket that makes T’ work is not deceptive from the beginning, but rather is made so by the training process T’.”
  Right, this seems in line with the original lottery ticket hypothesis, and would alleviate the concern. It doesn’t seem as consistent with the tangent space hypothesis, though.