TekhneMakre comments on AGI Ruin: A List of Lethalities

TekhneMakre 12 Jun 2022 21:53 UTC
4 points
Why wouldn’t the explainer just copy the latent vector, and the explainee just learn to do the task in the same way the original model does it? Or more generally, why does this put any pressure towards “explaining the reasons/motives” behind the original model’s actions? I think you’re thinking that by using a pre-trained GPT3-alike as the explainer model, you start off with something a lot more language-y, so that language-y concepts are easy pickings for the training process to find and use to “communicate” between the original model and the explainee model. This seems not totally crazy, but:
1. it seems to buy you, not anything like further explanations of reasons/motives beyond what’s “already in” the original model, but rather at most a translation into the explainer’s initial pre-trained internal language;
2. the explainer’s initial idiolect stays unexplained / unmotivated;
3. the training procedure doesn’t put pressure towards explanation, and does put pressure towards copying.
- eeegnu 13 Jun 2022 5:02 UTC
  1 point
  These are great points, and ones which I did actually think about when I was brainstorming this idea (if I understand them correctly). I intend to write out a more thorough post on this tomorrow with clear examples (I originally imagined this as extracting deeper insights into chess), but to answer these:
  1. I did think about these as translators for the actions of models into natural language, though I don’t get the point about extracting things beyond what’s in the original model.
  2. I mostly glossed over this part in the brief summary, and the motivation I had for it comes from how (unexpectedly?) it works for GANs to just start with random noise, and in the process the generator and discriminator both still improve each other.
  3. My thought here was for the explainer model’s update error vector to come from judging the learner model on new, unseen tasks without the explanation (i.e., how similar its outputs are to the original model’s outputs). In this way the explainer gets little benefit from just giving the answer directly, since the learner will be tested without it; but if the explanation in any way helps the learner learn, it’ll improve the learner’s performance more (this is basically what the entire idea hinges on).
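A minimal sketch of the loop described in point 3, under heavy assumptions that are not in the thread: the original model is a toy linear rule, the "explanation" is a single number per task, the learner is a two-parameter regressor, and the explainer is updated by black-box hill climbing (the thread does not say how the held-out error would actually reach the explainer). All function names are hypothetical.

```python
import random

def original_model(x):
    # Hypothetical stand-in for the fixed model whose behavior is being explained.
    return 3.0 * x + 1.0

def explain(params, x):
    # Toy explainer: maps a task x to a one-number "explanation".
    p, q = params
    return p * x + q

def train_learner(params, train_tasks, epochs=200, lr=0.05):
    # A fresh learner (a, b) is trained only on the explainer's messages.
    a, b = 0.0, 0.0
    for _ in range(epochs):
        for x in train_tasks:
            err = (a * x + b) - explain(params, x)
            a -= lr * err * x
            b -= lr * err
    return a, b

def heldout_error(params, train_tasks, heldout_tasks):
    # The explainer's training signal: the learner is tested WITHOUT
    # explanations on unseen tasks and scored by its squared distance
    # from the original model's outputs.
    a, b = train_learner(params, train_tasks)
    total = sum(((a * x + b) - original_model(x)) ** 2 for x in heldout_tasks)
    return total / len(heldout_tasks)

def improve_explainer(params, train_tasks, heldout_tasks,
                      steps=300, scale=0.3, seed=0):
    # Assumption: simple hill climbing on the explainer's parameters,
    # keeping a candidate only if the held-out error goes down.
    rng = random.Random(seed)
    best = heldout_error(params, train_tasks, heldout_tasks)
    for _ in range(steps):
        cand = (params[0] + rng.gauss(0, scale),
                params[1] + rng.gauss(0, scale))
        err = heldout_error(cand, train_tasks, heldout_tasks)
        if err < best:
            params, best = cand, err
    return params, best
```

Calling `improve_explainer((0.0, 0.0), train_tasks, heldout_tasks)` with two disjoint lists of floats returns the tuned explainer parameters and their held-out error. Note that in this toy the lowest-error explainer is simply one whose messages reproduce the original model's outputs, so the sketch only shows where the signal comes from; it does not show that the signal favors explanation over copying.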
  - TekhneMakre 13 Jun 2022 5:22 UTC
    2 points
    (I didn’t understand this on one read, so I’ll wait for the post to see if I have further comments. I didn’t understand the analogy / extrapolation drawn in 2., and I didn’t understand what scheme is happening in 3.; maybe being a little more precise and explicit about the setup would help.)