[2104.07143v1] An Interpretability Illusion for BERT (arxiv.org) suggests a more complicated picture, wherein many neurons give the impression that they’re encoding coherent concepts, but then seem to encode completely different concepts when tested on a different dataset. The two papers are certainly not directly contradictory, but Figure 2 of the illusion paper suggests the opposite of what Figure 5 of the Knowledge Neurons paper suggests. On the other hand, the illusion paper mentions that they found tentative evidence for the existence of global concept directions, and perhaps all knowledge neurons are such global concept directions.
Ordered from most to least plausible, possible explanations for this apparent discrepancy include:
Knowledge neurons are more specialized than the average neuron (knowledge neurons are ‘global’)
Dataset choice matters. In particular, Pararel sentences isolate relations in a way that other datasets don’t, helping to identify specialized neurons
Attribution method matters
Layer choice matters (the illusion paper mentions that quick looks at layers 2 and 7 showed similar results; the Knowledge Neurons paper motivates its layer choice by analogy to key-value pairs)
Not sure what the best way to formalize this intuition is, but here’s an idea. (To isolate this learner-agnostic/specific axis from the problem of defining explanation, let me assume that we have some metric for quantifying explanation quality, call it ‘R’ which is a function from <Model, learner, explanation> triples to real values.)
Define learner-agnostic explanation as optimizing for aggregate R across some distribution of learners—finding the one optimal explanation across this distribution. Learner-specific explanation optimizes for R taking the learner as an input—finding multiple optimal explanations, one for each learner.
The aggregation function in the learner-agnostic case could be the mean, or it could be a minimax function. The intuition behind the minimax case is that it formalizes the task of coming up with the most accessible explanation possible.
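To make the distinction concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the scores are made-up stand-ins for R, the learner and explanation names are hypothetical, and the model argument is dropped because a single fixed model is assumed. The worst-case variant is implemented as maximizing the minimum score over learners, which is one reading of the minimax idea above.

```python
from statistics import mean

# Hypothetical toy scores standing in for R(model, learner, explanation).
# The model argument is omitted because one fixed model is assumed.
SCORES = {
    ("novice", "simple"): 0.8,
    ("novice", "technical"): 0.1,
    ("novice", "mixed"): 0.5,
    ("expert", "simple"): 0.4,
    ("expert", "technical"): 1.0,
    ("expert", "mixed"): 0.6,
}


def R(learner, explanation):
    """Look up the (made-up) explanation quality for this learner."""
    return SCORES[(learner, explanation)]


def learner_agnostic(learners, explanations, aggregate=mean):
    """Return the single explanation with the best aggregate R over all learners."""
    return max(explanations, key=lambda e: aggregate([R(l, e) for l in learners]))


def learner_specific(learners, explanations):
    """Return a separate best explanation for each learner."""
    return {l: max(explanations, key=lambda e: R(l, e)) for l in learners}


learners = ["novice", "expert"]
explanations = ["simple", "technical", "mixed"]

best_on_average = learner_agnostic(learners, explanations, aggregate=mean)
best_worst_case = learner_agnostic(learners, explanations, aggregate=min)
per_learner = learner_specific(learners, explanations)
```

With these toy numbers the mean-aggregated choice, the worst-case choice, and the per-learner choices all differ, which is the point of separating the two notions: the aggregation function alone can change which single explanation counts as optimal.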
Things like influence functions, input-sensitivity methods, and automated concept discovery are all learner-agnostic. On the other hand, probing methods (e.g. as used in NLP) could maybe be called learner-specific. The variant of influence functions I suggested above is learner-specific.
In general, it seems to me that as the models get more and more complex, we’ll probably need explanations to be more learner-specific to achieve reasonable performance. Though perhaps learner-agnostic methods will suffice for answering general questions like ‘Is my model optimizing for a mesa-objective’?
One axis along which I’d like clarification is whether you want a form of explanation which is learner-agnostic or learner-specific. It seems to me that traditional transparency/interpretability tools try to be learner-agnostic, but on the other hand the most efficient way to explain makes use of the learner’s pre-existing knowledge, inductive biases, etc.
In the learner-agnostic case, I think it will be approximately impossible to succeed at this challenge. In the learner-specific case, I think it will require something more than an interpretability method. This latter task will benefit from better and better models of human learning; in the limit I imagine something like a direct brain neuralink should do the trick...
On the learner-specific side, it seems to me Nisan is right when he said ‘The question is if we can compress the bot’s knowledge into, say, a 1-year training program for professionals.’ To that end, it seems like a relevant method could be an improved version of influence functions. Something like: find the point in the training phase at which the Go agent learned to make a better move than the pro, and highlight the games (/moves) which taught it the improved play.
Great post! I am very curious about how people are interpreting Q10 and Q11, and what their models are. What are prototypical examples of ‘insights on a similar level to deep learning’?
Here’s a break-down of examples of things that come to my mind:
Historical DL-level advances:
the development of RL (Q-learning algorithm, etc.)
the original formulation of a single neuron, i.e. affine transformation + non-linearity
Future possible DL-level:
a successor to back-prop (e.g. the way biological neurons learn)
a successor to the Q-learning family (e.g. neatly generalizing and extending ‘intrinsic motivation’ hacks)
full brain simulation
an alternative to the affine+activation recipe
Below DL-level major advances:
an elegant solution to learn from cross-modal inputs in a self-supervised fashion (babies somehow do it)
a breakthrough in active learning
a generalizable solution to learning disentangled and compositional representations
a solution to adversarial examples
breakthroughs in neural architecture search
a breakthrough in neural Turing machine-type research
I’d also like to know how people’s thinking fits in with my taxonomy: Are people who leaned yes on Q11 basing their reasoning on the inadequacy of the ‘below DL-level advances’ list, or perhaps on the necessity of the ‘DL-level advances’ list? Or perhaps people interpreted those questions completely differently, and don’t agree with my dividing lines?
The above estimate was mistaken, since I had misread ‘I then compute the fraction #answers(Yes, Yes, Yes) / #answers(Yes, *, *)’ as ‘I then compute the fraction #answers(Yes, Yes, Yes) / #answers(Yes, Yes, *)’.
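To show how much the choice of denominator can matter, here is a minimal sketch in Python. The answer triples are entirely made up for illustration and are not survey data; `count` is a hypothetical helper in which "*" matches any answer.

```python
# Hypothetical answer triples (Q1, Q2, Q3); made up for illustration only.
answers = [
    ("Yes", "Yes", "Yes"),
    ("Yes", "Yes", "Yes"),
    ("Yes", "Yes", "No"),
    ("Yes", "No", "No"),
    ("Yes", "No", "No"),
    ("No", "Yes", "Yes"),
]


def count(pattern):
    """Count answer triples matching the pattern, where '*' matches anything."""
    return sum(
        all(p == "*" or p == a for p, a in zip(pattern, triple))
        for triple in answers
    )


numerator = count(("Yes", "Yes", "Yes"))

# The fraction as actually specified: condition only on the first answer.
intended_fraction = numerator / count(("Yes", "*", "*"))

# The fraction as I misread it: condition on the first two answers.
misread_fraction = numerator / count(("Yes", "Yes", "*"))
```

The misread denominator is never larger than the intended one, so the misreading can only push the estimated fraction up, never down.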
I agree with Ethan’s recent comment that experience with RL matters a lot, so a lot comes down to how the ‘Is X’s work related to AGI?’ criterion is cashed out. On some readings of this, many NLP researchers do not count; on other readings they do. I’d say my previous prediction was a decent, if slight, over-estimate for the scenario in which ‘related to AGI’ is interpreted narrowly and many NLP researchers are ruled out.
A second major confounder is whether prominent AI researchers are far more likely to have been asked for their opinion on AI safety, in which case they have some impetus to go and read up on the issue.
To cash some of these concerns out into probabilities:
75% that Rohin takes a broad interpretation of AGI which includes e.g. GPT-team, NAS research etc.
33% estimated (Yes,Yes,Yes) by assuming prominent researchers are 2x as likely to have read up on AI safety.
25% downweighted from 33% taking into account industry being less concerned.
Assuming that we’re at ~33% now, 50% doesn’t seem too far out of reach, so my estimates for the following decades are based on the same concerns I listed in my above comment, framed with the 33% in mind.
Updated personal distribution: elicited
Updated Rohin’s posterior: elicited
(Post-competition footnote: it seems to me that over short time horizons we should have a more-or-less geometric distribution. Think of the more-or-less independent per-year chance that a NeurIPS keynote features AI safety, or that the YouTube recommender algorithm goes bonkers for a bit. It seems strange to me that some other people’s distributions over the next 10-15 years, if not longer, do not look geometric.)
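To spell out what a geometric distribution implies here, a minimal sketch in Python. The per-year probability `p` is a hypothetical placeholder, not an estimate I am defending; the only assumption carried over from the footnote is that each year offers a roughly independent chance of a triggering event.

```python
def prob_first_event_in_year(p, k):
    """Geometric pmf: no event in the first k-1 years, then an event in year k."""
    return (1 - p) ** (k - 1) * p


def prob_event_within(p, n):
    """Geometric cdf: at least one event during the first n years."""
    return 1 - (1 - p) ** n


p = 0.05  # hypothetical per-year chance of a triggering event

# Cumulative probability at a few horizons (in years).
horizons = (1, 5, 10, 15)
cumulative = [prob_event_within(p, n) for n in horizons]

# Per-year probability mass over the first 15 years.
yearly_mass = [prob_first_event_in_year(p, k) for k in range(1, 16)]
```

Under this model the per-year probability mass is highest in the first year and decays by a constant factor of (1 - p) each year after, which is the shape I would expect a short-horizon forecast to have.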
My old prediction for when the fraction will be >= 0.5: elicited
My old prediction for Rohin’s posterior: elicited
I went through the top 20 list of most cited AI researchers on Google Scholar (thanks to Amanda for linking), and estimated that roughly 9 of them may qualify under Rohin’s criterion. Of those 9, my guess was that 7⁄9 would answer ‘Yes’ on Rohin’s question 3.
My sampling process was certainly biased. For one, AI researchers are likely to be more safety conscious than industry experts. My estimate also involved considerable guesswork, so I down-weighted the estimated 7⁄9 to a 65% chance that the >=0.5 threshold will be met within the first couple years. Given the extreme difference between my distribution and the others posted, I guess there’s a 1⁄3 chance that my estimate based on the top 20 sampling will carry significant weight in Rohin’s posterior.
The justification for the rest of my distribution is similar to what others have said here and elsewhere about AI safety. My AGI timeline is roughly in line with the Metaculus estimate here. Before the advent of AGI, a number of eventualities are possible: perhaps a warning shot will occur, perhaps theoretical consensus will emerge, perhaps industry researchers will be oblivious to safety concerns because of a principal-agent nature to the problem, perhaps AGI will be invented before safety is worked out, etc.
Edit: One could certainly do a better job of estimating where the sample population of researchers currently stands by finding a less biased population. Maybe people interviewed by Lex Fridman? That might be a decent proxy for AGI-research-fame.