Knowledge Neurons in Pretrained Transformers


This is a link post for the Dai et al. paper “Knowledge Neurons in Pretrained Transformers,” which was published on arXiv last month. I think this paper is probably the most exciting machine learning paper I’ve read so far this year, and I’d highly recommend others check it out as well. Edit: Maybe not; I think Paul’s skeptical take here is quite reasonable.

To start with, here are some of the basic things that the paper demonstrates:

  • BERT has specific neurons, which the authors call “knowledge neurons,” in its feed-forward layers that store relational facts (e.g. “the capital of Azerbaijan is Baku”), such that controlling knowledge neuron activations up-weights/down-weights the correct answer in relational knowledge prompts (e.g. “Baku” in “the capital of Azerbaijan is <mask>”) even when the syntax of the prompt is changed, and the prompts that most activate the knowledge neuron all contain the relevant relational fact.

  • Knowledge neurons can reliably be identified via a well-justified integrated gradients attribution method (see also “Self-Attention Attribution”).

  • In general, the feed-forward layers of transformer models can be thought of as key-value stores that memorize relevant information, sometimes semantic and sometimes syntactic (see also “Transformer Feed-Forward Layers Are Key-Value Memories”), such that each knowledge neuron is composed of a “key” (in the first layer, prior to the activation function) and a “value” (in the second layer, after the activation function). A rough sketch of this structure and of the attribution method above follows this list.
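
To make the attribution step a bit more concrete, here is a rough sketch (my own, not the authors’ code) of integrated-gradients attribution applied to the post-activation “knowledge neuron” activations of a single BERT feed-forward layer. The model name, prompt, answer token, layer index, and number of integration steps are all illustrative choices, and the paper’s actual implementation differs in details (e.g. it aggregates scores over many paraphrased prompts and thresholds them):

```python
# Rough sketch of integrated-gradients attribution over one FF layer's
# post-activation ("knowledge neuron") activations at the [MASK] position.
# Everything below (model, prompt, layer index, step count) is illustrative.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")
model.eval()

prompt = "The capital of Azerbaijan is [MASK]."
answer_id = tokenizer.convert_tokens_to_ids("Baku")  # assumes "Baku" is a single vocab token
inputs = tokenizer(prompt, return_tensors="pt")
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()

layer_idx = 9  # which feed-forward layer to inspect (arbitrary choice here)
ffn = model.bert.encoder.layer[layer_idx].intermediate  # first FF sublayer: the "keys"

# Record the observed post-GELU activations at the [MASK] position.
acts = {}
hook = ffn.register_forward_hook(lambda m, i, o: acts.update(hidden=o.detach()))
with torch.no_grad():
    model(**inputs)
hook.remove()
observed = acts["hidden"][0, mask_pos]  # shape: (intermediate_size,)

# Riemann-sum approximation of integrated gradients along the scaling path
# alpha * observed, for alpha from ~0 to 1.
steps = 20
total_grad = torch.zeros_like(observed)
for alpha in torch.linspace(1.0 / steps, 1.0, steps):
    scaled = (alpha * observed).requires_grad_(True)

    def replace_hook(module, inp, out, scaled=scaled):
        out = out.clone()
        out[0, mask_pos] = scaled  # substitute the scaled activations
        return out

    hook = ffn.register_forward_hook(replace_hook)
    logits = model(**inputs).logits
    hook.remove()
    prob = torch.softmax(logits[0, mask_pos], dim=-1)[answer_id]
    (grad,) = torch.autograd.grad(prob, scaled)
    total_grad += grad

attribution = observed * total_grad / steps  # per-neuron attribution scores
top_neurons = attribution.topk(5).indices
print(f"candidate knowledge neurons in layer {layer_idx}: {top_neurons.tolist()}")
```

In the paper, roughly speaking, neurons whose attribution exceeds a threshold across most paraphrases of a fact are then kept as that fact’s knowledge neurons.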

However, the paper’s key results, at least as I see them, are the following:

  • Taking knowledge neurons that encode “the <relation> of <head> is <tail>” and literally just adding embed(<tail'>) − embed(<tail>) to the value neurons (where embed(<tail>) and embed(<tail'>) are just the embeddings of <tail> and <tail'>) actually changes the knowledge encoded in the network such that it now responds to “the <relation> of <head> is <mask>” (and other semantically equivalent prompts) with <tail'> instead of <tail>.

  • For a given relation (e.g. “place of birth”), if all knowledge neurons encoding that relation (which ends up being a relatively small number, e.g. 5–30 neurons) have their value neurons effectively erased (zeroed out), the model loses the ability to predict a large fraction of the relational knowledge involving that relation (roughly 40–60% of it). A rough sketch of this erasure, and of the editing operation above, follows this list.
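
Here is a similarly rough sketch of what those two operations amount to, reusing the model, tokenizer, layer_idx, and top_neurons names from the sketch above; the indices, answer tokens, and scale factor are illustrative, and the exact update the authors apply may differ in its details:

```python
# Rough sketch of the two editing operations on the "value" slots behind
# knowledge neurons: zeroing a slot to erase a fact, and shifting it by
# embed(new answer) - embed(old answer) to rewrite the fact.
import torch

def value_slot(model, layer_idx, neuron_idx):
    # Column i of the second FF matrix: the vector written into the residual
    # stream whenever intermediate neuron i fires.
    return model.bert.encoder.layer[layer_idx].output.dense.weight[:, neuron_idx]

@torch.no_grad()
def erase_fact(model, layer_idx, neuron_idx):
    value_slot(model, layer_idx, neuron_idx).zero_()

@torch.no_grad()
def update_fact(model, tokenizer, layer_idx, neuron_idx, old_answer, new_answer, scale=1.0):
    emb = model.bert.embeddings.word_embeddings.weight
    old_id = tokenizer.convert_tokens_to_ids(old_answer)  # assumes single-token answers
    new_id = tokenizer.convert_tokens_to_ids(new_answer)
    value_slot(model, layer_idx, neuron_idx).add_(scale * (emb[new_id] - emb[old_id]))

# e.g. try to make the model answer "London" instead of "Baku" for the prompt above:
# update_fact(model, tokenizer, layer_idx=9, neuron_idx=int(top_neurons[0]),
#             old_answer="Baku", new_answer="London")
```

Note that both operations only touch a handful of columns of a single weight matrix, which is part of what makes the result feel so surgical.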

I think the first of these two results in particular is pretty mind-blowing, in that it demonstrates an extremely simple and straightforward procedure for directly modifying the learned knowledge of transformer-based language models. That being said, it’s the second result that probably has the most concrete safety applications (if it can actually be scaled up to remove all the relevant knowledge), since something like that could eventually be used to ensure that a microscope AI isn’t modeling humans, or to ensure that an agent is myopic in the sense that it isn’t modeling the future.

Furthermore, the specific procedure used suggests that transformer-based language models might be a lot less inscrutable than previously thought: if we can really just think about the feed-forward layers as encoding simple key-value knowledge pairs literally in the language of the original embedding layer (as I think is also independently suggested by “interpreting GPT: the logit lens”), that provides an extremely useful and structured picture of how transformer-based language models work internally.
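
As a small illustration of that picture (in the spirit of the logit lens rather than anything from the paper itself), one can project a value slot directly through BERT’s tied output embedding and see which vocabulary tokens it pushes up. This skips the MLM head’s transform layer, so it is only a rough read, and it again reuses the model, tokenizer, and top_neurons names from the earlier sketches:

```python
# Rough logit-lens-style read of a value slot in vocabulary space: which
# tokens does this slot most promote when its neuron fires? (Skips the MLM
# head's transform layer, so this is only an approximation.)
import torch

@torch.no_grad()
def tokens_promoted(model, tokenizer, layer_idx, neuron_idx, k=10):
    value = model.bert.encoder.layer[layer_idx].output.dense.weight[:, neuron_idx]
    vocab_logits = model.cls.predictions.decoder(value)  # decoder is tied to the input embeddings
    return tokenizer.convert_ids_to_tokens(vocab_logits.topk(k).indices.tolist())

# print(tokens_promoted(model, tokenizer, layer_idx=9, neuron_idx=int(top_neurons[0])))
```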