Really Strong Features Found in Residual Stream

[I’m writing this quickly because the results are really strong. I still need to do due diligence & compare to baselines, but it’s really exciting!]

Last post I found 600+ features in MLP layer-2 of Pythia-70M, but ablating a feature’s direction didn’t always have a sensible effect, and neither did its logit lens. Now I’ve found 3k+ features in the residual stream, and the ablation effects are intuitive & strong.

The majority of features found were single words or combined words (e.g. not & ’t) which, when ablated, affected bigram statistics, but I also wanted to showcase the more interesting ones:

German Feature?

Uniform examples:

Logit lens:

Stack Exchange

Ablated Text (w/ red showing that removing that token drops the activation of that feature)

In the Pile dataset, Stack Exchange text is formatted as “Q:\n\n...”, and the fact that removing the “Q” greatly decreases the feature activation is evidence that this feature is indeed detecting the Stack Exchange format.
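
To make the ablated-text check concrete, here’s a minimal sketch of it: delete one token at a time and measure how much the feature’s activation drops. The layer index, prompt, and feature direction below are illustrative placeholders (the real direction would be a learned dictionary direction for that layer):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-70m-deduped"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

LAYER = 2  # assumed residual-stream layer, purely illustrative
feature_dir = torch.randn(model.config.hidden_size)  # placeholder for a learned dictionary direction
feature_dir = feature_dir / feature_dir.norm()

def feature_activation(text: str) -> float:
    """Max dot product between the feature direction and the residual stream."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    resid = out.hidden_states[LAYER][0]  # (seq_len, d_model)
    return (resid @ feature_dir).max().item()

prompt = "Q:\n\nHow do I vertically center a div?"
base = feature_activation(prompt)
ids = tokenizer(prompt, return_tensors="pt")["input_ids"][0]
for i in range(len(ids)):
    # Remove token i from the text and see how much the activation drops.
    ablated_text = tokenizer.decode(torch.cat([ids[:i], ids[i + 1:]]))
    drop = base - feature_activation(ablated_text)
    print(f"removing {tokenizer.decode([ids[i].item()])!r}: activation drops by {drop:.3f}")
```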

Title Case

Ablating Direction (in text):

Here, the log-prob of predicting “ Of” went down by 4 when I ablated this feature direction, and similarly for other Title Case texts.
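
Here’s a minimal sketch of that kind of direction ablation using TransformerLens: project the feature direction out of the residual stream at one layer, then compare the log-prob of “ Of” with and without the ablation. The layer, hook point, prompt, and direction are illustrative placeholders, not the exact setup behind the numbers above:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("pythia-70m")

feature_dir = torch.randn(model.cfg.d_model)  # placeholder for the learned feature direction
feature_dir = feature_dir / feature_dir.norm()

def ablate_direction(resid, hook):
    # Project the feature direction out of every residual-stream position.
    coeff = resid @ feature_dir               # (batch, seq)
    return resid - coeff[..., None] * feature_dir

prompt = "A Brief History"                    # illustrative Title Case text
tokens = model.to_tokens(prompt)
target = model.to_tokens(" Of", prepend_bos=False)[0, 0]

with torch.no_grad():
    clean_logits = model(tokens)
    ablated_logits = model.run_with_hooks(
        tokens, fwd_hooks=[("blocks.2.hook_resid_post", ablate_direction)]
    )

clean_lp = clean_logits[0, -1].log_softmax(-1)[target].item()
ablated_lp = ablated_logits[0, -1].log_softmax(-1)[target].item()
print(f"log-prob of ' Of' as next token: {clean_lp:.2f} -> {ablated_lp:.2f}")
```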

Logit lens:

One for the Kids

Last Name:

Ablated Text:

As expected, removing the first name affects the feature activation.

Logit Lens:

More specifically, this looks like the first token of a last name, which makes sense to follow w/ Jr or Sr, or to complete into last names w/ “hoff” & “worth”. Additionally, the “(@” makes sense as a reference to someone’s social media (e.g. an Instagram handle).

[word] and [word]

Ablated Text:

Logit Lens:

Alike & respectively make a lot of sense in context, but I’m not positive about the rest.

Beginning & End of First Sentence?

On

Logit lens:

“on behalf”, “oncology”, “onyx” (lol), “onshore”, “onlook”, “onstage”, “onblur” (js?), “oneday” (“oneday” is separated as on eday?!?), “onloaded” (js again?), “oncomes” (?)
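
For anyone wanting to reproduce these readouts, here’s a minimal sketch of the logit lens on a residual-stream direction: multiply the direction by the unembedding and list the top tokens. The direction below is a random placeholder rather than an actual learned feature:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-70m-deduped"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

feature_dir = torch.randn(model.config.hidden_size)  # placeholder, not a real learned feature
feature_dir = feature_dir / feature_dir.norm()

with torch.no_grad():
    # Read the direction off with the unembedding matrix
    # (one could also pass it through the final LayerNorm first).
    logits = model.embed_out(feature_dir)  # (vocab_size,)

top = torch.topk(logits, k=10)
print([tokenizer.decode([i]) for i in top.indices.tolist()])
```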

High Level Picture

I could spend 30 minutes (or hours) developing very rigorous hypotheses on these features to really nail down what distribution of inputs activates each one & what effect it really has on the output, but that’s not important.

What’s important is that we have a decomposition of the data that:
1. Faithfully represents the original data (in this case, low reconstruction loss w/ minimal effect on perplexity; see the sketch after this list)

2. Lets us simply specify desired traits in the model (e.g. honesty, non-deception)
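
A minimal sketch of how point 1 can be checked: patch the residual stream at one layer with its reconstruction and compare the language-model loss. The reconstruct function is a stand-in for the learned dictionary/autoencoder, and the layer choice is illustrative:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("pythia-70m")

def reconstruct(resid: torch.Tensor) -> torch.Tensor:
    # Stand-in: in practice, encode the residual stream into sparse feature
    # coefficients and decode back with the learned dictionary directions.
    return resid

def patch_with_reconstruction(resid, hook):
    return reconstruct(resid)

tokens = model.to_tokens("The quick brown fox jumps over the lazy dog.")
with torch.no_grad():
    clean_loss = model(tokens, return_type="loss")
    patched_loss = model.run_with_hooks(
        tokens,
        return_type="loss",
        fwd_hooks=[("blocks.2.hook_resid_post", patch_with_reconstruction)],
    )

print(f"loss: clean {clean_loss.item():.3f} vs reconstructed {patched_loss.item():.3f}")
```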

There is still plenty of work to do to ensure the above two hold (including normal hyperparameter search), and it’d be great to have more people working on this, whether that’s helping with our work or integrating it into your own (I don’t have time to mentor, but I can answer questions & help get you set up).

This work is legit though, and there’s a recent paper that finds similar features in a BERT model.