Real-Time Research Recording: Can a Transformer Re-Derive Positional Info?


New experiment: recording myself in real time as I do mechanistic interpretability research! I try to answer the question of what happens if you train a toy transformer without positional embeddings on the task of "predict the previous token". It turns out that a two-layer model can re-derive positional information! You can watch me do it here, and you can follow along with my code here. This uses EasyTransformer, a transformer mechanistic interpretability library I'm writing, and this was a good excuse to test it out and create a demo!
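For concreteness, here is a minimal sketch of the task setup in plain PyTorch. This is not the EasyTransformer code from the video; the model, hyperparameters, and helper names are all illustrative assumptions. The key points are that the label at each position is simply the previous token, and that no positional embeddings are added, so the only position-dependent signal the model gets is the causal mask.

```python
import torch
import torch.nn as nn

def make_batch(batch_size, seq_len, d_vocab, bos_token=0):
    # Random token sequences; the target at position i is the token at position i - 1.
    tokens = torch.randint(1, d_vocab, (batch_size, seq_len))
    tokens[:, 0] = bos_token  # fixed BOS token, so position 0 has a well-defined target
    targets = tokens.clone()
    targets[:, 1:] = tokens[:, :-1]  # shift right: "predict the previous token"
    return tokens, targets

class NoPosTransformer(nn.Module):
    """Toy two-layer decoder-only transformer with token embeddings but NO positional embeddings."""
    def __init__(self, d_vocab=64, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(d_vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.unembed = nn.Linear(d_model, d_vocab)

    def forward(self, tokens):
        x = self.embed(tokens)  # deliberately no positional embedding added here
        seq_len = tokens.shape[1]
        # Standard causal mask: -inf above the diagonal blocks attention to future positions.
        causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        x = self.blocks(x, mask=causal_mask)
        return self.unembed(x)

model = NoPosTransformer()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
for step in range(2000):
    tokens, targets = make_batch(64, 32, 64)
    logits = model(tokens)
    loss = nn.functional.cross_entropy(logits.flatten(0, 1), targets.flatten())
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The point of the setup: if a model trained this way solves the task at every position, it must have re-derived some internal notion of position, because nothing in its embeddings distinguishes one position from another.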

This is an experiment in recording and publishing myself doing "warts and all" research: figuring out how to train the model and operationalising an experiment (including 15 minutes of debugging loss spikes...), real-time coding and tensor fuckery, and using my go-to toolkit. My hope is to give a flavour of what actual research can look like: how long things actually take, how often things go wrong, what my thought process is and what I'm keeping in my head as I go, what being confused looks like, and how I try to make progress. I'd love to hear whether you found this useful, and whether I should bother making a second half!

Though I don't want to overstate this: it was still a small, self-contained toy question that I chose for being a good example task to record (and I wouldn't have published it if it were TOO much of a mess).