Teaser: Hard-coding Transformer Models

MadHatter12 Dec 2021 22:04 UTC

74 points

Language Models (LLMs), Interpretability (ML & AI)

Transformer models are incredibly powerful for natural language tasks (and they are starting to find uses in many other fields of machine learning). Unfortunately, it is nigh-impossible to interpret what goes on inside them. OR IS IT???

In this post, I am trying to gauge potential community interest in a strand of research that I have been doing in my spare time, on and off, for roughly the past year and a half.

I have found that I can, with a fair amount of effort, hard-code the weights of a transformer model so that it performs very crude versions of linguistic tasks. So far I have achieved English-to-French translation (on a toy corpus of about 150 sentences), text classification (deciding whether a sentence is grammatical, on a toy corpus of a couple hundred sentences), and sentiment analysis (again on a limited corpus). These results are obviously not impressive compared to the state of the art in machine learning, but I am pretty sure that they can all be drastically scaled up with the investment of some time and energy. Unfortunately, I have a fairly demanding day job, and haven’t found the time and energy yet.

All of this is done by inspection (no gradient descent!). The process is a lot like programming, although it is more difficult than programming, at least for me right now. I am fairly certain that better tools and better notation can be developed to make the process easier. It is also almost certainly possible to combine hard-coding with gradient descent to scale these methods up in a slightly less labor-intensive way.
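
To give a flavor of what "writing weights by inspection" can look like, here is a minimal sketch in NumPy. It is an illustration only, not code from the project described in this post. It assumes a made-up four-word vocabulary (`VOCAB`), a made-up maximum length (`MAX_LEN`), and an embedding that simply concatenates a one-hot token vector with a one-hot position vector. Under those assumptions, a single attention head can be hand-written so that every position copies the token immediately before it. The helper names `embed` and `previous_token_head` and the sharpness constant `BETA` are hypothetical.

```python
import numpy as np

VOCAB = ["the", "cat", "sat", "down"]  # hypothetical toy vocabulary
MAX_LEN = 4                            # hypothetical maximum sequence length
V, N = len(VOCAB), MAX_LEN
BETA = 50.0                            # large score so softmax is nearly one-hot


def embed(tokens):
    """Concatenate a one-hot token vector with a one-hot position vector."""
    x = np.zeros((len(tokens), V + N))
    for pos, tok in enumerate(tokens):
        x[pos, VOCAB.index(tok)] = 1.0
        x[pos, V + pos] = 1.0
    return x


# Hand-written weights: nothing here is learned.
W_Q = np.zeros((V + N, N))
W_K = np.zeros((V + N, N))
W_V = np.zeros((V + N, V))
for pos in range(1, N):
    W_Q[V + pos, pos - 1] = BETA  # the query at `pos` asks for position pos - 1
for pos in range(N):
    W_K[V + pos, pos] = 1.0       # the key at `pos` announces its own position
W_V[:V, :V] = np.eye(V)           # the value carries the token identity only


def previous_token_head(tokens):
    """Run one causal attention head and decode each output to a token."""
    x = embed(tokens)
    q, k, v = x @ W_Q, x @ W_K, x @ W_V
    scores = q @ k.T
    n = len(tokens)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # causal mask
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=1, keepdims=True)  # stable softmax
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)
    out = attn @ v
    return [VOCAB[i] for i in out.argmax(axis=1)]


print(previous_token_head(["the", "cat", "sat", "down"]))
```

The first position has nothing earlier to look at, so under the causal mask it attends to itself and returns its own token; every later position returns its predecessor. Each matrix entry has a readable purpose, which is the sense in which the process resembles programming.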

I think that these ideas could prove useful in alignment research—if we understand how a language model works in excruciating detail, it seems drastically more likely that we will be able to reason about and predict various misunderstandings rooted in the ambiguity of language. Given that language is (arguably) a fully general means of interacting with an artificial intelligence, it seems plausible to me that this work is on the critical path to alignment.

evhub 13 Dec 2021 8:05 UTC
19 points
0

“Unfortunately, I have a fairly demanding day job, and haven’t found the time and energy yet.”

Have you considered applying for a grant from the Long-Term Future Fund to buy out your day job so you can spend all your time working on this? As a fund manager for the LTFF, this is definitely the sort of thing we’re often happy to fund, and I think that the research you’re describing sounds pretty exciting.
- habryka 13 Dec 2021 8:25 UTC
  8 points
  0
  Parent
  Yeah, I also thought it was pretty interesting. I only thought about it for a few minutes, but it seems interesting enough to give it a shot, IMO.
- MadHatter 13 Dec 2021 12:41 UTC
  2 points
  0
  Parent
  I have definitely not thought about that before. Feedback from people I have shown this work to has ranged from (literally) “you are a madman” to “that looks cool” (and then never engaging with it).
  - Vivek Hebbar 24 Feb 2022 1:36 UTC
    5 points
    0
    Parent
    Any update on this (applying for funding)?
mtaran 12 Dec 2021 23:46 UTC
9 points
0
Sounds intriguing! You have a GitHub link? :)
- MadHatter 12 Dec 2021 23:59 UTC
  2 points
  0
  Parent
  It’s very, very rough, but: https://github.com/epurdy/hand
  - mtaran 13 Dec 2021 0:48 UTC
    2 points
    0
    Parent
    I’ll make sure to run it when I get to a laptop. But if you ever get a chance to set the distill.pub article up to run on heroku or something, that’ll increase how accessible this is by an order of magnitude.
    - Igor Ostrovsky 13 Dec 2021 20:59 UTC
      4 points
      0
      Parent
      I (not the OP) put it up here for now: https://igor0.github.io/hand/distill/
      I’ll take it down if MadHatter asks me or once there is an official site.
      - MadHatter 13 Dec 2021 21:24 UTC
        2 points
        0
        Parent
        Thanks for throwing it up there!!!
gwern 13 Dec 2021 2:06 UTC
8 points
0
Any relation to RASP?
- gwern 22 Dec 2021 16:54 UTC
  2 points
  0
  Parent
  https://transformer-circuits.pub/2021/framework/index.html
- Kenoubi 13 Dec 2021 17:08 UTC
  2 points
  0
  Parent
  Thank you for sharing this. I know it’s probably not why you posted it, but reading this paper was extremely helpful to me in understanding what Transformers are actually doing in the first place.
- Rudi C 13 Dec 2021 19:13 UTC
  1 point
  0
  Parent
  (Unrelated.) Have you considered putting an RSS feed of your Twitter account in its bio? This way people can follow you without you needing to approve them, and since it’s read-only, your burden won’t increase.
  
  (Not to mention that RSS is a much better medium than Twitter in the first place.)
  - gwern 13 Dec 2021 22:56 UTC
    2 points
    0
    Parent
    I don’t think Twitter allows such RSS feeds.
- MadHatter 13 Dec 2021 2:15 UTC
  1 point
  0
  Parent
  It’s a pretty similar style of work, but I haven’t communicated at all with those authors and I started my work before they published.
  What links here?
  - Vivek Hebbar's comment on Transformer inductive biases & RASP by Vivek Hebbar (24 Feb 2022 1:34 UTC; 1 point)
Jsevillamol 13 Dec 2021 0:04 UTC
7 points
0
I think this is very impressive and that we could learn a lot from this kind of effort.

Can you tell us more about your “training” process and the capabilities you can achieve, with examples?
Rohin Shah 16 Dec 2021 18:37 UTC
3 points
0
Very cool!
A note of caution: when I handcoded weights of a neural network (in my case, to solve a gridworld RL problem), I was able to encode the optimal policy—but the algorithm that was later learned by gradient descent was very different. Partly this was because I only required myself to produce the right action, so I often had the (equivalent of) Q-values for different actions be very very close to each other, whereas the neural network ended up having Q-values that were further apart from each other, which was incentivized by the loss function even though it didn’t make a difference to the optimal policy.
So to the extent you’re trying to learn what a neural net trained by gradient descent would do, I’d recommend that you spend some time looking at the trained neural net to see whether it is using a similar sort of algorithm as the one you’re implementing.
- MadHatter 19 Dec 2021 17:44 UTC
  1 point
  0
  Parent
  Agree with this.
Igor Ostrovsky 13 Dec 2021 20:54 UTC
2 points
0
Building up toy transformer models by hand that work … that’s super interesting, both for interpretability and for education.
I put up the site [here](https://igor0.github.io/hand/distill/) for now. MadHatter, let me know if you want me to take it down.