SERI MATS scholar researching Mechanistic Interpretability of Transformers. Trying to figure out AI Alignment. Final-year PhD student in Astronomy, University of Cambridge. Website: StefanHex.com
StefanHex
How-to Transformer Mechanistic Interpretability—in 50 lines of code or less!
Your language model game(s) are really interesting! I’ve had a couple of ideas while “playing” (such as adding GPT2-small suggestions for the user to choose from, and some tokenization improvements). Are you happy to share the source / tools you used to build this website, or is it not in a state you would be happy to share? Totally fine if not, I just realized that I should ask before considering building something!
Edit for future readers: Managed to do this with Heroku & Flask, then switched to Streamlit. Code here, mostly written by ChatGPT: https://huggingface.co/spaces/StefanHex/simple-trafo-mech-int/tree/main
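For readers wondering what such an app roughly looks like: below is a minimal sketch (my own illustration, not the code in the linked Space) of a Streamlit page that shows GPT2-small’s top next-token suggestions for a prompt. It assumes `streamlit`, `transformers` and `torch` are installed, and is run with `streamlit run app.py`.

```python
# Minimal sketch: a Streamlit app showing GPT2-small's top next-token
# suggestions for a user-provided prompt.
import streamlit as st
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

@st.cache_resource  # load the model once and reuse it across reruns
def load_model():
    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()
    return tokenizer, model

tokenizer, model = load_model()
prompt = st.text_area("Prompt", "The capital of France is")
k = st.slider("Number of suggestions", 1, 20, 5)

if prompt:
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        # Logits for the next token, given the full prompt
        logits = model(**inputs).logits[0, -1]
    probs = logits.softmax(dim=-1)
    top = probs.topk(k)
    for p, idx in zip(top.values.tolist(), top.indices.tolist()):
        st.write(f"{tokenizer.decode([idx])!r} (p = {p:.3f})")
```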
I really appreciated all the observations here and enjoyed reading this post, thank you for writing all this up!
Edit: Found it here! https://github.com/socketteer/loom/ Your setup looks quite useful, with all the extra information—is it available publicly somewhere / would you be happy to share it, or is the tooling not in that state yet? (Totally fine, just thought I’d ask!)
Reinforcement Learning Goal Misgeneralization: Can we guess what kind of goals are selected by default?
Firstly, thank you for writing this post and trying to “poke holes” in the “AGI might doom us all” hypothesis. I like to see this!
How is the belief in doom harming this community?
Actually, I see this point: “believing” in “doom” can often be harmful and is usually useless.
Yes, being aware of the (great) risk is helpful for cases like “someone at Google accidentally builds an AGI” (and then hopefully turns it off since they notice and are scared).
But believing we are doomed anyway is probably not helpful. I like to think along the lines of “condition on us winning”, to paraphrase HPMOR¹. I.e. assume we survive AGI, ask what could have caused us to survive, and work on making those options reality / more likely.
every single plan [...] can go wrong
I think the crux is whether the chance of AGI leading to doom is relatively high; I would say even 0.001% is relatively high, whereas you would say that is low? It’s a similar argument to, say, pandemic preparedness: there is a small chance of a very bad event, and even if that chance is very low, we should still invest substantial resources into reducing the risk.
So maybe we can agree on something like: doom by AGI is a sufficiently high risk that we should spend, say, one-millionth of world GDP (~$80m) on preventing it somehow (AI Safety research, policy, etc.).
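As a quick sanity check on that figure, assuming world GDP of roughly $80 trillion:

$$10^{-6} \times \$80\text{ trillion} = 10^{-6} \times \$8 \times 10^{13} = \$8 \times 10^{7} = \$80\text{ million}$$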
All fractions mentioned above picked arbitrarily.
Suppose, said that last remaining part, suppose we try to condition on the fact that we win this, or at least get out of this alive. If someone TOLD YOU AS A FACT that you had survived, or even won, somehow made everything turn out okay, what would you think had happened -
Transformer Research Questions from Stained Glass Windows
CNN feature visualization in 50 lines of code
Image interpretability seems mostly so easy because humans are already really good
Thank you, this is a good point! I wonder how much of this is humans “doing the hard work” of interpreting the features. It raises the question of whether we will be able to interpret more advanced networks, especially if they evolve features that don’t overlap with the way humans process inputs.
The language model idea sounds cool! I don’t know language models well enough yet but I might come back to this once I get to work on transformers.
I think I found the problem: Omega is unable to predict your action in this scenario, i.e. the assumption “Omega is good at predicting your behaviour” is wrong / impossible / inconsistent.
Consider a day where Omicron (randomly) chose a prime number (Omega knows this). Now an EDT is on their way to the room with the boxes, and Omega has to put a prime or non-prime (composite) number into the box, predicting EDT’s action.
If Omega makes X prime (i.e. coincides) then EDT two-boxes and therefore Omega has failed in predicting.
If Omega makes X non-prime (i.e. numbers don’t coincide) then EDT one-boxes and therefore Omega has failed in predicting.
Edit: To clarify, EDT’s policy is two-box if Omega and Omicron’s numbers coincide, one-box if they don’t.
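To make that case analysis explicit, here is a tiny sketch (my own framing of the argument, not code from the post) that enumerates Omega’s two possible predictions on a day where Omicron picked a prime, and checks whether either is consistent with the action EDT then takes:

```python
# Case analysis for a day on which Omicron has picked a prime number.
# Omega's rule: put a prime (here: a number coinciding with Omicron's) iff
# it predicts one-boxing, otherwise a composite (numbers don't coincide).
# EDT's policy (as stated above): two-box iff the numbers coincide.

def edt_action(numbers_coincide: bool) -> str:
    return "two-box" if numbers_coincide else "one-box"

for omega_prediction in ["one-box", "two-box"]:
    numbers_coincide = (omega_prediction == "one-box")
    action = edt_action(numbers_coincide)
    print(f"Omega predicts {omega_prediction}: numbers coincide = {numbers_coincide}, "
          f"EDT plays {action}, prediction correct = {action == omega_prediction}")

# Both branches print "prediction correct = False": on prime days there is
# no prediction Omega can make that comes out true, which is the contradiction.
```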
This scenario seems impossible, as in contradictory / not self-consistent. I cannot say exactly why it breaks, but at least these two statements seem to be inconsistent:
today they [Omicron] happen to have selected the number X
and
[Omega puts] a prime number in that box iff they predicted you will take only the big box
Both of these statements have implications for X and cannot always both be true. The number cannot both be random and be chosen by Omega/you, can it?
From another angle, the statement
FDT will always see a prime number
demonstrates that something fishy is going on. The “random” number X that Omicron has chosen, which is in the box and seen by FDT, is “always prime”. Then it is not a random number, is it?
Edit: See my reply below; the contradiction is that Omega cannot predict EDT’s behaviour when Omicron chose a prime number. EDT’s decision depends on Omega’s decision, and Omega’s decision depends on EDT’s decision (via the “do the numbers coincide” link). On days where Omicron chooses a prime number, this cyclic dependence leads to a contradiction / Omega cannot predict correctly.
Nice argument! My main caveats are:
* Does training scale linearly? I.e. does it take just twice as much time to get someone to 8 bits (one in 1000) as to 4 bits (top 3% in the world, one in every school class)?
* Can we train everything? How much of, e.g., math skill is genetic? I think there is research on this.
* Skills are probably quite highly correlated, especially skills you want in the same job. What about computer skills / programming and maths skills / science: are they inherently correlated, or is it just that the same people need both? [Edit: See the point made by Gunnar_Zarncke above for a better argument on this.]
That is a very broad description—are you talking about locating Fast Radio Bursts? I would be very surprised if that was easily possible.
Background: Astronomy/Cosmology PhD student
Awesome, updated!