I’m an independently funded AI Alignment Research Engineer focussing on mechanistic interpretability in reinforcement learning. I’m particularly interested in comparing circuits in decision transformers to those generated by other techniques.
Joseph Bloom
Joseph Bloom on choosing AI Alignment over bio, what many aspiring researchers get wrong, and more (interview)
Thanks for writing this up. I really liked this framing when I first read about it but reading this post has helped me reflect more deeply on it.
I’d also like to know your thoughts on whether Chris Olah’s original framing, that anything which advances this ‘present margin of safety research’ is net positive, is the correct response to this uncertainty.
I wouldn’t call it correct or incorrect, only useful in some ways and not others. Whether it’s net positive may depend on whether people use it in cases where it is appropriate/useful.
As an educational resource/communication tool, I think this framing is useful. It’s often useful to collapse complex topics onto a few axes and construct idealised patterns, in this case a difficulty distribution on which we place techniques by the kinds of scenarios where they provide marginal safety. This could be useful for helping people initially orient to existing ideas in the field, in governance, or possibly when making funding decisions.
However, as a tool to reduce fundamental confusion about AI systems, I don’t think it’s very useful. The issue is that many of the current ideas we have in AI alignment are based significantly on pre-formal conjecture that is not grounded in observations of real-world systems (see The Alignment Problem from a Deep Learning Perspective). Before we observe more advanced future systems, we should be highly uncertain about existing ideas. Moreover, it seems like this scale attempts to describe reality via the set of solutions which produce some outcome in it, which seems like an abstraction that is unlikely to be useful. In other words, I think it’s possible that this framing leads to confusion between the map and the territory, where the map is making predictions about tools that are useful in territory which we have yet to observe.
To illustrate how such an axis may be unhelpful if you were trying to think more clearly, consider the equivalent for medicine. Diseases can be divided into classes by how difficult they are to cure, with corresponding research being useful for curing them: cuts/scrapes are self-mending, infections require the corresponding antibiotics/antivirals, and immune disorders and cancers are diverse and therefore span various levels of difficulty amongst their instantiations. It’s not clear to me that biologists/doctors would find much use in conjecture about exactly how hard vs. how likely each disease is, especially in worlds where you lack a fundamental understanding of the related phenomena. Possibly, a closer analogy would be trying to troubleshoot ways evolution can generate highly dangerous species like humans. I think my attitude here leads into more takes about good and bad ways to discuss which research we should prioritise, but I’m not sure how to convey those concisely. Hopefully this is useful.
Hey Robert, great work! My focus isn’t currently on this but I thought I’d mention that these trends might relate to some of the observations in the Finding Neurons in a Haystack paper. https://arxiv.org/abs/2305.01610.
If you haven’t read the paper, the short version is that they used sparse probing to find neurons which linearly encode variables like “is this python code” in a variety of models of varying size.
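For anyone who wants the flavour of the method, here is a minimal sketch (my own, not the paper’s exact setup; the data arrays and regularisation choice are placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch of a sparse probe: fit an L1-regularised classifier from MLP neuron
# activations to a binary context label (e.g. "is this python code"), then see
# which few neurons carry the signal. `activations` and `labels` are placeholders
# for data you would collect from a real model and dataset.
n_samples, n_neurons = 1000, 3072
activations = np.random.randn(n_samples, n_neurons)   # [samples, neurons] for one MLP layer
labels = np.random.randint(0, 2, size=n_samples)      # 1 if the context is python code

probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
probe.fit(activations, labels)

# Neurons with non-zero probe weights are candidate "python code" neurons.
candidate_neurons = np.nonzero(probe.coef_[0])[0]
print(len(candidate_neurons), candidate_neurons[:10])
```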
The specific observation which I believe may be relevant:
“As models increase in size, representation sparsity increases on average, but different features obey different dynamics: some features with dedicated neurons emerge with scale, others split into finer grained features with scale, and many remain unchanged or appear somewhat randomly”
I believe this accords with your observation that “Finding more features finds more high-MCS features, but finds even more low-MCS features”.
Maybe finding ways to directly compare approaches could support further use of either approach.
Also, interesting to hear about using EMD over KL divergence. I hadn’t thought about that!
Thanks Simon, I’m glad you found the app intuitive :)
The RTG is just another token in the input, except that it has an especially strong relationship with the training distribution. It’s heavily predictive in a way other tokens aren’t because it’s derived from a labelled trajectory (it’s the remaining reward in the trajectory after that step).
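For concreteness, here’s a minimal sketch of how RTG tokens are derived from a labelled trajectory (the reward values are made up for illustration):

```python
# Return-to-go (RTG) at step t is the sum of rewards from step t to the end of the trajectory.
rewards = [0.0, 0.0, 1.0, 0.0, 2.0]  # hypothetical per-step rewards from a labelled trajectory

rtgs, running = [], 0.0
for r in reversed(rewards):
    running += r
    rtgs.append(running)
rtgs.reverse()

print(rtgs)  # [3.0, 3.0, 3.0, 2.0, 2.0]
```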
For BabyAI, the idea would be to use an instruction prepended to the trajectory, made up of a limited vocab (see the BabyAI paper for their vocab). I would be pretty partial to throwing out the RTG and training a behavioural clone for a BabyAI model. It seems likely this would be easier to train. Since the goal of these models is to be useful for gaining understanding, I’d like to avoid reusing tokens as that might complicate analysis later on.
A Mechanistic Interpretability Analysis of a GridWorld Agent-Simulator (Part 1 of N)
Sure. Let’s do it at EAG. :)
Really exciting! I added a version of AVEC to my interpretability tool for gridworld agents and am keen to explore it more. I really like that the injection coefficient is a scalar, and this has enabled me to do what I call “an injection coefficient scan”.
The procedure I’m using looks like this (a rough code sketch follows the list):
1. Repeat your input tokens, say, 128 times.
2. Apply the activation vector at 128 different steps between a coefficient of −10 and 10, one per copy of your input tokens, when doing your AVEC forward pass.
3. Decompose the resulting residual stream to whatever granularity you like (use decompose_resid or get_full_resid_decomposition, with or without expanding neurons).
4. Dot product the outputs with your logit direction of choice (I use a logit diff that is meaningful in my task).
5. Plot the resulting attribution vs. injection coefficient per component.
6. If you like, cluster the profiles to show which components learn similar functions of the injection coefficient with respect to your decision.
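Here’s a rough sketch of the scan in code, assuming a TransformerLens-style model; the hook point, activation vector, and logit direction below are placeholders for whatever you actually use:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")        # stand-in for your own model
tokens = model.to_tokens("your prompt here")

n_steps = 128
coefficients = torch.linspace(-10, 10, n_steps)          # one coefficient per batch element
batch_tokens = tokens.repeat(n_steps, 1)                 # repeat the same input n_steps times

HOOK_NAME = "blocks.4.hook_resid_post"                   # wherever you inject the vector
activation_vector = torch.randn(model.cfg.d_model)       # placeholder steering vector

def inject(resid, hook):
    # resid: [batch, pos, d_model]; add coefficient_i * vector to every position of batch element i.
    return resid + coefficients[:, None, None] * activation_vector

model.add_hook(HOOK_NAME, inject)
_, cache = model.run_with_cache(batch_tokens)
model.reset_hooks()

# Decompose the residual stream into per-component contributions.
components, labels = cache.decompose_resid(layer=None, return_labels=True)
# components: [n_components, batch, pos, d_model]

# Attribution of each component to a logit direction at the final position.
logit_diff_direction = torch.randn(model.cfg.d_model)    # e.g. unembed(correct) - unembed(incorrect)
attribution = components[:, :, -1, :] @ logit_diff_direction   # [n_components, n_steps]

# attribution[c] is that component's contribution as a function of the injection
# coefficient; plot or cluster these profiles per component (labels gives their names).
```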
So far, my results seem very interesting and possibly quite useful. It’s possible this method is impractical in LLMs, but I think it might be fine there as well. Will DM some example figures.
I also want to investigate whether using a continuous injection coefficient in activation patching is similarly useful, since it seems like it might be.
I am very excited to see if this makes my analyses easier! Great work!
Sure, I could have phrased myself better and I meant to say “former”, which didn’t help either!
Neither of these is a novel concept, in that existing investigations have described features of this nature. Good 1, aka consumer goods: useful for the unembed (and may or may not be useful for other modular circuits inside the network). That the Logit Lens gets better over the course of the layers suggests the residual stream contains these kinds of features, and more so as we move up the layers.
Good 2, aka capital goods: useful primarily for other circuits. A good example is the kind of writing to subspaces in the IOI circuit by duplicate token heads. A signal like “‘John’ appeared twice”, written as markup onto a token’s vector in the residual stream, doesn’t in itself tell you that “Jane” is the next token, but it is useful to another head which is going to propose the next token via another function.
Alternatively, in Neel’s modular arithmetic work, calculating waves of terms like sin(wx) and cos(wx), which are only useful when you have the rest of the mechanism to take the argmax over z of cos(w(x+y))cos(wz) + sin(w(x+y))sin(wz) = cos(w(x+y−z)). I would have guessed that features in the first category emerge earlier and those in the second later, since how would you get gradients to things that aren’t useful yet? However, the existence of clear examples of “internal signals” seems somewhat indisputable.
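Spelling that out slightly (my paraphrase of Neel’s write-up, so details may be off): the logit for a candidate answer $z$ combines the waves as

$$\cos(w(x+y))\cos(wz) + \sin(w(x+y))\sin(wz) = \cos(w(x+y-z)),$$

which is maximised over $z$ exactly when $z \equiv x + y \pmod{p}$, so the individual $\sin(wx)$ and $\cos(wx)$ features only pay off once the later mechanism that forms and sums these products exists.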
It seems plausible that there are lots of features that sit in both of these categories, of course, so if it’s useful you could define them to be mutually exclusive and add a third category for features that are both.
I realise that my saying “Maybe this is the only kind of good in which case transformers would be ‘fundamentally interpretable’ in some sense. All intermediate signals could be interpreted as final products.” was way too extreme. What I mean is that maybe category two is less common than we think.
To relate this to AVEC (though I don’t have a detailed understanding of how you are currently implementing it): if you find the vector (I assume a residual stream vector) itself has a high dot product with specific unembeddings, then that says you’re looking at something in category 1. However, if introducing it into the model earlier has a very different effect to introducing it directly before the unembedding, then that would suggest it’s also being used by other modular circuits in the model.
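A minimal sketch of the first check, assuming a TransformerLens-style model (the vector here is a random placeholder for your AVEC vector):

```python
import torch
from transformer_lens import HookedTransformer

# Project a candidate residual-stream vector through the unembedding and inspect
# the tokens it most promotes.
model = HookedTransformer.from_pretrained("gpt2")
vec = torch.randn(model.cfg.d_model)            # placeholder residual-stream vector

logit_contribution = vec @ model.W_U            # [d_vocab]
top = torch.topk(logit_contribution, k=10)
print([model.tokenizer.decode([i]) for i in top.indices.tolist()])
# If these top tokens are interpretable, the vector looks like a "consumer good";
# comparing the effect of injecting it early vs. just before the unembed would then
# test whether it is also consumed by other circuits (a "capital good").
```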
I think this kind of distinction is only one part of what I was trying to get at with circuit economics but hopefully that’s clearer! Sorry for the long explanation and initial confusion.
We would love to see more ideas & hypotheses on why the model might be doing this, as well as attempts to test this! We mainly wrote-up this post because both Alex and I independently noticed this and weren’t aware of this previously, so we wanted to make a reference post.
Happy to provide! I think I’m pretty interested in testing this/working on this in the future. Currently a bit tied up but I think (as Alex hints at) there could be some big implications for interpretability here.
TLDR: Documenting existing circuits is good, but explaining what relationship circuits have to each other within the model, such as by understanding how the model allocates limited resources like the residual stream and weights between different learnable circuits, seems important. The general topic I think we are getting at is something like “circuit economics”. The thing I’m trying to gesture at is that while circuits might deliver value in distinct ways (such as reducing loss on different inputs, activating on distinct patterns), they share capacity in weights (see Polysemanticity and Capacity in Neural Networks) and, I guess, “bandwidth” (getting penalized for interfering signals in activations). There are a few reasons why this feels like economics to me, which include: scarce resources, value chains (features composed of other features), and competition (if a circuit is predicting something well with one heuristic, maybe there will be smaller gradient updates to encourage another circuit learning a different heuristic to emerge).
So to tie this back to your post and Alex’s comment (“which seems like it would cut away exponentially many virtual heads? That would be awfully convenient for interpretability.”): I think that what interpretability has recently dealt with in elucidating specific circuits is something like “micro-interpretability”, akin to microeconomics. However, this post seems to show a larger trend, i.e. “macro-interpretability”, which would possibly affect which such circuits are possible/likely to be in the final model.
I’ll elaborate briefly on the off chance this seems like it might be a useful analogy/framing to motivate further work.
Studying the Capacity/Loss Reduction distribution in Time: It seems like during transformer training there may be an effect not unlike inflation. Circuits which delivered enough value to justify their capacity use early in training may fall below the capacity/loss-reduction cut-off later. Maybe various techniques which enable us to train more robust models work because they make these transitions easier.
Studying the Capacity/Loss Reduction distribution in Layer: Moreover, it seems plausible that the distribution of “usefulness” across circuits in different layers of the network may be far from uniform. Circuits later in the network have far more refined inputs, which make them better at reducing loss. Residual stream norm growth seems like a “macro” effect that shows the model “knows” that later layers are more important.
Studying the Capacity/Loss Reduction distribution in Layer and Time: Combining the above, I’d predict that neural networks start by having valuable circuits in many layers, but then transition to maintaining circuits earlier in the network which are valuable to many downstream circuits, and circuits later in the network which make the best use of earlier circuits.
More generally, “circuit economics” as a framing seems to suggest that there are different types of “goods” in the transformer economy: those which directly lead to better predictions, and those which are useful for making better predictions when integrated with other features. The success of the Logit Lens seems to suggest that the latter category increases over the course of the layers. Maybe this is the only kind of good, in which case transformers would be “fundamentally interpretable” in some sense: all intermediate signals could be interpreted as final products. More likely, I think, is that later in training there are ways to reinforce the creation of more internal goods (in economics, goods which are used to make other goods are called capital goods). The value of such goods would be mediated via later circuits. This would also lead to the “deletion-by-magnitude theory” as a way of removing internal goods.
To bring this back to language already in the field, see Neel’s discussion here. A modular circuit is distinct from an end-to-end circuit in that it starts and ends in intermediate activations. Modular circuits may be composable. I propose that the outputs of such circuits are “capital goods”. If we think about the “circuit economy”, it then seems totally reasonable that multiple suppliers might generate equivalent capital goods and have a many-to-many relationship with multiple different circuits near the end voting on logits.
This is a very speculative “theory”, if you can call it that, but I guess I feel this would be “big if true”. I also make no claims about this being super original or actually that useful in practice, but it does feel intuition-generating. I think this is totally the kind of thing people might have worked on sooner, but it’s likely been historically hard to measure the kinds of things that might be relevant. What your post shows is that, between the Transformer Circuits framework and TransformerLens, we are able to take a bunch of interesting measurements relatively quickly, which may provide more traction on this than was previously possible.
Second pass through this post which solidly nerd-sniped me!
A quick summary of my understanding of the post (intentionally being very reductive, though I understand the post may make more subtle points):
There appears to be exponential growth in the norm of the residual stream in a range of models. Why is this the case?
You consider two hypotheses:
1. That the parameters in the Attention and/or MLP weights increase later in the network.
2. That there is some monkey business with the layer norm sneaking in a single extra feature.
In terms of evidence, you found (a rough sketch of how one might take such measurements follows the list):
Evidence for theory one in W_OV Frobenius norms increasing approximately exponentially over layers.
Evidence for theory one in the norm of the MLP output to the residual stream increasing (it is harder to directly measure the norm of the MLP itself due to non-linearities).
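For what it’s worth, here’s a rough sketch (my own, not the post’s code) of how one might take both kinds of measurement with TransformerLens; the model and prompt are placeholders:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The quick brown fox jumps over the lazy dog")
_, cache = model.run_with_cache(tokens)

# 1. Residual stream norm after each block, averaged over positions.
resid_norms = [
    cache["resid_post", layer].norm(dim=-1).mean().item()
    for layer in range(model.cfg.n_layers)
]

# 2a. Frobenius norm of each layer's OV circuit (W_V @ W_O, taken across all heads).
w_ov_norms = [
    (model.W_V[layer] @ model.W_O[layer]).norm().item()
    for layer in range(model.cfg.n_layers)
]

# 2b. Norm of the MLP output written to the residual stream, per layer.
mlp_out_norms = [
    cache["mlp_out", layer].norm(dim=-1).mean().item()
    for layer in range(model.cfg.n_layers)
]

print(resid_norms, w_ov_norms, mlp_out_norms)
```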
Your favoured explanation is: “We finally note our current favored explanation: Due to LayerNorm, it’s hard to cancel out existing residual stream features, but easy to overshadow existing features by just making new features 4.5% larger.”
My thoughts:
My general take is that the explanation about cancelling out features being harder than amplifying new features feels somewhat disconnected from the high-level characterisation of weights/norms which makes up most of the post. It feels like there is a question of how and a question of why.
Given these models are highly optimized by SGD, it seems like the conclusion must be that the residual stream norm is growing because this is useful. This leads to the argument that it is useful because the residual stream is a limited resource with limited capacity, making the model want to delete information from it, and increasing the norm of later contributions to the residual stream effectively achieves this by drowning out other features.
Moreover, if the mechanism by which we achieve larger residual stream contributions in later components is by having larger weights (which is penalized by weight decay), then we should conclude that a residual stream with a large norm is worthwhile enough that the model would rather do this than have smaller weights (which you note).
I still don’t feel like I know why, though. “Later layers have more information and are therefore wiser”, or something like that, could be part of it.
I’d also really like to know the implications of this. Does this affect the expressivity of the model in a meaningful way? Does it affect the relative value of representing a feature in any given part of the model? Does this create an incentive to “relocate” circuits during training or learn generic “amplification” functions? These are all ill-defined questions to some extent but maybe there are formulations of them that are better defined which have implications for MI related alignment work.
Thanks for writing this up! Looking forward to subsequent post/details :)
PS: Is there a non-trivial relationship between this post and tuned lens / logit lens? https://arxiv.org/pdf/2303.08112.pdf Seems possible.
Thanks for the feedback. On a second reading of this post and the paper I linked and having read the paper you linked, my thoughts have developed significantly. A few points I’ll make here before making a separate comment:
- The post I shared originally does indeed focus on dynamics but may have relevant general concepts in discussing the relationship between saturation and expressivity. However, it focuses on the QK circuit which is less relevant here.
- My gut feeling is that true explanations of related phenomena should have non-trivial relationships. If you had a good explanation for why norms of parameters grow during training, it should relate to why norms of parameters differ across the model. However, this is a high-level argument, and the content of your post does of course directly address a different phenomenon (residual stream norms). If this paper had studied the training dynamics of the residual stream norm, I think it would be very relevant.
I really liked this post and would like to engage with it more later. It could be very useful!
However, I also think it would be good for you to add a section reviewing previous academic work on this topic (e.g. https://aclanthology.org/2021.emnlp-main.133.pdf). This seems very relevant and may not be the only academic work on the topic (I did not search for long). Curious to hear what you find!
Thanks Richard for this post and prior advice!
I was planning to make a post at some point with some advice that’s closely related to this post, but I will share it here as a preview. Take note that I don’t yet have strong evidence that my work is good or has mattered (I was going to write a full post once I had more evidence for that). I think Richard’s advice above is really good and I’ll try to take some of the ideas more on board with my own work. Last year I quit my job and upskilled for 6 months, and now I’m doing independent research which might turn out to be valuable. Regardless of its value, I’ve learnt a lot and it’s created many opportunities for me. I went to EAG and Richard’s talk there, and to a conversation later in a group where he was talking about this mentorship constraint. This left a strong impression on me, leading me to take some degree of pride in my attempts to be independent and not rely as strongly on individual mentorship. However, there are a bunch of caveats/perspectives I currently have which relate to this.
All of these relate to empirical alignment research, not governance or other forms of research. I’m mostly focussed on providing advice for how to be more productive independently of other people, but that shouldn’t necessarily be your preference, and I suspect people are more productive at orgs / in groups.
So a bunch of ideas on the topic:
Why the focus on independent research?
I think it’s really weird how we have this thing in the alignment community, and I just want to comment on that first. The idea that people can just go off on their own and be productive is, I think, kinda uncommon.
This community values agency. In practice, agency is the ability to make decisions for yourself about what you want and how to achieve it. Getting good at having agency both makes good researchers and good research engineers. HPMOR helped me understand agency better.
I have no first-hand knowledge of the inside of orgs like DeepMind or Anthropic, but I suspect people with agency are generally considered better hires. It’s not like orgs would say “we could hire this person but we want them to do what they’re told, so let’s hire someone with little evidence of working independently”. Rather, my guess is they select for people who are capable of self-directed work and who grow spontaneously as a result of attempting hard things and learning.
Getting ready to contribute:
There are a variety of ways to upskill without doing something like a PhD (as Richard says above): programs like ARENA, SERI MATS, SPAR etc. My sense is that once people realise that working without feedback is hard, they will gravitate strongly toward more empirical research areas, especially those that can be done at small scale (aka MI), for which there are existing tools (aka MI), and which have examples of investigations with reasoned paths to impact (aka MI). However, there are likely other empirical areas which provide feedback and are doable that may or may not have these properties, and searching for more seems like a good idea to me.
Get good (especially if you’re fresh out of college). Struggling with implementing things, understanding stuff, and identifying problems can all be alleviated by working on your skills. There’s lots of open-source code which shows good patterns, and there are also lots of people who have good technical skills but aren’t necessarily the research mentors we are constrained by, who you can engage with. Talk to people who are good engineers and find out how they operate. It’ll be stuff like having a good IDE and testing your code.
Start slow. Contributing to open source projects such as TransformerLens is good. I’ve been helping out with it and it seems like a good way for lots of people to dip their toe in.
Doing research without a mentor is very hard for many obvious reasons. Things you can do to make it easier:
While talking to people, such as at EAGs, can be helpful, my sense is most good advice already exists on the forum. I recommend rereading such advice periodically; I predict you will grok why people make their suggestions much better when you are stuck in your own research and facing challenges than you did before.
Focus on fast feedback cycles. Try to avoid situations where you don’t know if something is working for a long time. This is different to whether you know if it’s valuable or not.
Be prepared to drop things or change your path, but don’t abandon work because it’s hard. It feels like a special kind of wisdom/insight to make these calls and I think you need to work hard at trying to get better at this over time.
Have good tooling but don’t let building the tooling take over.
Allow yourself to focus. There is a time to work out why you are doing what you are doing and there are other times you just need to do the work.
Study! Engaging with fundamental topics like linear algebra or deep learning theory is hugely important. Without colleagues or mentors, not knowing any given thing that might be relevant is a meaningful constraint on your output. This is tricky because there’s a lot to study; I think engage with the basics and be consistent. The Mathematics for Machine Learning textbook is great. Goodfellow’s Deep Learning textbook is also recommended.
Read related literature. Like with more basic knowledge, lack of knowledge of relevant literature can cause you to waste time/effort. I have a spreadsheet which describes all the models that are kinda similar to mine and how they were trained and what was done with them.
Find less correlated ideas/takes: Stephen Casper’s engineering interpretability sequence is a good example of the kind of thing people doing independent work should read. It shakes you out of the “everything we do here makes sense and is obvious” perspective, which is extra easy to fall into when you work on your own. There might be equivalent posts in other areas.
Possibly the quirkiest thing I do these days is roleplay characters in my head, “the engineer”, “the scientist”, “the manager” and “the outsider”, who help me balance different priorities when making decisions about my work. I find this fun and useful, and since I literally write meeting notes, GPT-4 can stand in for each of them, which is pretty cool and useful for generating ideas. The “outsider”, a less obvious team member, represents someone who doesn’t privilege the project or existing decisions. This helps me try to channel a helpful adversarial perspective (see previous point).
I hope this is useful for people!
Some thoughts (I don’t have any background in related stuff, but it seemed interesting).
I think it would be interesting to see what you found if you looked into the state of existing research on AI coordination / delegation / systemic interactions and if any of it feels related. I’d be mildly surprised if people have studied exactly this but expect many relevant posts/papers.
In terms of related stuff on LessWrong, I can’t find it now but Paul Christiano has a post on worlds where things go badly slowly and I think this would be kinda in that genre.
I think this is an interesting thing to consider, and it feels somewhat related to Dan Hendrycks’ “Natural Selection Favors AIs over Humans” https://arxiv.org/abs/2303.16200. The connection in my head is “what does an AI ecosystem look like”, “what does it mean to discuss alignment in this context”, “what outcomes will this system tend towards”, etc. The same way middle managers get selected for, so, more generally, AI systems with certain properties get selected for. You might also want to read about Ought’s agenda of supervising processes rather than outcomes, which feels relevant.
Recursive middle manager hell feels somewhat related to inner misalignment / misaligned mesa-optimizers, where instead of being a subset of the processing of an LLM (how I normally think about it, but maybe not how others do), you have your AI system made of many layers, and it’s plausible that intermediate layers end up optimizing proxies for inputs to what you care about and not the thing itself. In this view, it seems like the misalignment of middle managers, which usually makes companies less effective, might just lead to selection against such systems as compared to systems with fewer of these properties. There might be some strategically valuable research to be done here, but it’s not super clear to me what the theory of change would be. Maybe there’s something to do with bandwidth/scalability trade-offs that affect how tightly coupled vs. diffuse/distributed useful/popular AI systems will be in the future.
Thanks Spencer! I’d love to respond in detail but alas, I lack the time at the moment.
Some quick points: I’m also really excited about SLT work. I’m curious to what degree there’s value in looking at toy models (such as Neel’s grokking work) and exploring them via SLT, or to what extent reasoning in SLT might be reinvigorated by integrating experimental ideas/methodology from MI (such as progress measures). It feels plausible to me that there just haven’t been enough people at any of a number of intersections looking at this stuff, and this is a good example. Not sure if you’re planning on going to this: https://www.lesswrong.com/posts/HtxLbGvD7htCybLmZ/singularities-against-the-singularity-announcing-workshop-on but it’s probably not in the cards for me. I’m wondering if promoting it to people with MI experience could be good.
I totally get what you’re saying about how a toy model in sense A or B doesn’t necessarily equate to a toy model of the hard part of the problem. This explanation helped a lot, thank you!
I hear what you are saying about next steps being challenging due to logistical and coordination issues, and because the problem is just really hard! I guess the recourse we have is something like: look for opportunities/chances that might justify giving something like this more attention or coordination. I’m also wondering if there might be ways of dramatically lowering the bar for doing work in related areas (e.g. the same way Neel writing TransformerLens got a lot more people into MI).
Looking forward to more discussions on this in the future, all the best!
I don’t buy that argument at all. “Text space” seems to have been adequate to get to GPT-3, which is incredibly impressive and useful in a variety of ways. Furthermore, what proof do you have that the resulting insights wouldn’t transfer to multi-modal systems like GPT-4 (which can see) or PaLM-E, which is embodied, can see, and operates in “text space”? Moreover, I’m not the first to point out that text space seems to incentivize models to develop highly sophisticated thinking abilities, which seem like the more important thing to focus on.
Maybe I’m not super across simulator theory. If you finetuned your LLM or RLHF’d it into not having this property (which I assume is possible), then does it cease to be a simulator? If so, what do we call the resulting thing?
Maybe proposing alternative hypotheses which make distinctly different predictions in some scenarios could be interesting.
Unrelated, but I think I once did a mini experiment where I tried to get GPT-3 to fail intentionally, with prompts like “write a recipe but forget an ingredient”, the idea being that once the model is fine-tuned to be performant it will struggle to choose precise places to fail when told to. This seemed interesting, but I wasn’t sure where to take it. Maybe it will be useful to you.
“In this post, I’ll present a way to turn LLMs into agents such that we can approximately model them as a utility maximizer.”
If this works, it would be very dangerous and the kind of thing we would want to avoid. We’re very lucky current systems are as weakly agentic as they are.
I think the SERI MATS organisers likely defer to mentors on the details of curriculum/research philosophy, so it feels like this letter should be addressed to them. It’s hard to disagree with “alignment researchers should think hard about their research methodology and strategy”, but I think you are probably failing to pass the ideological Turing test for various other positions, which will mean these arguments won’t connect.
Strong disagree. I can’t say I’ve worked through the entire article in detail, but I wanted to chime in as one of the many junior researchers investing energy in interpretability. I note that you erred on the side of making arguments too strong. I agree with Richard that this is the wrong kind of reasoning for novel scientific research, and with Rohin’s idea that we’re creating new affordances. I think MI is generally grounded and much closer to being a natural science that will progress over time and be useful for alignment, synergising with other approaches. I can’t speak for Neel, but I suspect the original list was more about getting something out there than making many nuanced arguments, so I think it’s important to steelman those kinds of claims / expand on them before responding.
A few extra notes:
The first point I want to address is your endorsement of “retargeting the search” and finding the “motivational API” within AI systems, which is my strongest motivation for working in interpretability.
I predict that methods like “steering vectors” are more likely to work in worlds where we make much more progress in our understanding of neural networks. But steering vectors are relatively recent, so it seems reasonable to think that we might have other ideas soon that could be equally useful but may require more general progress in the field.
We need only look to biology and medicine to see examples of imperfectly understood systems, which remain mysterious in many ways, yet where science has led us to impressive feats that might have been unimaginable years prior; for example, the ability in recent years to retarget the immune system to fight cancer. Because hindsight devalues science, we take such technologies for granted, and I think this leads to a general over-skepticism about fields like interpretability.
The second major point I wanted to address was this argument:
I agree that it makes little sense to think of a feature on its own as dangerous, but it sounds to me like you are making a point about emergence. If understanding transistors doesn’t lead to understanding computer software, then why work so hard to understand transistors?
I am pretty partial to the argument that the kinds of alignment relevant phenomena in neural networks will not be accessible via the same theories that we’re developing today in mechanistic interpretability. Maybe these phenomena will exist in something analogous to a “nervous system” while we’re still understanding “biochemistry”. Unlike transistors and computers though, biochemistry is hugely relevant to understanding neuroscience.