Joseph Bloom 7 Oct 2023 7:59 UTC
LW: 32 AF: 9
18
AF
on: Don’t Dismiss Simple Alignment Approaches
My vibe from this post is something like “we’re making progress on stuff that could be helpful so there’s stuff to work on!” and this is a vibe I like. However, I suspect that for people who might not be as excited about these approaches, you’re likely not touching on important cruxes (eg: do these approaches really scale? Are some agendas capabilities enhancing? Will these solve deceptive alignment or just corrigible alignment?).

I also think that if the goal is to actually make progress and not to maximize the number of people making progress or who feel like they’re making progress, then engaging with those cruxes is important before people invest substantive energy (ie: beyond upskilling). However as a directional update for people who are otherwise pretty cynical, this seems like a good update.

Features and Adversaries in MemoryDT

Joseph Bloom and Jay Bailey

20 Oct 2023 7:32 UTC

31 points

6 comments, 25 min read

Joseph Bloom 6 Mar 2024 16:53 UTC
25 points
8
on: Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT
Thanks for posting this! I’ve had a lot of conversations with people lately about OthelloGPT and I think it’s been useful for creating consensus about what we expect sparse autoencoders to recover in language models.
Maybe I missed it but:
- What is the performance of the model when the SAE output is used in place of the activations?
- What is the L0? You say 12% of features are active, so I assume that means 122 features are active. This seems plausibly like it could be too dense (though it’s hard to say, I don’t have strong intuitions here). It would be preferable to have a sweep where you have varying L0s but similar explained variance. The sparsity is important since that’s where the interpretability is coming from. One thing worth plotting might be the feature activation density of your SAE features as compared to the feature activation density of the probes (on a feature density histogram). I predict you will have features that are too sparse to match your probe directions 1:1 (apologies if you address this and I missed this).
- In particular, can you point to predictions (maybe in the early game) where your model is effectively perfect and where it is also perfect with the SAE output in place of the activations at some layer? I think this is important to quantify as I don’t think we have a good understanding of the relationship between explained variance of the SAE and model performance and so it’s not clear what counts as a “good enough” SAE.
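For concreteness, the two summary statistics asked about above could be computed roughly as follows. This is a minimal sketch rather than the post's code: `l0_per_token` and `explained_variance` are hypothetical helper names, the data are toy lists rather than OthelloGPT activations, and explained variance is taken as one minus the ratio of residual to total variance, which is only one common convention.

```python
from statistics import mean

def l0_per_token(feature_acts, eps=0.0):
    """Number of SAE features with activation above eps, for each token."""
    return [sum(1 for a in token if a > eps) for token in feature_acts]

def explained_variance(originals, reconstructions):
    """1 - (residual sum of squares / total sum of squares around the mean)."""
    dim = len(originals[0])
    centre = [mean(x[i] for x in originals) for i in range(dim)]
    residual = sum((x[i] - r[i]) ** 2
                   for x, r in zip(originals, reconstructions)
                   for i in range(dim))
    total = sum((x[i] - centre[i]) ** 2 for x in originals for i in range(dim))
    return 1.0 - residual / total

# Toy data (assumed, for illustration only): 3 tokens, 4 SAE features,
# 2-dimensional model activations and their SAE reconstructions.
feature_acts = [[0.0, 1.2, 0.0, 0.3], [0.0, 0.0, 0.0, 0.9], [2.0, 0.1, 0.4, 0.0]]
originals = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
reconstructions = [[0.9, 0.1], [0.1, 0.8], [1.0, 1.0]]

l0s = l0_per_token(feature_acts)   # active-feature count per token
mean_l0 = mean(l0s)                # the "L0" usually reported for an SAE
ev = explained_variance(originals, reconstructions)
```

A sweep of the kind suggested would record `mean_l0` and `ev` for each sparsity coefficient and compare SAEs at matched `ev`.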
I think a number of people expected SAEs trained on OthelloGPT to recover directions which aligned with the mine/their probe directions, though my personal opinion was that besides “this square is a legal move”, it isn’t clear that we should expect features to act as classifiers over the board state in the same way that probes do.
This reflects several intuitions:
1. At a high level, you don’t get to pick the ontology. SAEs are exciting because they are unsupervised and can show us results we didn’t expect. On simple toy models, they do recover true features, and with those maybe we know the “true ontology” on some level. I think it’s a stretch to extend the same reasoning to OthelloGPT just because information salient to us is linearly probe-able.
2. Just because information is linearly probeable doesn’t mean it should be recovered by sparse autoencoders. To expect this, we’d have to have stronger priors over the underlying algorithm used by OthelloGPT. Sure, it must use representations which enable it to make predictions up to the quality it predicts, but there’s likely a large space of concepts it could represent. For example, information could be represented by the model in a local or semi-local code or deep in superposition. Since the SAE is trying to detect representations in the model, our beliefs about the underlying algorithm should inform our expectations of what it should recover, and since we don’t have a good description of the circuits in OthelloGPT, we should be more uncertain about what the SAE should find.
3. Separately, it’s clear that sparse autoencoders should be biased toward local codes over semi-local / compositional codes due to the L1 sparsity penalty on activations. This means that even if we were sure that the model represented information in a particular way, it seems likely the SAE would create representations for variables like (A and B) and (A and B’) in place of A even if the model represents A. However, the exciting thing about this intuition is it makes a very testable prediction about combinations of features likely combining to be effective classifiers over the board state. I’d be very excited to see an attempt to train neuron-in-a-haystack style sparse probes over SAE features in OthelloGPT for this reason.
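The L1 bias in point 3 can be read off the usual SAE training objective. As a sketch in assumed notation (none of these symbols come from the post): x is the model activation, f(x) the feature activations, W_e and W_d the encoder and decoder weights, b_e and b_d their biases (some implementations omit b_d), and λ the sparsity coefficient.

```latex
\begin{aligned}
f(x) &= \mathrm{ReLU}(W_e x + b_e), \qquad \hat{x} = W_d\, f(x) + b_d, \\
\mathcal{L}(x) &= \lVert x - \hat{x} \rVert_2^2 + \lambda\, \lVert f(x) \rVert_1 .
\end{aligned}
```

Under this objective, describing an input with separate features for A and for B pays the λ term for two active features, whereas a single dedicated (A and B) feature pays it for one, which is one way to see why conjunction-like features might be preferred.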
Some other feedback:
- Positive: I think this post was really well written and while I haven’t read it in more detail, I’m a huge fan of how much detail you provided and think this is great.
- Positive: I think this is a great candidate for study and I’m very interested in getting “gold-standard” results on SAEs for OthelloGPT. When Andy and I trained them, we found they could train in about 10 minutes making them a plausible candidate for regular / consistent methods benchmarking. Fast iteration is valuable.
- Negative: I found your bolded claims in the introduction jarring. In particular “This demonstrates that current techniques for sparse autoencoders may fail to find a large majority of the interesting, interpretable features in a language model”. I think this is overclaiming in the sense that OthelloGPT is not toy-enough, nor do we understand it well enough to know that SAEs have failed here, so much as they aren’t recovering what you expect. Moreover, I think it would be best to hold off on proposing solutions here (in the sense that trying to map directly from your results to the viability of the technique encourages us to think about arguments for / against SAEs rather than asking, what do SAEs actually recover, how do neural networks actually work and what’s the relationship between the two).
- Negative: I’m quite concerned that tying the encoder / decoder weights and not having a decoder output bias results in worse SAEs. I’ve found the decoder bias initialization to have a big effect on performance (sometimes) and so by extension whether or not it’s there seems likely to matter. Would be interested to see you follow up on this.
Oh, and maybe you saw this already but an academic group put out this related work: https://arxiv.org/abs/2402.12201 I don’t think they quantify the proportion of probe directions they recover, but they do indicate recovery of all types of features that have been previously probed for. Likely worth a read if you haven’t seen it.
What links here?
- Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT by Robert_AIZI (5 Mar 2024 13:55 UTC; 53 points)

Joseph Bloom 9 Mar 2023 20:19 UTC
20 points
8
on: Anthropic’s Core Views on AI Safety

For comparison, others might want to see the DeepMind alignment team’s strategy: https://www.lesswrong.com/posts/a9SPcZ6GXAg9cNKdi/linkpost-deepmind-alignment-team-s-strategy

I think this is the equivalent post for OpenAI but someone feel free to correct me:
https://www.lesswrong.com/posts/28sEs97ehEo8WZYb8/openai-s-alignment-plans

Joseph Bloom 8 May 2023 1:57 UTC
LW: 17 AF: 12
0
AF
on: Residual stream norms grow exponentially over the forward pass
Second pass through this post which solidly nerd-sniped me!
A quick summary of my understanding of the post (intentionally very reductive, though I understand the post may make more subtle points):
1. There appears to be exponential growth in the norm of the residual stream in a range of models. Why is this the case?
2. You consider two hypotheses:
  1. That the parameters in the Attention and/or MLP weights increase later in the network.
  2. That there is some monkey business with the layer norm sneaking in a single extra feature.
3. In terms of evidence, you found that:
  1. Evidence for theory one in W_OV Frobenius norms increasing approximately exponentially over layers.
  2. Evidence for theory one in MLP output to the residual stream increasing (harder to directly measure the norm of the MLP due to non-linearities).
4. Your favoured explanation is “We finally note our current favored explanation: Due to LayerNorm, it’s hard to cancel out existing residual stream features, but easy to overshadow existing features by just making new features 4.5% larger.”
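To get a feel for the 4.5% figure, here is a rough worked calculation. It assumes, purely for illustration, that each layer's new contribution is exactly 4.5% larger than the previous layer's and that a feature written at layer 0 is never modified again. `growth_factor` and `relative_share` are hypothetical names, and the layer counts are arbitrary examples rather than the models studied in the post.

```python
def growth_factor(rate, n_layers):
    """Multiplicative growth after n_layers of compounding at `rate` per layer."""
    return (1.0 + rate) ** n_layers

def relative_share(rate, n_layers):
    """Size of a layer-0 feature relative to a fresh feature written n_layers later."""
    return 1.0 / growth_factor(rate, n_layers)

# Print how compounding 4.5% per layer plays out at a few example depths.
for n_layers in (12, 24, 48):
    print(n_layers,
          round(growth_factor(0.045, n_layers), 2),
          round(relative_share(0.045, n_layers), 2))
```

Under these assumptions an old feature is never cancelled; it simply shrinks relative to newer features once LayerNorm rescales the stream.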
My thoughts:
- My general take is that the explanation about cancelling out features being harder than amplifying new features feels somewhat disconnected from the high level characterisation of weights / norms which makes up most of the post. It feels like there is a question of how and a question of why.
- Given these models are highly optimized by SGD, it seems like the conclusion must be that the residual stream norm is growing because this is useful. That leads to the argument that it is useful because the residual stream is a limited resource / has limited capacity, making us want to delete information in it, and increasing the norm of the contributions to the residual stream effectively achieves this by drowning out other features.
- Moreover, if the mechanism by which we achieve larger residual stream contributions in later components is by having larger weights (which is penalized by weight decay) then we should conclude that a residual stream with a large norm is worthwhile enough that the model would rather do this than have smaller weights (which you note).
- I still don’t feel like I know why, though. Later layers have more information and are therefore “wiser” or something could be part of it.
- I’d also really like to know the implications of this. Does this affect the expressivity of the model in a meaningful way? Does it affect the relative value of representing a feature in any given part of the model? Does this create an incentive to “relocate” circuits during training or learn generic “amplification” functions? These are all ill-defined questions to some extent but maybe there are formulations of them that are better defined which have implications for MI related alignment work.
Thanks for writing this up! Looking forward to subsequent post/details :)

PS: Is there a non-trivial relationship between this post and tuned lens/logit lens? https://arxiv.org/pdf/2303.08112.pdf Seems possible.

Joseph Bloom 9 May 2023 0:53 UTC
LW: 11 AF: 8
1
AF
in reply to: TurnTrout’s comment on: Residual stream norms grow exponentially over the forward pass
We would love to see more ideas & hypotheses on why the model might be doing this, as well as attempts to test this! We mainly wrote-up this post because both Alex and I independently noticed this and weren’t aware of this previously, so we wanted to make a reference post.
Happy to provide! I think I’m pretty interested in testing this/working on this in the future. Currently a bit tied up but I think (as Alex hints at) there could be some big implications for interpretability here.

TLDR: Documenting existing circuits is good, but explaining what relationship circuits have to each other within the model, such as by understanding how the model allocates limited resources such as residual stream and weights between different learnable circuits, seems important.
The general topic I think we are getting at is something like “circuit economics”. The thing I’m trying to gesture at is that while circuits might deliver value in distinct ways (such as reducing loss on different inputs, activating on distinct patterns), they share capacity in weights (see Polysemanticity and Capacity in Neural Networks) and I guess “bandwidth” (getting penalized for interfering signals in activations). There are a few reasons why I think this feels like economics, which include: scarce resources, value chains (features composed of other features) and competition (if a circuit is predicting something well with one heuristic, maybe there will be smaller gradient updates to encourage another circuit learning a different heuristic to emerge).

So to tie this back to your post and Alex’s comment (“which seems like it would cut away exponentially many virtual heads? That would be awfully convenient for interpretability.”): I think that what interpretability has recently dealt with in elucidating specific circuits is something like “micro-interpretability” and is akin to microeconomics. However, this post seems to show a larger trend, ie “macro-interpretability”, which would possibly affect which such circuits are possible/likely to be in the final model.

I’ll elaborate briefly on the off chance this seems like it might be a useful analogy/framing to motivate further work.
- Studying the Capacity/Loss Reduction distribution in Time: It seems like during transformer training there may be an effect not unlike inflation? Circuits which delivered enough value to justify their capacity use early in training may fall below the capacity/loss reduction cut off later. Maybe various techniques which enable us to train more robust models work because they make these transitions easier.
- Studying the Capacity/Loss Reduction distribution in Layer: Moreover, it seems plausible that the distribution of “usefulness” in circuits in different layers of the network may be far from uniform. Circuits later in the network have far more refined inputs which make them better at reducing loss. Residual stream norm growth seems like a “macro” effect that shows models “know” that later layers are more important.
- Studying the Capacity/Loss Reduction distribution in Layer and Time: Combining the above. I’d predict that neural networks originally start by having valuable circuits in many layers but then transition to maintain circuits earlier in the network which are valuable to many downstream circuits and circuits later in the network which make the best use of earlier circuits.
- More generally, “circuit economics” as a framing seems to suggest that there are different types of “goods” in the transformer economy: those which directly lead to better predictions and those which are useful for making better predictions when integrated with other features. The success of Logit Lens seems to suggest that the former category increases over the course of the layers. Maybe this is the only kind of good, in which case transformers would be “fundamentally interpretable” in some sense. All intermediate signals could be interpreted as final products. More likely, I think, is that later in training there are ways to reinforce the creation of more internal goods (in economics, goods which are used to make other goods are called capital goods). The value of such goods would be mediated via later circuits. So this would lead also to the “deletion-by-magnitude theory” as a way of removing internal goods.
- To bring this back to language already in the field, see Neel’s discussion here. A modular circuit is distinct from an end-to-end circuit in that it starts and ends in intermediate activations. Modular circuits may be composable. I propose that the outputs of such circuits are “capital goods”. If we think about the “circuit economy”, it then seems totally reasonable that multiple suppliers might generate equivalent capital goods and have a many-to-many relationship with multiple different circuits near the end voting on logits.
This is a very speculative “theory”, if you can call it that, but I guess I feel this would be “big if true”. I also make no claims about this being super original or actually that useful in practice but it does feel intuition generating. I think this is totally the kind of thing people might have worked on sooner but it’s likely been historically hard to measure the kinds of things that might be relevant. What your post shows is that between the transformer circuits framework and TransformerLens we are able to take a bunch of interesting measurements relatively quickly, which may provide more traction on this than previously possible.
What links here?
- TurnTrout's comment on TurnTrout’s shortform feed by TurnTrout (11 Jul 2023 17:14 UTC; 8 points)

Joseph Bloom 20 Apr 2023 21:16 UTC
11 points
10
on: Proposal: Using Monte Carlo tree search instead of RLHF for alignment research
“In this post, I’ll present a way to turn LLMs into agents such that we can approximately model them as a utility maximizer.”

If this works it would be very dangerous and the kind of thing we would want to avoid. We’re very lucky current systems are as poorly agentic as they are.

Joseph Bloom 7 Feb 2023 21:41 UTC
10 points
0
in reply to: TurnTrout’s comment on: Decision Transformer Interpretability
The ablations seem surprisingly clear-cut. I consider myself to be very on board with “RL-trained behaviors are contextually activated and modulated”, but even I wasn’t expecting such strong localization.
Neither was I. After the fact, it seems easy to come up with reasons why this might be the case. I think measuring this with something like excluded loss might enable a more precise quantification of exactly how localised it was. I also don’t see strong reasons to expect this to generalise to larger models. If I see similar results when the task is more complicated, the model is bigger and especially with a larger context window, then I will be more interested in trying to precisely describe how/why/when you get more localisation.
Seems to me that randomness wouldn’t prevent the agent from being calibrated, because even though any given episode might deviate from the prescribed number of steps, on average the randomness can (presumably) be made to add up to that number. EG it might be hard to bump into the goal after exactly 14 steps due to random obstacles, but I’d imagine ensuring this falls between 10 and 18 steps is feasible?
I think you might be right in the infinite training data regime, where I would expect it to be unbiased. However, I suspect that the training data being sparse, especially in the low positive RTG region, is enough to make the signal weak. The loss incurred by finishing in the wrong number of steps is likely very small compared to failing when you have positive RTG or succeeding when you have negative RTG, so it could also be that the model doesn’t allocate much capacity to being well-calibrated in this sense.
This seems like an important manifestation of “models don’t ‘get reward’, they are shaped by reward”; even on a simple task where presumably agents can fully explore all relevant options. The e.g. observation encoding (where an obstacle is “3/4″ of a goal) matters when predicting what behavioral shards & subroutines get trained into the policy, or considering what e.g. early-stage policies will be like.
I thought this too and should have remembered to cite “Reward is not the Optimization target”. I feel like that concept is now more visceral for me. A relevant takeaway that may or may not be obvious is that simulators with inductive biases might be better/worse at simulating particular stuff as a function of their inductive biases. In this case, the positive RTG range was more miscalibrated as a result of the bias.
Wait, how? Isn’t the state observation constant in this task? I’m guessing you’re discussing something else?
I’m not sure what you mean by constant. If it were constant then the agent/obstacles wouldn’t be moving? I’ll elaborate a little bit in case that helps.
Since RTG gives information about whether the agent should go forward/not in many contexts, you might have expected (although now I think I wouldn’t) the residual stream embedding for the state token not to directly contribute to one logit over another before RTG has been seen. In practice, it seems like it does. For example, in this situation (picture below from the app) with RTG = 90, there is a wall to the left of the agent and the state appears to strongly encourage forward/right. I interpret this as “some agent behaviours are independent of RTG and can be encouraged as a function of the observation/state before RTG is seen”.

To clear up some language as well:
- state encoding → how is the state represented? weird minigrid schema.
- state token → the value of a state represented as a vector input to the model.
- state embedding → an internal representation of the state at some point in the model.

I should have written state-token not state-embedding in the quoted paragraph. Apologies if this led to confusion.

Joseph Bloom 9 Dec 2023 12:32 UTC
9 points
0
on: Finding Sparse Linear Connections between Features in LLMs
Interesting! This is very cool work but I’d like to understand your metrics better.
- “So we take the difference in loss for features (ie for a feature, we take linear loss—MLP loss)”. What do you mean here? Is this the difference between the mean MSE loss when the feature is on vs not on?
- Can you please report the L0s for each of the auto-encoders and the linear model, as well as the next token prediction loss when using the autoencoder/linear model? These are important metrics on which my general excitement hinges (eg: if those are both great, I’m way more interested in results about specific features).
- I’d be very interested if you can take a specific input, look at the features present and compare them between the autoencoder and the linear model. This would be especially cool if you pick an example where ablating the MLP out causes the incorrect prediction so we know it’s representing something important.
- Are you using a holdout dataset of eval tokens when measuring losses? Or how many tokens are you using to measure losses?
- Have you plotted per token MSE loss vs L0 for each model? Do they look similar? Are there any outliers in that relationship?
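The last question could be checked with something like the sketch below. It is only an illustration on made-up numbers: `per_token_mse` and `pearson` are hypothetical helpers, and the targets, predictions and L0 values are toy data rather than anything from the post.

```python
def per_token_mse(targets, predictions):
    """Mean squared error for each token, averaged over the activation dimension."""
    return [sum((t - p) ** 2 for t, p in zip(tv, pv)) / len(tv)
            for tv, pv in zip(targets, predictions)]

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Toy held-out tokens (assumed): MLP outputs, a model's predictions of them,
# and the number of active features used for each token.
targets = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
predictions = [[1.0, 0.0], [0.0, 0.8], [0.7, 1.0], [1.4, 0.0]]
l0s = [1, 2, 3, 4]

mses = per_token_mse(targets, predictions)
r = pearson(l0s, mses)  # a single number summarising the MSE-vs-L0 relationship
```

Plotting `(l0s, mses)` as a scatter for each model, rather than only reporting `r`, is what would make outliers visible.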

Joseph Bloom 9 Mar 2023 20:19 UTC
9 points
5
on: Anthropic’s Core Views on AI Safety
Thanks Zac.

My high level take is I found this very useful for understanding Anthropic’s broader strategy and think that I agree with a lot of the thinking. It definitely seems like some of this research could backfire but Anthropic is aware of that. The rest of my thoughts are below.

I found a lot of value in the examination of different scenarios. I think this provides the clearest explanations for why Anthropic is taking an empirical/portfolio approach. My mental model of people disagreeing with this approach involves them either being more confident about pessimistic (they would say realistic) scenarios or disagreeing with specific research agendas/having favorites. I’m very uncertain about which scenario we live in but in the context of that uncertainty, the portfolio approach seems reasonable.

I think the most contentious part of this post will probably be the arguments in favor of working with frontier models. It seems to me that while this is dangerous, the knowledge required to correctly assess a) whether this is necessary, b) what, if any, results that arise from such research should be published seems closely tied to that work itself (ie: questions like how many safety relevant phenomena just don’t exist in smaller models and how redundant work on small models becomes).

Writing this comment, I feel a strong sense of, “gee, I feel like if anyone would have the insights to know whether this stuff is a safe bet, it would be the teams at Anthropic” and that feels kind of dangerous. Independent oversight such as ARC evals might help us but a strong internal culture of red-teaming different strategies would also be good.

Quoting from the main article, I wanted to highlight some points:
Furthermore, we think that in practice, doing safety research isn’t enough – it’s also important to build an organization with the institutional knowledge to integrate the latest safety research into real systems as quickly as possible.
I think this is a really good point. The actual implementation of many alignment strategies might be exceedingly technically complicated and it seems unlikely that we could attain that knowledge quickly as opposed to over years of working with frontier models.
In a sense one can view alignment capabilities vs alignment science as a “blue team” vs “red team” distinction, where alignment capabilities research attempts to develop new algorithms, while alignment science tries to understand and expose their limitations.
This distinction also seems good to me. If there is work that can’t be published, or until functional independent evaluation is working well, then high quality internal red-teaming seems essential.

Joseph Bloom 16 Feb 2024 22:17 UTC
8 points
2
on: Fixing Feature Suppression in SAEs
Awesome work! I’d be quite interested to know whether the benefits from this technique are equivalently significant with a larger SAE and also what the original perplexity was (when looking at the summary statistics table). I’ll probably reimplement at some point.
Also, kudos on the visualizations. Really love the color scales!

Joseph Bloom 4 Jul 2023 3:38 UTC
8 points
2
on: Ten Levels of AI Alignment Difficulty
Thanks for writing this up. I really liked this framing when I first read about it but reading this post has helped me reflect more deeply on it.
I’d also like to know your thoughts on whether Chris Olah’s original framing, that anything which advances this ‘present margin of safety research’ is net positive, is the correct response to this uncertainty.
I wouldn’t call it correct or incorrect, only useful in some ways and not others. Whether it’s net positive may rely on whether it is used by people in cases where it is appropriate/useful.
As an educational resource/communication tool, I think this framing is useful. It’s often useful to collapse complex topics into few axes and construct idealised patterns, in this case a difficulty-distribution on which we place techniques by the kinds of scenarios where they provide marginal safety. This could be useful for helping people initially orient to existing ideas in the field or in governance or possibly when making funding decisions.

However, I feel like as a tool to reduce fundamental confusion about AI systems, it’s not very useful. The issue is that many of the current ideas we have in AI alignment are based significantly on pre-formal conjecture that is not grounded in observations of real world systems (see the Alignment Problem from a Deep Learning Perspective). Before we observe more advanced future systems, we should be highly uncertain about existing ideas. Moreover, it seems like this scale attempts to describe reality via the set of solutions which produce some outcome in it? This seems like an abstraction that is unlikely to be useful.
In other words, I think it’s possible that this framing leads to confusion between the map and the territory, where the map is making predictions about tools that are useful in territory which we have yet to observe.

To illustrate how such an axis may be unhelpful if you were trying to think more clearly, consider the equivalent for medicine. Diseases can be divided into classes by difficulty to cure, with corresponding research being useful for curing them. Cuts/scrapes are self-mending, whereas infections require corresponding antibiotics/antivirals; immune disorders and cancers are diverse and therefore span various levels of difficulty amongst their instantiations. It’s not clear to me that biologists/doctors would find much use in conjecture on exactly how hard each disease is to cure vs how likely it is to occur, especially in worlds where you lack a fundamental understanding of the related phenomena. Possibly, a closer analogy would be trying to troubleshoot ways evolution can generate highly dangerous species like humans.
I think my attitude here leads into more takes about good and bad ways to discuss which research we should prioritise but I’m not sure how to convey those concisely. Hopefully this is useful.

Joseph Bloom 24 Oct 2022 8:33 UTC
7 points
2
on: AI researchers announce NeuroAI agenda
To the extent that one might not have predicted scientists to hold these views, I can see why this paper might cause a positive predictive update on brainlike AGI.

However, technological development is not a zero-sum game. Opportunities or enthusiasm in neuroscience don’t in themselves make prosaic AGI less likely and I don’t feel like any of the provided arguments are knockdown arguments against ANNs leading to prosaic AGI.

I don’t particularly find arguments about human level intelligence being unprecedented outside of humans convincing, in part because of the analogy to “the god of the gaps”. Many predictions about what computers can’t do have been falsified, sometimes in unexpected ways (ie: arguments about AI not being able to make art). Moreover, that more is different in AI and the development of single-shot models seem like powerful arguments about the potential of prosaic AI systems when scaled.

Joseph Bloom 11 Nov 2022 11:51 UTC
6 points
0
on: Why I’m Working On Model Agnostic Interpretability
Thanks Jessica!
I like 1) and think this is worth doing. I believe that Mechanistic Interpretability researchers are already somewhat concerned about insight not generalising from toy models to larger models let alone to novel architectures so work on model agnostic levels could be useful in the same paradigm too.
Something to note, I’m not confident about the track record of model agnostic methods (such as saliency maps). I’ve heard from at least one ML researcher that saliency maps have a poor track record and have been shown to be unreliable in a variety of experiments. Do you know of any other examples of model agnostic interpretability methods which you think might be very useful? Maybe saliency maps don’t matter as much as the idea of model agnostic methods, in which case feel free to disregard this. I’ve heard before of interest in generally approaching models as black boxes (“ML psychologist”) while we try to understand them, so I don’t think the value of this approach lies too heavily in specific prior methods.
With respect to 2), while I think this is reasonable, I believe the salient point is whether models from the current paradigm are sufficiently dangerous fast enough that they warrant more/less focus. Theoretically, the space of possible ML architecture paradigms producing doom is large and the order in which they will manifest is roughly the order in which we should solve them. (ie: align current systems, then new paradigm systems, then new paradigm systems, each buying time).
However, I think there are good enough reasons to work on model agnostic methods that don’t rely on AGI doom originating in a new paradigm.
Overall, very exciting! Good luck!

Joseph Bloom 9 Mar 2024 5:38 UTC
5 points
2
in reply to: jacquesthibs’s comment on: How to train your own “Sleeper Agents”
Depending on model size I’m fairly confident we can train SAEs and see if they can find relevant features (feel free to dm me about this).

Joseph Bloom 26 Oct 2023 19:32 UTC
5 points
2
on: [Paper] All’s Fair In Love And Love: Copy Suppression in GPT-2 Small
Cool paper. I think the semantic similarity result is particularly interesting.

As I understand it, you’ve got a circuit that wants to calculate something like Sim(A,B), where A and B might have many “senses” (aka features), but the Sim might not be a linear function of each of these Sims across all senses/features.

So for example, there are senses in which “Berkeley” and “California” are geographically related, and there might be a few other senses in which they are semantically related but probably none that really matter for copy suppression. For this reason I wouldn’t expect the tokens of each to have a cosine similarity that is predictive of the copy suppression score. This would only happen for really “mono-semantic tokens” that have only one sense (maybe you could test that).
Moreover, there are also tokens which you might want to ignore when doing copy suppression (speculatively). Eg: very common words or punctuation (the/and/etc).
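A toy illustration of why whole-token cosine similarity can hide similarity in one sense. Everything here is assumed for the example: the three orthogonal "sense" axes, the vectors standing in for the two tokens, and the helper names `cosine` and `cosine_on`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cosine_on(u, v, axes):
    """Cosine similarity restricted to a chosen subset of feature axes."""
    return cosine([u[i] for i in axes], [v[i] for i in axes])

# Hypothetical orthogonal senses: [west_coast, university, state]
berkeley = [1.0, 2.0, 0.0]
california = [1.0, 0.0, 2.0]

overall = cosine(berkeley, california)               # diluted by the unshared senses
west_coast_only = cosine_on(berkeley, california, [0])  # similarity in the shared sense
```

In this toy setup the tokens agree perfectly on the shared axis while their overall cosine similarity is low, which is the sense in which similarity need not be a simple function of the raw token vectors.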

I’d be interested if you have used something like SAEs to decompose the tokens into the underlying feature/s present at different intensities in each of these tokens (or the activations prior to the key/query projections). Follow up experiments could attempt to determine whether copy suppression could be better understood when the semantic subspaces are known. Some things that might be cool here:
- Show that some features are mapped to the null space of keys/queries in copy suppression heads indicating semantic senses / features that are ignored by copy suppression. Maybe multiple anti-induction heads compose (within or between layers) so that if one maps a feature to the null space, another doesn’t (or some linear combination) or via a more complicated function of sets of features being used to inform suppression.
- Similarly, show that the OV circuit is suppressing the same feature/features you think are being used to determine semantic similarity. If there’s some asymmetry here, that could be interesting as it would correspond to “I calculate A and B as similar by their similarity in the *California axis* but I suppress predictions of any token that has the feature for anywhere on the West Coast”.
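One crude way to operationalise "mapped to the null space of keys/queries" is to project each feature direction through the key (or query) matrix and see how much of its norm survives. The sketch below assumes a toy 2x3 key matrix and hypothetical names (`matvec`, `norm`, `retained_fraction`); it is not code from the paper.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def norm(vector):
    """Euclidean norm."""
    return math.sqrt(sum(x * x for x in vector))

def retained_fraction(w_k, feature):
    """Norm of the projected feature relative to the norm of the feature itself."""
    return norm(matvec(w_k, feature)) / norm(feature)

# Toy key matrix (assumed): 2-d head reading a 3-d residual stream;
# the third residual dimension is never read.
W_K = [[1.0, 0.0, 0.0],
       [0.0, 1.0, 0.0]]
used_feature = [1.0, 1.0, 0.0]      # lies in the subspace the head reads
ignored_feature = [0.0, 0.0, 1.0]   # lies in the null space of W_K

used_score = retained_fraction(W_K, used_feature)
ignored_score = retained_fraction(W_K, ignored_feature)
```

A score near zero marks a feature the head cannot see through its keys. With a real, non-orthonormal matrix the ratio is only a rough scale-dependent score, so comparing features against each other is more meaningful than reading off absolute values.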

I’m particularly excited about this because it might represent a really good way to show how knowing features informs the quality of mechanistic explanations.

Joseph Bloom

Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small

Decision Transformer Interpretability

Understanding SAE Features with the Logit Lens

A Mechanistic Interpretability Analysis of a GridWorld Agent-Simulator (Part 1 of N)

Features and Adversaries in MemoryDT
