technical AI safety program associate at OpenPhil
jake_mendel
Three positive updates I made about technical grantmaking at Coefficient Giving (fka Open Phil)
Copy-pasting from a Slack thread:
I’ll list some work that I think is aspiring to build towards an answer to some of these questions, although lots of it is very toy:
On generalisation vs simple heuristics:
I think the nicest papers here are some toy model interp papers like progress measures for grokking and Grokked Transformers. These are two papers which present a pretty crisp distinction between levels of generality of different algorithms that perform similarly on the training set: in modular addition there are two levels of algorithm, and in the Grokked Transformers setting there are three (a hand-coded sketch of the generalising modular-addition algorithm is below, at the end of this list). The story of what the models end up doing is pretty nuanced, and it comes down to specific details of the pre-training data mix, which maybe isn’t that surprising. (If you squint, you can sort of see predictions from the Grokked Transformers paper being borne out in the nuance about when LLMs can do multi-hop reasoning, e.g. Yang et al.) But it seems pretty clear that if the training conditions are right, then you can get increasingly general algorithms learned even when simpler ones would do the trick.
I also think a useful idea (although less useful so far than the previous bullet) is that in certain situations, the way a model implements a memorising circuit can sort of naturally become a generalising circuit once it has memorised enough. The only concrete example of this that I know of (and it’s not been empirically validated) is the story from computation in superposition of how a head that memorises a lookup table of features to copy continuously becomes a copying head that generalises to new features once you make the lookup table big enough.
These are all toy settings where we can be pretty crisp about what we mean by memorisation and generalisation. I think the picture that we’re beginning to see emerge is that what counts as memorisation and generalisation is very messy and in the weeds and context-specific, but that transformers can generalise in powerful ways if the pre-training mix is right. What “right” means, and what “generalise in powerful ways” means in situations we care about are still unsolved technical questions.
Meanwhile, I also think it’s useful to just look at very qualitatively surprising examples of frontier models generalising far even if we can’t be precise about what memorisation and generalisation mean in that setting. Papers that I think are especially cool on this axis include emergent misalignment, anti-imitation, OOCR, LLMs are aware of their learned behaviors, LLMs are aware of how they’re being steered (I think it’s an especially interesting and risk-relevant type of generalisation when the policy starts to know something that is only ‘shown’ to the training process). However, I think it’s quite hard to look at these papers and make predictions about future generalisation successes and failures because we don’t have any basic understanding of how to talk about generalisation in these settings.
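As promised above, a hand-coded sketch of the two levels of algorithm in the modular addition case (my own toy code, not the papers’): a memorising lookup table vs the trig/Fourier construction that a grokked network implements. Summing cosines over all frequencies makes the argmax exact; the trained network only uses a handful of key frequencies.

```python
import numpy as np

p = 113  # the modulus used in the progress-measures-for-grokking setting

def memorising_solution(table, a, b):
    """Lowest level of generality: a lookup table that only covers training pairs."""
    return table.get((a, b))  # returns None off the training set

def generalising_solution(a, b):
    """The trig/Fourier algorithm: sum_k cos(2*pi*k*(a+b-c)/p) equals p when
    c == (a + b) mod p and 0 otherwise, so the argmax over c is exact."""
    c = np.arange(p)
    k = np.arange(p)
    logits = np.cos(2 * np.pi * np.outer(k, a + b - c) / p).sum(axis=0)
    return int(np.argmax(logits))

rng = np.random.default_rng(0)
for a, b in rng.integers(0, p, size=(20, 2)):
    assert generalising_solution(int(a), int(b)) == (a + b) % p
```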
On inductive biases and the speed prior:
I don’t have much to say about how useful the speed prior is at mitigating scheming, but I think there has been some interesting basic science on what the prior implied by neural network training is in practice. Obviously SLT comes to mind, and some people have tried to claim that SLT suggests that neural network training is actually more like the Solomonoff prior than the speed prior (e.g. Bushnaq), although I think that work is pretty shaky and may well not hold up.
I think something that’s missing from both the speed prior and the Solomonoff prior is a notion of learnability: the reason we have eyes and not cameras is not because eyes have lower K-complexity or lower Kt-complexity than cameras. It’s because there is a curriculum for learning eyes and there (probably) isn’t for cameras; neural network training also requires a training story/learnability. All the work that I know of exploring this is in very toy settings (the low hanging fruit prior, leap complexity and the sparse parity problem — the sparse parity task is sketched at the end of this list). I don’t think any of these results are strong enough to make detailed claims about p(deception) yet, and they don’t seem close.
OTOH, most learning in the future might be more like current in-context learning, and (very speculatively) it seems possible that in-context learning is more Bayesian (less path-dependent/learnability-dependent) than pre-training. See e.g. Riechers et al.
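For reference, the sparse parity task mentioned above, in minimal form (a hypothetical data generator written for illustration, not code from any of the cited papers): the label is the parity of a hidden subset of k bits, so each individual input bit is uncorrelated with the label on its own, which is what makes it a nice testbed for questions about curricula/learnability.

```python
import numpy as np

def sparse_parity_batch(n_samples, n_bits=50, k=4, seed=0):
    """(n, k)-sparse parity: inputs are random +/-1 vectors of length n_bits,
    and the label is the product (XOR in +/-1 encoding) of a fixed hidden
    subset of k coordinates. No single coordinate carries any signal about
    the label, so there is no obvious staircase of easier sub-tasks unless
    the data distribution provides one."""
    rng = np.random.default_rng(seed)
    support = rng.choice(n_bits, size=k, replace=False)  # hidden relevant bits
    x = rng.choice([-1.0, 1.0], size=(n_samples, n_bits))
    y = x[:, support].prod(axis=1)
    return x, y, support

x, y, support = sparse_parity_batch(1024)
print("hidden support:", support, "| label balance:", (y == 1).mean())
```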
Some random thoughts on what goals powerful AIs will have more generally:
I think we’ve seen some decent evidence that a lot of training with RL makes models obsessed with completing tasks. I think the main evidence here comes from the reward hacking results, but I also think the Apollo anti-scheming paper is an important result about how strong/robust this phenomenon is. Despite a reasonably concerted effort to train the model to care about something that’s in tension with task completion (being honest), RLVR instilled such a strong preference/heuristic for task completion that the model still ends up wanting to complete tasks enough to deceive etc. in test environments, even though the deliberative alignment training process doesn’t reward task completion at all and only rewards honesty (in fact, by the end of anti-scheming training the model does the honest thing in every training environment!). I don’t think it was super obvious a priori that RL would embed task-completion preference that strongly (overriding the human prior).
I think there are some other lessons we could glean here about the stickiness of goals in general. The anti-scheming results suggest to me that something about the quantity and diversity of the RLVR environments internalised the task-completion preference deeply enough that it was still present after a full training round designed to disincentivize it. Contrast this with results showing that safety training is very shallow and can be destroyed easily (e.g. BadLlama).
Very speculatively, I’m excited about the growing field of teaching models fake facts and trying to work out when they actually believe the fake facts. It seems possible that some techniques and ideas that were developed in order to get models to internalise beliefs deeply (and evaluating success) could be coopted for getting models to internalise preferences/goals deeply (and evaluating success).
Pretty obvious point, but I think the existence of today’s models and the relatively slow progress to human-level intelligence tells us that insofar as future AIs end up misaligned, their goals will pretty likely be similar to/indistinguishable from human values at a low level of detail, and it’s only when you zoom in that the values would be importantly different from humans’. Of course, this might be enough to kill us. This echoes the sense in which human values are importantly different from inclusive genetic fitness but not that different, and we do still have lots of kids etc. To spell out the idea: before the AI is smart enough to fully subvert training and guard its goals, we will have lots of ability to shape what goals it ends up with. At some point, if we fail to solve alignment, we will not be able to further refine its goals, but the goals it ends up guarding will be quite related to human goals, because they were formed by reward signals that did sort of touch those goals. Again, maybe this is obvious to everyone, but it does seem, at least to me, to be in contrast with references to squiggles/paperclips, which I think are more feasible to end up with if you imagine a Brain In A Basement style takeoff.
What are solenoidal flux corrections in this context?
Thanks for this post!
Caveat: I haven’t read this very closely yet, and I’m not an economist. I’m finding it hard to understand why you think it’s reasonable to model an increase in capabilities by an increase in number of parallel copies. That is: in the returns to R&D section, you look at data on how increasing numbers of human-level researchers in AI affect algorithmic progress, but we have ~no data on what happens when you sample researchers from a very different (and superhuman) capability profile. It seems to me entirely plausible that a few months into the intelligence explosion, the best AI researchers are qualitatively superintelligent enough that their research advances per month aren’t the sort of thing that could be done by ~any number of humans[1] acting in parallel in a month. I acknowledge that this is probably not tractable to model, but that seems like a problem because it seems to me that this qualitative superintelligence is a (maybe the) key driving force of the intelligence explosion.
Some intuition pumps for why this seems reasonably likely:
My understanding is that historians of science disagree on whether science is driven mostly by a few geniuses or not. It probably varies by discipline, and by how understanding-driven progress is. Compared to other fields in hard STEM, ML is probably less understanding-driven right now, but it is still relatively understanding-driven. I think there are good reasons to think that it could plausibly transition to being more understanding-driven when the researchers become superhuman, because interp, agent foundations, GOFAI etc. haven’t made zero progress and don’t seem fundamentally impossible to me. And if capabilities research becomes loaded on understanding very complicated things, then it could become extremely dependent on quite how capable the most capable researchers are, in a way that can’t easily be substituted for by more human-level researchers.
Suppose I take a smart human and give them the ability/bandwidth to memorise and understand the entire internet. That person would be really different to any normal human, and also really different to any group of humans. So when they try to do research, they approach the tree of ideas to pick the low hanging fruit from a different direction to all of society’s research efforts beforehand, so it seems possible that from their perspective there is a lot of low hanging fruit left on the tree — lots of things that seem easy from their vantage point and nearly impossible to grasp from our perspective[2]. And research into the diminishing returns to ideas that we’ve seen in the field so far is not useful for predicting how much research progress that enhanced human would make in their first year.
It seems hard to know quite how many angles of approach there are on the tree of ideas, but it seems possible to me that on more than one occasion when you build a new AI that is now the most intelligent being in the world, it starts doing research and finds many ideas that are easy for it and near impossible for all the beings in the world that came before it.
[1] or at least only by an extremely large number of humans, who are doing something more like brute force search and less like thinking
[2] This is basically the same idea as Dwarkesh’s point that a human-level LLM should be able to make all sorts of new discoveries by connecting dots that humans can’t connect because we can’t read and take in the whole internet.
I think the biggest-alignment relevant update is that I expected RL fine-tuning over longer horizons (or even model-based RL a la AlphaZero) to be a bigger deal. I was really worried about it significantly improving performance and making alignment harder. In 2018-2019 my mainline picture was more like AlphaStar or AlphaZero, with RL fine-tuning being the large majority of compute. I’ve updated about this and definitely acknowledge I was wrong.[3] I don’t think it totally changes the picture though: I’m still scared of RL, I think it is very plausible it will become more important in the future, and think that even the kind of relatively minimal RL we do now can introduce many of the same risks.
Curious to hear how you would revisit this prediction in light of reasoning models? Seems like you weren’t as wrong as you thought a year ago, but maybe you still think there are some key ways your predictions about RL finetuning were off?
This was really useful to read. Thanks very much for writing these posts!
Very happy you did this!
Do you have any idea about whether the difference between unlearning success on synthetic facts fine-tuned in after pretraining vs real facts introduced during pretraining comes mainly from the ‘synthetic’ part or the ‘fine-tuning’ part? I.e. if you took the synthetic facts dataset and spread it out through the pretraining corpus, do you expect it would be any harder to unlearn the synthetic facts? Or maybe this question doesn’t make sense because you’d have to make the dataset much larger or something to get it to learn the facts at all during pretraining? If so, it seems like a pretty interesting research question to try to understand which properties a dataset of synthetic facts needs to have to defeat unlearning.
Research directions Open Phil wants to fund in technical AI safety
Open Philanthropy Technical AI Safety RFP - $40M Available Across 21 Research Areas
Attribution-based parameter decomposition
Fair point. I guess I still want to say that there’s a substantial amount of ‘come up with new research agendas’ (or like sub-agendas) to be done within each of your bullet points, but I agree the focus on getting trustworthy slightly superhuman AIs and then not needing control anymore makes things much better. I also do feel pretty nervous about some of those bullet points as paths to placing so much trust in your AI systems that you don’t feel like you want to bother controlling/monitoring them anymore, and the ones that go furthest towards giving me enough trust in the AIs to stop control are also the ones that seem to have the most wide-open research questions (eg EMs in the extreme case). But I do want to walk back some of the things in my comment above that apply only to aligning very superintelligent AI.
If you are (1) worried about superintelligence-caused x-risk and (2) have short timelines to both TAI and ASI, it seems like the success or failure of control depends almost entirely on getting the early TAIs to do stuff like “coming up with research agendas”? Like, most people (in AIS) don’t seem to think that unassisted humans are remotely on track to develop alignment techniques that work for very superintelligent AIs within the next 10 years — we don’t really even have any good ideas for how to do that that haven’t been tested. Therefore if we have very superintelligent AIs within the next 10 years (eg 5y till TAI and 5y of RSI), and if we condition on having techniques for aligning them, then it seems very likely that these techniques depend on novel ideas and novel research breakthroughs made by AIs in the period after TAI is developed. It’s possible that most of these breakthroughs are within mechinterp or similar, but that’s a pretty loose constraint, and ‘solve mechinterp’ is really not much more of a narrow, well-scoped goal than ‘solve alignment’. So it seems like optimism about control rests somewhat heavily on optimism that controlled AIs can safely do things like coming up with new research agendas.
[edit: I’m now thinking that actually the optimal probe vector is also orthogonal to the other features’ representation vectors, so maybe the point doesn’t stand. In general, I think it is probably a mistake to talk about activation vectors as linear combinations of feature vectors, rather than as vectors that can be projected into a set of interpretable readoff directions. see here for more.]
Yes, I’m calling the representation vector the same as the probing vector. Suppose my activation vector can be written as $a = \sum_i f_i d_i$, where $f_i$ are feature values and $d_i$ are feature representation vectors. Then the probe vector which minimises MSE (explains most of the variance) is just $d_i$. To avoid off-target effects, the vector you want to steer with for feature $i$ might be the vector that is most ‘surgical’: it only changes the value of this feature and no other features are changed. In that case it should be the vector that lies orthogonal to the other representation vectors $\{d_j\}_{j \neq i}$, which is only the same as $d_i$ if the set $\{d_i\}$ are orthogonal.
Obviously I’m working with a non-overcomplete basis of feature representation vectors here. If we’re dealing with the overcomplete case, then it’s messier. People normally talk about ‘approximately orthogonal vectors’, in which case the most surgical steering vector is approximately $d_i$ anyway, but (handwaving) you can also talk about something like ‘approximately linearly independent vectors’, in which case my point stands I think (note that SAE decoder directions are definitely not approximately orthogonal). For something less handwavey see this appendix.
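For concreteness, a toy numerical sketch of the non-overcomplete case (the specific matrix below is just an illustrative example I made up, not anything from the post): with non-orthogonal representation vectors, steering along $d_i$ changes the readoff of the other features, while steering along the dual vector (the one orthogonal to all the other $d_j$) changes only feature $i$.

```python
import numpy as np

# Non-orthogonal (but linearly independent) feature representation vectors d_i,
# stacked as the columns of D, normalised to unit length. 3 features in 3 dims.
D = np.array([[1.0, 0.6, 0.1],
              [0.0, 0.8, 0.2],
              [0.0, 0.0, 0.9]])
D /= np.linalg.norm(D, axis=0)

# Dual basis: the columns g_i of G satisfy g_i . d_j = delta_ij, i.e. each g_i
# is orthogonal to every *other* feature direction d_j (j != i).
G = np.linalg.inv(D).T

f = np.array([0.3, -1.2, 0.7])   # feature values
a = D @ f                        # activation vector a = sum_i f_i d_i

def readoffs(vec):
    """Read off each feature by dotting with its representation vector d_i
    (the 'probe' in the sense used above)."""
    return D.T @ vec

alpha = 2.0
steer_with_d = readoffs(a + alpha * D[:, 0]) - readoffs(a)  # changes all readoffs
steer_with_g = readoffs(a + alpha * G[:, 0]) - readoffs(a)  # changes only feature 0

print("steering along d_0 changes readoffs by:", steer_with_d.round(3))
print("steering along g_0 changes readoffs by:", steer_with_g.round(3))
```

If you make the columns of D orthogonal, the two steering directions coincide, which is the special case where probe vector = steering vector.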
A thought triggered by reading issue 3:
I agree issue 3 seems like a potential problem with methods that optimise for sparsity too much, but it doesn’t seem that directly related to the main thesis? At least in the example you give, it should be possible in principle to notice that the space can be factored as a direct sum without having to look to future layers. I guess what I want to ask here is:
It seems like there is a spectrum of possible views you could have here:
(1) It’s achievable to come up with sensible ansatzes (sparsity, linear representations, if we see the possibility to decompose the space into direct sums then we should do that, and so on) which will get us most of the way to finding the ground truth features, but there are edge cases/counterexamples which can only be resolved by looking at how the activation vector is used. This is compatible with the example you gave in issue 3, where the space is factorisable into a direct sum, which seems pretty natural/easy to look for in advance, although of course that’s the reason you picked that particular structure as an example.
(2) There are many many ways to decompose an activation vector, corresponding to many plausible but mutually incompatible sets of ansatzes, and the only way to know which is correct for the purposes of understanding the model is to see how the activation vector is used in the later layers.
(2a) Maybe there are many possible decompositions but they are all/mostly straightforwardly related to each other by eg a sparse basis transformation, so finding any one decomposition is a step in the right direction.
(2b) Maybe not that.
(3) Any sensible approach to decomposing an activation vector without looking forward to subsequent layers will be actively misleading. The right way to decompose the activation vector can’t be found in isolation with any set of natural ansatzes, because the decomposition depends intimately on the way the activation vector is used.
The main strategy being pursued in interpretability today is (insofar as interp is about fully understanding models):
First decompose each activation vector individually. Then try to integrate the decompositions of different layers together into circuits. This may require merging found features into higher level features, or tweaking the features in some way, or filtering out some features which turn out to be dataset features. (See also superseding vs supplementing superposition).
This approach is betting that the decompositions you get when you take each vector in isolation are a (big) step in the right direction, even if they require modification, which is more compatible with stance (1) and (2a) in the list above. I don’t think your post contains any knockdown arguments that this approach is doomed (do you agree?), but it is maybe suggestive. It would be cool to have some fully reverse engineered toy models where we can study one layer at a time and see what is going on.
Nice post! Re issue 1, there are a few things that you can do to work out if a representation you have found is a ‘model feature’ or a ‘dataset feature’. You can:
Check if intervening on the forward pass to modify this feature produces the expected effect on outputs. Caveats:
The best vector for probing is not the best vector for steering (in general the inverse of a matrix is not its transpose, and finding a basis of steering vectors from a basis of probe vectors involves inverting the basis matrix).
It’s possible that the feature you found is causally upstream of some features the model has learned, and even if the model hasn’t learned this feature, changing it affects things the model is aware of. OTOH, I’m not sure whether I want to say that this feature has not been learned by the model in this case.
Some techniques eg crosscoders don’t come equipped with a well defined notion of intervening on the feature during a forward pass.
Nonetheless, we can still sometimes get evidence this way, in particular about whether our probe has found subtle structure in the data that is really causally irrelevant to the model. This is already a common technique in interpretability (see eg the inimitable Golden Gate Claude, and many more systematic steering tests like this one).
Run various shuffle/permutation controls:
Measure the selectivity of your feature finding technique: replace the structure in the data with some new structure (or just remove the structure) and then see if your probe finds that new structure. To the extent that the probe can learn the new structure, it is not telling you about what the model has learned.
Most straightforwardly: if you have trained a supervised probe, you can train a second supervised probe on a dataset with randomised labels, and look at how much more accurate the probe is when trained on data with true labels (a minimal sketch of this control is below, after this list). This can help distinguish between the hypothesis that you have found a real variable in the model, and the null hypothesis that the probing technique is powerful enough to find a direction that can classify any dataset with that accuracy. Selectivity tests should do things like match the bias of the train data (eg if training a probe on a sparsely activating feature, then the value of the feature is almost always zero and that should be preserved in the control).
You can also test unsupervised techniques like SAEs this way by training them on random sequences of tokens. There are probably more sophisticated controls that can be introduced here: eg you can try to destroy all the structure in the data and replace it with random structure that is still sparse in the same sense, and so on. In addition to experiments that destroy the probe training data, you can also run experiments that destroy the structure in the model weights. To the extent that the probe works here, it is not telling you about what the model has learned.
For example, reinitialise the weights of the model, and train the probe/SAE/look at the PCA directions. This is a weak control: a stronger control could do something like reinitialising the weights of the model in a way that matches the eigenspectrum of each weight matrix to the eigenspectrum of the corresponding matrix in the trained model (to rule out things like the SAE not working in the randomised model because the activation vectors are too small, etc.), although that control is still quite weak.
This control was used nicely in Towards Monosemanticity here, although I think much more research of this form could be done with SAEs and their cousins. I am told by Adam Shai that in experimental neuroscience, it is something of a sport to come up with better and better controls for testing the hypothesis that you have identified structure. Maybe some of that energy should be imported to interp?
Probably some other things not on my mind right now??
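Here is roughly what I have in mind for the shuffled-label control mentioned above (a minimal sketch assuming sklearn and a linear probe; it doesn’t implement the refinement of matching the bias of the training data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_selectivity(acts, labels, seed=0):
    """Shuffled-label control for a supervised probe.
    acts: (n_examples, d_model) activations; labels: (n_examples,) binary labels.
    Returns the cross-validated accuracy of the same probe architecture trained
    on the real labels and on a random permutation of them. The gap between the
    two is the 'selectivity': how much of the accuracy reflects structure tied
    to the labels, rather than what this probe class achieves on any labelling."""
    rng = np.random.default_rng(seed)
    probe = LogisticRegression(max_iter=1000)
    true_acc = cross_val_score(probe, acts, labels, cv=5).mean()
    control_acc = cross_val_score(probe, acts, rng.permutation(labels), cv=5).mean()
    return true_acc, control_acc
```

The interesting quantity is the gap between the two accuracies, not the raw accuracy of the true-label probe on its own.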
I am aware that there is less use in being able to identify whether your features are model features or dataset features than there is in having a technique that zero-shot identifies model features only. However, a reliable set of tools for distinguishing what type of feature we have found would give us feedback loops that could help us search for good feature-finding techniques. eg. good controls would give us the freedom to do things like searching over (potentially nonlinear) probe architectures for those with a high accuracy relative to the control (in the absence of the control, searching over architectures would lead us to more and more expressive nonlinear probes that tell us nothing about the model’s computation). I’m curious if this sort of thing would lead us away from treating activation vectors in isolation, as the post argues.
Strong upvoted. I think the idea in this post could (if interpreted very generously) turn out to be pretty important for making progress at the more ambitious forms of interpretability. If we/the AIs are able to pin down more details about what constitutes a valid learning story or a learnable curriculum, and tie that to the way gradient updates can be decomposed into signal on some circuit and noise on the rest of the network, then it seems like we should be able to understand each circuit as the endpoint of a training story, and each part of the training story should correspond to a simple modification of the circuit to add some more complexity. This is potentially better for interpretability than if it were easy for networks to learn huge chunks of structure all at once. How optimistic are you about there being general insights to be had about the structures of learnable curricula and their relation to networks’ internal structure?
I either think this is wrong or I don’t understand.
What do you mean by ‘maximising compounding money?’ Do you mean maximising expected wealth at some specific point in the future? Or median wealth? Are you assuming no time discounting? Or do you mean maximising the expected value of some sort of area under the curve of wealth over time?
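To illustrate why I ask (a toy simulation I made up, repeated double-or-nothing bets that win 60% of the time): the strategy that maximises expected final wealth and the one that maximises median final wealth come apart completely.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims, n_rounds, p_win = 100_000, 10, 0.6
wins = rng.random((n_sims, n_rounds)) < p_win

# Strategy A: bet the whole bankroll every round (maximises *expected* wealth).
all_in = np.where(wins, 2.0, 0.0).prod(axis=1)
# Strategy B: bet 20% of the bankroll every round (the Kelly fraction for these odds).
fractional = np.where(wins, 1.2, 0.8).prod(axis=1)

print(f"all-in     : mean {all_in.mean():.2f}, median {np.median(all_in):.2f}")
print(f"fractional : mean {fractional.mean():.2f}, median {np.median(fractional):.2f}")
```

In the simulation the all-in strategy has a much higher mean but a median of zero, while the fractional strategy has a modest mean and a median comfortably above 1, so ‘maximising compounding money’ really does depend on which of these you mean.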
I’m not sure I understand your question, but are you asking ‘in what sense are there two networks in series rather than just one deeper network’? The answer to that would be: parts of the inputs to a later small network could come from the outputs of many earlier small networks. Provided the later subnetwork is still sparsely used, it could have a different distribution of when it is used to any particular earlier subnetwork. A classic simple example is how the left-orientation dog detector and the right-orientation dog detector in InceptionV1 fire sort of independently, but both their outputs are inputs to the any-orientation dog detector (which in this case is just computing an OR).
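A minimal sketch of that example with made-up firing rates (plain numpy, not InceptionV1 itself): two sparsely and roughly independently firing earlier ‘subnetworks’ feed a later unit that just computes an OR of their outputs, so the later unit is used on a different (broader) distribution of inputs than either earlier one, even though its inputs come entirely from their outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
n_images = 10_000
# Two earlier subnetworks that fire sparsely and roughly independently.
left_dog = rng.random(n_images) < 0.05
right_dog = rng.random(n_images) < 0.05

# A later subnetwork whose inputs are the outputs of both earlier ones:
# a single thresholded ReLU unit computing an OR.
any_dog = np.maximum(left_dog.astype(float) + right_dog.astype(float) - 0.5, 0.0) > 0

print(f"P(left) = {left_dog.mean():.3f}, P(right) = {right_dog.mean():.3f}, "
      f"P(any) = {any_dog.mean():.3f}")
```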
Really great post, strong upvoted. I was a fan in particular of the selection of research agendas you included at the end.
I’m curious why you think this? It seems like there’s some preliminary evidence that models are becoming increasingly capable of manipulating their thoughts. But more importantly, I’m wondering what role you think this cognitive oversight will play in making safe powerful AI? If this oversight will be used as a feedback signal, then I would expect that avoiding obfuscating the cognition does become harder at scale. I do think we should train based on cognitive oversight at least some amount, and there are definitely ways to do this which lead to less direct pressure for cognitive obfuscation than others (eg using white box techniques only in held out alignment evals rather than as part of the loss function), but it still seems like cognitive oversight should predictably get somewhat harder with scale, even if it scales more favourably than behavioural oversight?