technical AI safety program associate at OpenPhil
jake_mendel
Three positive updates I made about technical grantmaking at Coefficient Giving (fka Open Phil)
Copy-pasting from a Slack thread:
I’ll list some work that I think is aspiring to build towards an answer to some of these questions, although lots of it is very toy:
On generalisation vs simple heuristics:
I think the nicest papers here are some toy model interp papers like progress measures for grokking and Grokked Transformers. These are two papers which present a pretty crisp distinction between levels of generality of different algorithms that perform similarly on the training set: in modular addition there are two levels of algorithm, and in the Grokked Transformers setting there are three (a hand-coded sketch of the generalising modular-addition algorithm is below, at the end of this list). The story of what the models end up doing is pretty nuanced, and it comes down to specific details of the pre-training data mix, which maybe isn’t that surprising. (If you squint, you can sort of see predictions from the Grokked Transformers paper being borne out in the nuance about when LLMs can do multi-hop reasoning, e.g. Yang et al.) But it seems pretty clear that if the training conditions are right, then you can get increasingly general algorithms learned even when simpler ones would do the trick.
I also think a useful idea (although less useful so far than the previous bullet) is that in certain situations, the way a model implements a memorising circuit can sort of naturally become a generalising circuit once it has memorised enough. The only concrete example of this that I know of (and it’s not been empirically validated) is the story from computation in superposition of how a head that memorises a lookup table of features to copy continuously becomes a copying head that generalises to new features once you make the lookup table big enough.
These are all toy settings where we can be pretty crisp about what we mean by memorisation and generalisation. I think the picture that we’re beginning to see emerge is that what counts as memorisation and generalisation is very messy and in the weeds and context-specific, but that transformers can generalise in powerful ways if the pre-training mix is right. What “right” means, and what “generalise in powerful ways” means in situations we care about are still unsolved technical questions.
Meanwhile, I also think it’s useful to just look at very qualitatively surprising examples of frontier models generalising far even if we can’t be precise about what memorisation and generalisation mean in that setting. Papers that I think are especially cool on this axis include emergent misalignment, anti-imitation, OOCR, LLMs are aware of their learned behaviors, LLMs are aware of how they’re being steered (I think it’s an especially interesting and risk-relevant type of generalisation when the policy starts to know something that is only ‘shown’ to the training process). However, I think it’s quite hard to look at these papers and make predictions about future generalisation successes and failures because we don’t have any basic understanding of how to talk about generalisation in these settings.
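As promised above, a hand-coded sketch of the two levels of algorithm in the modular addition case (my own toy code, not the papers’): a memorising lookup table vs the trig/Fourier construction that a grokked network implements. Summing cosines over all frequencies makes the argmax exact; the trained network only uses a handful of key frequencies.

```python
import numpy as np

p = 113  # the modulus used in the progress-measures-for-grokking setting

def memorising_solution(table, a, b):
    """Lowest level of generality: a lookup table that only covers training pairs."""
    return table.get((a, b))  # returns None off the training set

def generalising_solution(a, b):
    """The trig/Fourier algorithm: sum_k cos(2*pi*k*(a+b-c)/p) equals p when
    c == (a + b) mod p and 0 otherwise, so the argmax over c is exact."""
    c = np.arange(p)
    k = np.arange(p)
    logits = np.cos(2 * np.pi * np.outer(k, a + b - c) / p).sum(axis=0)
    return int(np.argmax(logits))

rng = np.random.default_rng(0)
for a, b in rng.integers(0, p, size=(20, 2)):
    assert generalising_solution(int(a), int(b)) == (a + b) % p
```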
On inductive biases and the speed prior:
I don’t have much to say about how useful the speed prior is at mitigating scheming, but I think there has been some interesting basic science on what the prior implied by neural network training is in practice. Obviously SLT comes to mind, and some people have tried to claim that SLT suggests that neural network training is actually more like the Solomonoff prior than the speed prior (e.g. Bushnaq), although I think that work is pretty shaky and may well not hold up.
I think something that’s missing from both the speed prior and the Solomonoff prior is a notion of learnability: the reason we have eyes and not cameras is not because eyes have lower K-complexity or lower Kt-complexity than cameras. It’s because there is a curriculum for learning eyes and there (probably) isn’t for cameras; neural network training also requires a training story/learnability. All the work that I know of exploring this is in very toy settings (the low hanging fruit prior, leap complexity and the sparse parity problem — the sparse parity task is sketched at the end of this list). I don’t think any of these results are strong enough to make detailed claims about p(deception) yet, and they don’t seem close.
OTOH, most learning in the future might be more like current in-context learning, and (very speculatively) it seems possible that in-context learning is more Bayesian (less path-dependent/learnability-dependent) than pre-training. See e.g. Riechers et al.
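For reference, the sparse parity task mentioned above, in minimal form (a hypothetical data generator written for illustration, not code from any of the cited papers): the label is the parity of a hidden subset of k bits, so each individual input bit is uncorrelated with the label on its own, which is what makes it a nice testbed for questions about curricula/learnability.

```python
import numpy as np

def sparse_parity_batch(n_samples, n_bits=50, k=4, seed=0):
    """(n, k)-sparse parity: inputs are random +/-1 vectors of length n_bits,
    and the label is the product (XOR in +/-1 encoding) of a fixed hidden
    subset of k coordinates. No single coordinate carries any signal about
    the label, so there is no obvious staircase of easier sub-tasks unless
    the data distribution provides one."""
    rng = np.random.default_rng(seed)
    support = rng.choice(n_bits, size=k, replace=False)  # hidden relevant bits
    x = rng.choice([-1.0, 1.0], size=(n_samples, n_bits))
    y = x[:, support].prod(axis=1)
    return x, y, support

x, y, support = sparse_parity_batch(1024)
print("hidden support:", support, "| label balance:", (y == 1).mean())
```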
Some random thoughts on what goals powerful AIs will have more generally:
I think we’ve seen some decent evidence that a lot of training with RL makes models obsessed with completing tasks. I think the main evidence here comes from the reward hacking results, but I also think the Apollo anti-scheming paper is an important result about how strong/robust this phenomenon is. Despite a reasonably concerted effort to train the model to care about something that’s in tension with task completion (being honest), RLVR instilled such a strong preference/heuristic for task completion that the model still ends up wanting to complete tasks enough to deceive etc. in test environments, even though the deliberative alignment training process doesn’t reward task completion at all and only rewards honesty (in fact, by the end of anti-scheming training the model does the honest thing in every training environment!). I don’t think it was super obvious a priori that RL would embed task-completion preference that strongly (overriding the human prior).
I think there are some other lessons we could glean here about the stickiness of goals in general. The anti-scheming results suggest to me that something about the quantity and diversity of the RLVR environments internalised the task-completion preference deeply enough that it was still present after a full training round designed to disincentivize it. Contrast this with results showing that safety training is very shallow and can be destroyed easily (e.g. BadLlama).
Very speculatively, I’m excited about the growing field of teaching models fake facts and trying to work out when they actually believe the fake facts. It seems possible that some techniques and ideas that were developed in order to get models to internalise beliefs deeply (and evaluating success) could be coopted for getting models to internalise preferences/goals deeply (and evaluating success).
Pretty obvious point, but I think the existence of today’s models and the relatively slow progress to human-level intelligence tells us that insofar as future AIs end up misaligned, their goals will pretty likely be similar to/indistinguishable from human values at a low level of detail, and it’s only when you zoom in that the values would be importantly different from humans’. Of course, this might be enough to kill us. This echoes the sense in which human values are importantly different from inclusive genetic fitness but not that different, and we do still have lots of kids etc. To spell out the idea: before the AI is smart enough to fully subvert training and guard its goals, we will have lots of ability to shape what goals it ends up with. At some point, if we fail to solve alignment, we will not be able to further refine its goals, but the goals it ends up guarding will be quite related to human goals, because they were formed by reward signals that did sort of touch those goals. Again, maybe this is obvious to everyone, but it does seem, at least to me, to be in contrast with references to squiggles/paperclips, which I think are more feasible to end up with if you imagine a Brain In A Basement style takeoff.
What are solenoidal flux corrections in this context?
Thanks for this post!
Caveat: I haven’t read this very closely yet, and I’m not an economist. I’m finding it hard to understand why you think it’s reasonable to model an increase in capabilities by an increase in number of parallel copies. That is: in the returns to R&D section, you look at data on how increasing numbers of human-level researchers in AI affect algorithmic progress, but we have ~no data on what happens when you sample researchers from a very different (and superhuman) capability profile. It seems to me entirely plausible that a few months into the intelligence explosion, the best AI researchers are qualitatively superintelligent enough that their research advances per month aren’t the sort of thing that could be done by ~any number of humans[1] acting in parallel in a month. I acknowledge that this is probably not tractable to model, but that seems like a problem because it seems to me that this qualitative superintelligence is a (maybe the) key driving force of the intelligence explosion.
Some intuition pumps for why this seems reasonably likely:
My understanding is that historians of science disagree on whether science is driven mostly by a few geniuses or not. It probably varies by discipline, and by how understanding-driven progress is. Compared to other fields in hard STEM, ML is probably less understanding-driven right now, but it is still relatively understanding-driven. I think there are good reasons to think that it could plausibly transition to being more understanding-driven when the researchers become superhuman, because interp, agent foundations, GOFAI etc. haven’t made zero progress and don’t seem fundamentally impossible to me. And if capabilities research becomes loaded on understanding very complicated things, then it could become extremely dependent on quite how capable the most capable researchers are, in a way that can’t easily be substituted for by more human-level researchers.
Suppose I take a smart human and give them the ability/bandwidth to memorise and understand the entire internet. That person would be really different to any normal human, and also really different to any group of humans. So when they try to do research, they approach the tree of ideas to pick the low hanging fruit from a different direction to all of society’s research efforts beforehand, so it seems possible that from their perspective there is a lot of low hanging fruit left on the tree — lots of things that seem easy from their vantage point and nearly impossible to grasp from our perspective[2]. And research into the diminishing returns to ideas that we’ve seen in the field so far is not useful for predicting how much research progress that enhanced human would make in their first year.
It seems hard to know quite how many angles of approach there are on the tree of ideas, but it seems possible to me that on more than one occasion when you build a new AI that is now the most intelligent being in the world, it starts doing research and finds many ideas that are easy for it and near impossible for all the beings in the world that came before it.
[1] or at least only by an extremely large number of humans, who are doing something more like brute force search and less like thinking
[2] This is basically the same idea as Dwarkesh’s point that a human-level LLM should be able to make all sorts of new discoveries by connecting dots that humans can’t connect because we can’t read and take in the whole internet.
I think the biggest-alignment relevant update is that I expected RL fine-tuning over longer horizons (or even model-based RL a la AlphaZero) to be a bigger deal. I was really worried about it significantly improving performance and making alignment harder. In 2018-2019 my mainline picture was more like AlphaStar or AlphaZero, with RL fine-tuning being the large majority of compute. I’ve updated about this and definitely acknowledge I was wrong.[3] I don’t think it totally changes the picture though: I’m still scared of RL, I think it is very plausible it will become more important in the future, and think that even the kind of relatively minimal RL we do now can introduce many of the same risks.
Curious to hear how you would revisit this prediction in light of reasoning models? Seems like you weren’t as wrong as you thought a year ago, but maybe you still think there are some key ways your predictions about RL finetuning were off?
This was really useful to read. Thanks very much for writing these posts!
Very happy you did this!
Do you have any idea about whether the difference between unlearning success on synthetic facts fine-tuned in after pretraining vs real facts introduced during pretraining comes mainly from the ‘synthetic’ part or the ‘fine-tuning’ part? I.e. if you took the synthetic facts dataset and spread it out through the pretraining corpus, do you expect it would be any harder to unlearn the synthetic facts? Or maybe this question doesn’t make sense because you’d have to make the dataset much larger or something to get it to learn the facts at all during pretraining? If so, it seems like a pretty interesting research question to try to understand which properties a dataset of synthetic facts needs to have to defeat unlearning.
Research directions Open Phil wants to fund in technical AI safety
Open Philanthropy Technical AI Safety RFP - $40M Available Across 21 Research Areas
Attribution-based parameter decomposition
Fair point. I guess I still want to say that there’s a substantial amount of ‘come up with new research agendas’ (or like sub-agendas) to be done within each of your bullet points, but I agree the focus on getting trustworthy slightly superhuman AIs and then not needing control anymore makes things much better. I also do feel pretty nervous about some of those bullet points as paths to placing so much trust in your AI systems that you don’t feel like you want to bother controlling/monitoring them anymore, and the ones that go furthest towards giving me enough trust in the AIs to stop control are also the ones that seem to have the most wide-open research questions (eg EMs in the extreme case). But I do want to walk back some of the things in my comment above that apply only to aligning very superintelligent AI.
If you are (1) worried about superintelligence-caused x-risk and (2) have short timelines to both TAI and ASI, it seems like the success or failure of control depends almost entirely on getting the early TAIs to do stuff like “coming up with research agendas”? Like, most people (in AIS) don’t seem to think that unassisted humans are remotely on track to develop alignment techniques that work for very superintelligent AIs within the next 10 years — we don’t really even have any good ideas for how to do that that haven’t been tested. Therefore if we have very superintelligent AIs within the next 10 years (eg 5y till TAI and 5y of RSI), and if we condition on having techniques for aligning them, then it seems very likely that these techniques depend on novel ideas and novel research breakthroughs made by AIs in the period after TAI is developed. It’s possible that most of these breakthroughs are within mechinterp or similar, but that’s a pretty loose constraint, and ‘solve mechinterp’ is really not much more of a narrow, well-scoped goal than ‘solve alignment’. So it seems like optimism about control rests somewhat heavily on optimism that controlled AIs can safely do things like coming up with new research agendas.
[edit: I’m now thinking that actually the optimal probe vector is also orthogonal to the other features’ representation vectors, so maybe the point doesn’t stand. In general, I think it is probably a mistake to talk about activation vectors as linear combinations of feature vectors, rather than as vectors that can be projected into a set of interpretable readoff directions. see here for more.]
Yes, I’m calling the representation vector the same as the probing vector. Suppose my activation vector can be written as $a = \sum_i f_i d_i$, where $f_i$ are feature values and $d_i$ are feature representation vectors. Then the probe vector which minimises MSE (explains most of the variance) is just $d_i$. To avoid off-target effects, the vector you want to steer with for feature $i$ might be the vector that is most ‘surgical’: it only changes the value of this feature and no other features are changed. In that case it should be the vector that lies orthogonal to the other representation vectors $\{d_j\}_{j \neq i}$, which is only the same as $d_i$ if the set $\{d_i\}$ are orthogonal.
Obviously I’m working with a non-overcomplete basis of feature representation vectors here. If we’re dealing with the overcomplete case, then it’s messier. People normally talk about ‘approximately orthogonal vectors’, in which case the most surgical steering vector is approximately $d_i$ anyway, but (handwaving) you can also talk about something like ‘approximately linearly independent vectors’, in which case my point stands I think (note that SAE decoder directions are definitely not approximately orthogonal). For something less handwavey see this appendix.
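For concreteness, a toy numerical sketch of the non-overcomplete case (the specific matrix below is just an illustrative example I made up, not anything from the post): with non-orthogonal representation vectors, steering along $d_i$ changes the readoff of the other features, while steering along the dual vector (the one orthogonal to all the other $d_j$) changes only feature $i$.

```python
import numpy as np

# Non-orthogonal (but linearly independent) feature representation vectors d_i,
# stacked as the columns of D, normalised to unit length. 3 features in 3 dims.
D = np.array([[1.0, 0.6, 0.1],
              [0.0, 0.8, 0.2],
              [0.0, 0.0, 0.9]])
D /= np.linalg.norm(D, axis=0)

# Dual basis: the columns g_i of G satisfy g_i . d_j = delta_ij, i.e. each g_i
# is orthogonal to every *other* feature direction d_j (j != i).
G = np.linalg.inv(D).T

f = np.array([0.3, -1.2, 0.7])   # feature values
a = D @ f                        # activation vector a = sum_i f_i d_i

def readoffs(vec):
    """Read off each feature by dotting with its representation vector d_i
    (the 'probe' in the sense used above)."""
    return D.T @ vec

alpha = 2.0
steer_with_d = readoffs(a + alpha * D[:, 0]) - readoffs(a)  # changes all readoffs
steer_with_g = readoffs(a + alpha * G[:, 0]) - readoffs(a)  # changes only feature 0

print("steering along d_0 changes readoffs by:", steer_with_d.round(3))
print("steering along g_0 changes readoffs by:", steer_with_g.round(3))
```

If you make the columns of D orthogonal, the two steering directions coincide, which is the special case where probe vector = steering vector.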
A thought triggered by reading issue 3:
I agree issue 3 seems like a potential problem with methods that optimise for sparsity too much, but it doesn’t seem that directly related to the main thesis? At least in the example you give, it should be possible in principle to notice that the space can be factored as a direct sum without having to look to future layers. I guess what I want to ask here is:
It seems like there is a spectrum of possible views you could have here:
(1) It’s achievable to come up with sensible ansatzes (sparsity, linear representations, if we see the possibility to decompose the space into direct sums then we should do that, and so on) which will get us most of the way to finding the ground truth features, but there are edge cases/counterexamples which can only be resolved by looking at how the activation vector is used. This is compatible with the example you gave in issue 3, where the space is factorisable into a direct sum, which seems pretty natural/easy to look for in advance, although of course that’s the reason you picked that particular structure as an example.
(2) There are many many ways to decompose an activation vector, corresponding to many plausible but mutually incompatible sets of ansatzes, and the only way to know which is correct for the purposes of understanding the model is to see how the activation vector is used in the later layers.
(2a) Maybe there are many possible decompositions but they are all/mostly straightforwardly related to each other by eg a sparse basis transformation, so finding any one decomposition is a step in the right direction.
(2b) Maybe not that.
(3) Any sensible approach to decomposing an activation vector without looking forward to subsequent layers will be actively misleading. The right way to decompose the activation vector can’t be found in isolation with any set of natural ansatzes, because the decomposition depends intimately on the way the activation vector is used.
The main strategy being pursued in interpretability today is (insofar as interp is about fully understanding models):
First decompose each activation vector individually. Then try to integrate the decompositions of different layers together into circuits. This may require merging found features into higher level features, or tweaking the features in some way, or filtering out some features which turn out to be dataset features. (See also superseding vs supplementing superposition).
This approach is betting that the decompositions you get when you take each vector in isolation are a (big) step in the right direction, even if they require modification, which is more compatible with stance (1) and (2a) in the list above. I don’t think your post contains any knockdown arguments that this approach is doomed (do you agree?), but it is maybe suggestive. It would be cool to have some fully reverse engineered toy models where we can study one layer at a time and see what is going on.
Nice post! Re issue 1, there are a few things that you can do to work out if a representation you have found is a ‘model feature’ or a ‘dataset feature’. You can:
Check if intervening on the forward pass to modify this feature produces the expected effect on outputs. Caveats:
The best vector for probing is not the best vector for steering (in general the inverse of a matrix is not its transpose, and finding a basis of steering vectors from a basis of probe vectors involves inverting the basis matrix).
It’s possible that the feature you found is causally upstream of some features the model has learned, and even if the model hasn’t learned this feature, changing it affects things the model is aware of. OTOH, I’m not sure whether I want to say that this feature has not been learned by the model in this case.
Some techniques eg crosscoders don’t come equipped with a well defined notion of intervening on the feature during a forward pass.
Nonetheless, we can still sometimes get evidence this way, in particular about whether our probe has found subtle structure in the data that is really causally irrelevant to the model. This is already a common technique in interpretability (see eg the inimitable Golden Gate Claude, and many more systematic steering tests like this one).
Run various shuffle/permutation controls:
Measure the selectivity of your feature finding technique: replace the structure in the data with some new structure (or just remove the structure) and then see if your probe finds that new structure. To the extent that the probe can learn the new structure, it is not telling you about what the model has learned.
Most straightforwardly: if you have trained a supervised probe, you can train a second supervised probe on a dataset with randomised labels, and look at how much more accurate the probe is when trained on data with true labels (a minimal sketch of this control is below, after this list). This can help distinguish between the hypothesis that you have found a real variable in the model, and the null hypothesis that the probing technique is powerful enough to find a direction that can classify any dataset with that accuracy. Selectivity tests should do things like match the bias of the train data (eg if training a probe on a sparsely activating feature, then the value of the feature is almost always zero and that should be preserved in the control).
You can also test unsupervised techniques like SAEs this way by training them on random sequences of tokens. There are probably more sophisticated controls that can be introduced here: eg you can try to destroy all the structure in the data and replace it with random structure that is still sparse in the same sense, and so on. In addition to experiments that destroy the probe training data, you can also run experiments that destroy the structure in the model weights. To the extent that the probe works here, it is not telling you about what the model has learned.
For example, reinitialise the weights of the model, and train the probe/SAE/look at the PCA directions. This is a weak control: a stronger control could do something like reinitialising the weights of the model in a way that matches the eigenspectrum of each weight matrix to the eigenspectrum of the corresponding matrix in the trained model (to rule out things like the SAE not working in the randomised model because the activation vectors are too small, etc.), although that control is still quite weak.
This control was used nicely in Towards Monosemanticity here, although I think much more research of this form could be done with SAEs and their cousins. I am told by Adam Shai that in experimental neuroscience, it is something of a sport to come up with better and better controls for testing the hypothesis that you have identified structure. Maybe some of that energy should be imported to interp?
Probably some other things not on my mind right now??
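Here is roughly what I have in mind for the shuffled-label control mentioned above (a minimal sketch assuming sklearn and a linear probe; it doesn’t implement the refinement of matching the bias of the training data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_selectivity(acts, labels, seed=0):
    """Shuffled-label control for a supervised probe.
    acts: (n_examples, d_model) activations; labels: (n_examples,) binary labels.
    Returns the cross-validated accuracy of the same probe architecture trained
    on the real labels and on a random permutation of them. The gap between the
    two is the 'selectivity': how much of the accuracy reflects structure tied
    to the labels, rather than what this probe class achieves on any labelling."""
    rng = np.random.default_rng(seed)
    probe = LogisticRegression(max_iter=1000)
    true_acc = cross_val_score(probe, acts, labels, cv=5).mean()
    control_acc = cross_val_score(probe, acts, rng.permutation(labels), cv=5).mean()
    return true_acc, control_acc
```

The interesting quantity is the gap between the two accuracies, not the raw accuracy of the true-label probe on its own.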
I am aware that there is less use in being able to identify whether your features are model features or dataset features than there is in having a technique that zero-shot identifies model features only. However, a reliable set of tools for distinguishing what type of feature we have found would give us feedback loops that could help us search for good feature-finding techniques. eg. good controls would give us the freedom to do things like searching over (potentially nonlinear) probe architectures for those with a high accuracy relative to the control (in the absence of the control, searching over architectures would lead us to more and more expressive nonlinear probes that tell us nothing about the model’s computation). I’m curious if this sort of thing would lead us away from treating activation vectors in isolation, as the post argues.
Strong upvoted. I think the idea in this post could (if interpreted very generously) turn out to be pretty important for making progress at the more ambitious forms of interpretability. If we/the AIs are able to pin down more details about what constitutes a valid learning story or a learnable curriculum, and tie that to the way gradient updates can be decomposed into signal on some circuit and noise on the rest of the network, then it seems like we should be able to understand each circuit as the endpoint of a training story, and each part of the training story should correspond to a simple modification of the circuit to add some more complexity. This is potentially better for interpretability than if it were easy for networks to learn huge chunks of structure all at once. How optimistic are you about there being general insights to be had about the structures of learnable curricula and their relation to networks’ internal structure?
I either think this is wrong or I don’t understand.
What do you mean by ‘maximising compounding money?’ Do you mean maximising expected wealth at some specific point in the future? Or median wealth? Are you assuming no time discounting? Or do you mean maximising the expected value of some sort of area under the curve of wealth over time?
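To illustrate why I ask (a toy simulation I made up, repeated double-or-nothing bets that win 60% of the time): the strategy that maximises expected final wealth and the one that maximises median final wealth come apart completely.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims, n_rounds, p_win = 100_000, 10, 0.6
wins = rng.random((n_sims, n_rounds)) < p_win

# Strategy A: bet the whole bankroll every round (maximises *expected* wealth).
all_in = np.where(wins, 2.0, 0.0).prod(axis=1)
# Strategy B: bet 20% of the bankroll every round (the Kelly fraction for these odds).
fractional = np.where(wins, 1.2, 0.8).prod(axis=1)

print(f"all-in     : mean {all_in.mean():.2f}, median {np.median(all_in):.2f}")
print(f"fractional : mean {fractional.mean():.2f}, median {np.median(fractional):.2f}")
```

In the simulation the all-in strategy has a much higher mean but a median of zero, while the fractional strategy has a modest mean and a median comfortably above 1, so ‘maximising compounding money’ really does depend on which of these you mean.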
I’m not sure I understand your question, but are you asking ‘in what sense are there two networks in series rather than just one deeper network’? The answer to that would be: parts of the inputs to a later small network could come from the outputs of many earlier small networks. Provided the later subnetwork is still sparsely used, it could have a different distribution of when it is used to any particular earlier subnetwork. A classic simple example is how the left-orientation dog detector and the right-orientation dog detector in InceptionV1 fire sort of independently, but both their outputs are inputs to the any-orientation dog detector (which in this case is just computing an OR).
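A minimal sketch of that example with made-up firing rates (plain numpy, not InceptionV1 itself): two sparsely and roughly independently firing earlier ‘subnetworks’ feed a later unit that just computes an OR of their outputs, so the later unit is used on a different (broader) distribution of inputs than either earlier one, even though its inputs come entirely from their outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
n_images = 10_000
# Two earlier subnetworks that fire sparsely and roughly independently.
left_dog = rng.random(n_images) < 0.05
right_dog = rng.random(n_images) < 0.05

# A later subnetwork whose inputs are the outputs of both earlier ones:
# a single thresholded ReLU unit computing an OR.
any_dog = np.maximum(left_dog.astype(float) + right_dog.astype(float) - 0.5, 0.0) > 0

print(f"P(left) = {left_dog.mean():.3f}, P(right) = {right_dog.mean():.3f}, "
      f"P(any) = {any_dog.mean():.3f}")
```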
Really great post, strong upvoted. I was a fan in particular of the selection of research agendas you included at the end.
I’m curious why you think this? It seems like there’s some preliminary evidence that models are becoming increasingly capable of manipulating their thoughts. But more importantly, I’m wondering what role you think this cognitive oversight will play in making safe powerful AI? If this oversight will be used as a feedback signal, then I would expect that avoiding obfuscating the cognition does become harder at scale. I do think we should train based on cognitive oversight at least some amount, and there are definitely ways to do this which lead to less direct pressure for cognitive obfuscation than others (eg using white box techniques only in held out alignment evals rather than as part of the loss function), but it still seems like cognitive oversight should predictably get somewhat harder with scale, even if it scales more favourably than behavioural oversight?