My theory of impact for interpretability:
I’ve been meaning to write this out properly for almost three years now. Clearly, it’s not going to happen. So, you’re getting an improper quick and hacky version instead.
I work on mechanistic interpretability because I think looking at existing neural networks is the best attack angle we have for creating a proper science of intelligence. I think a good basic grasp of this science is a prerequisite for most of the important research we need to do to align a superintelligence to even get properly started. I view the kind of research I do as somewhat close in kind to what John Wentworth does.
Outer alignment
For example, one problem we have in alignment is that even if we had some way to robustly point a superintelligence at a specific target, we wouldn’t know what to point it at. E.g. famously, we don’t know how to write “make me a copy of a strawberry and don’t destroy the world while you do it” in math. Why don’t we know how to do that?
I claim one reason we don’t know how to do that is that ‘strawberry’ and ‘not destroying something’ are fuzzy abstract concepts that live in our heads, and we don’t know what those kinds of fuzzy abstract concepts correspond to in math or code. But GPT-4 clearly understands what a ‘strawberry’ is, at least in some sense. If we understood GPT-4 well enough to not be confused about how it can correctly answer questions about strawberries, maybe we also wouldn’t be quite so confused anymore about what fuzzy abstractions like ‘strawberry’ correspond to in math or code.
Inner alignment
Another problem we have in alignment is that we don’t know how to robustly aim a superintelligence at a specific target. To do that at all, it seems like you might first want to have some notion of what ‘goals’ or ‘desires’ correspond to mechanistically in real agentic-ish minds. I don’t expect this to be as easy as looking for the ‘goal circuits’ in Claude 3.7. My guess is that by default, dumb minds like humans and today’s AIs are too incoherent to have their desires correspond directly to a clear, salient mechanistic structure we can just look at. Instead, I think mapping ‘goals’ and ‘desires’ in the behavioural sense back to the mechanistic properties of the model that cause them might be a whole thing. Understanding the basic mechanisms of the model in isolation mostly only shows you what happens on a single forward pass, while ‘goals’ seem like they’d be more of a many-forward-pass phenomenon. So we might have to tackle a whole second chapter of interpretability there before we get to be much less confused about what goals are.
But this seems like a problem you can only effectively attack after you’ve figured out much more basic things about how minds do reasoning moment-to-moment. Understanding how Claude 3.7 thinks about strawberries on a single forward pass may not be sufficient to understand much about the way its thinking evolves over many forward passes. Famously, just because you know how a program works and can see every function in it with helpful comments attached doesn’t yet mean you can predict much about what the program will do if you run it for a year. But trying to predict what the program will do if you run it for a year without first understanding what the functions in it even do seems almost hopeless. So, we should probably figure out how thinking about strawberries works first.
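As a toy illustration of the gap between one step and many steps, here is a minimal Python sketch. Everything in it is an assumption made for illustration: `forward` is a made-up arithmetic rule standing in for a single forward pass, `VOCAB_SIZE` is arbitrary, and `rollout` is a hypothetical name for the loop that feeds each output back in as input. It is not how any real model works. It only shows that the thing you would study for ‘goals’ is the whole rollout, while the step function is the thing you need to understand first.

```python
from typing import Callable, List

VOCAB_SIZE = 11  # hypothetical toy vocabulary: token ids 0..10


def forward(context: List[int]) -> int:
    """One 'forward pass': map the current context to the next token id.

    Toy stand-in rule, not a real model: it only looks at the last two tokens.
    """
    last = context[-1]
    prev = context[-2] if len(context) > 1 else 0
    return (3 * last + 5 * prev + 1) % VOCAB_SIZE


def rollout(step: Callable[[List[int]], int], prompt: List[int], n_steps: int) -> List[int]:
    """Run many forward passes, appending each output to the context."""
    context = list(prompt)
    for _ in range(n_steps):
        context.append(step(context))
    return context


trajectory = rollout(forward, prompt=[1, 2], n_steps=20)
print(trajectory)
```

Every line of `forward` is fully legible, yet the easiest way to learn what the 20-step trajectory looks like is still to run it.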
Understand what confuses us, not enumerate everything
To solve these problems, we don’t need an exact blueprint of all the variables in GPT-4 and their role in the computation. For example, I’d guess that a lot of the bits in the weights of GPT-4 are just taken up by database entries, memorised bigrams and trigrams and stuff like that. We definitely need to figure out how to decompile these things out of the weights. But after we’ve done that and looked at a couple of examples to understand the general pattern of what’s in there, most of it will probably no longer be very relevant for resolving our basic confusion about how GPT-4 can answer questions about strawberries. We do need to understand how the model’s cognition interfaces with its stored knowledge about the world. But we don’t need to know most of the details of that world knowledge. Instead, what we really need to understand about GPT-4 are the parts of it that aren’t just trigrams and databases and addition algorithms and basic induction heads and other stuff we already know how to do.
AI engineers in the year 2006 knew how to write a big database, and they knew how to do a vector search. But they didn’t know how to write programs that could talk, or understand what strawberries are, in any meaningful sense. GPT-4 can talk, and it clearly understands what a strawberry is in some meaningful sense. So something is going on in GPT-4 that AI engineers in the year 2006 didn’t already know about. That is what we need to understand if we want to know how it can do basic abstract reasoning.
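For concreteness, the two things credited to 2006-era engineering above can be sketched in a few lines of Python. This is a toy under stated assumptions: the corpus, the three-dimensional ‘embeddings’ and the function names `build_bigram_table`, `cosine` and `nearest` are all invented here for illustration, and nothing about it is meant to describe what is actually stored in GPT-4’s weights.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def build_bigram_table(tokens: List[str]) -> Dict[str, Counter]:
    """Memorise which token followed which: a plain lookup table of counts."""
    table: Dict[str, Counter] = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        table[current][following] += 1
    return table


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(query: List[float], index: Dict[str, List[float]]) -> Tuple[str, float]:
    """Brute-force vector search: the stored key most similar to the query."""
    scored = ((key, cosine(query, vec)) for key, vec in index.items())
    return max(scored, key=lambda pair: pair[1])


corpus = "the strawberry is red and the strawberry is sweet".split()
bigrams = build_bigram_table(corpus)
most_common_after_strawberry = bigrams["strawberry"].most_common(1)[0][0]

# Hand-written toy vectors (made-up numbers, purely illustrative).
index = {
    "strawberry": [0.9, 0.1, 0.0],
    "raspberry": [0.8, 0.2, 0.1],
    "tractor": [0.0, 0.1, 0.9],
}
best_key, best_score = nearest([1.0, 0.1, 0.0], index)
print(most_common_after_strawberry, best_key)
```

Neither piece involves anything resembling understanding, which is the point of the paragraph above: whatever lets GPT-4 reason about strawberries has to be something beyond lookups like these.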
Understanding what’s going on is also just good in general
People argue a lot about whether RLHF or Constitutional AI or whatnot would work to align a superintelligence. I think those arguments would be much more productive and comprehensible to outsiders[1] if the arguers agreed on what exactly those techniques actually do to the insides of current models. Maybe then, those discussions wouldn’t get stuck on debating philosophy so much.
And sure, yes, in the shorter term, understanding how models work can also help make techniques that more robustly detect whether a model is deceiving you in some way, or whatever.
Status?
Compared to the magnitude of the task in front of us, we haven’t gotten much done yet. Though the total number of smart-person-hours sunk into this is also still very small, by the standards of a normal scientific field. I think we’re doing very well on insights gained per smart-person-hour invested, compared to a normal field, and very badly on finishing up before our deadline.
But at least, poking at things that confused me about current deep learning systems has already helped me become somewhat less confused about how minds in general could work. I used to have no idea how any general reasoner in the real world could tractably favour simple hypotheses over complex ones, given that calculating the minimum description length of a hypothesis is famously very computationally difficult. Now, I’m not so confused about that anymore.
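To make ‘description length’ slightly more concrete, here is a rough Python sketch. It does not show the insight referred to above, and it is not a method for computing minimum description length. It only uses an off-the-shelf compressor (`zlib`, an arbitrary choice) to get a computable upper bound on how short a description of some data can be. Finding the true minimum is the hard part. The helper name `description_length_upper_bound` and the two example byte strings are assumptions made for this sketch.

```python
import random
import zlib


def description_length_upper_bound(data: bytes) -> int:
    """Length in bytes of one particular compressed encoding of `data`.

    Any fixed compressor only gives an upper bound on the shortest
    description; it says nothing about how far above the minimum it is.
    """
    return len(zlib.compress(data, 9))


rng = random.Random(0)  # fixed seed so the run is reproducible
regular = b"ab" * 500                                   # produced by a very short rule
noisy = bytes(rng.getrandbits(8) for _ in range(1000))  # pseudo-random bytes, no pattern zlib can exploit

regular_len = description_length_upper_bound(regular)
noisy_len = description_length_upper_bound(noisy)
print(regular_len, noisy_len)
```

Both inputs are 1000 bytes long, but the compressor finds a far shorter encoding for the regular one. A cheap proxy like this can rank obviously simple data below obviously complex data without ever solving the intractable minimisation.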
I hope that as we understand the neural networks in front of us more, we’ll get more general insights like that, insights that say something about how most computationally efficient minds may work, not just our current neural networks. If we manage to get enough insights like this, I think they could form a science of minds on the back of which we could build a science of alignment. And then maybe we could do something as complicated and precise as aligning a superintelligence on the first try.
The LIGO gravitational wave detector probably had to work right on the first build, or they’d have wasted a billion dollars. It’s not like they could’ve built a smaller detector first to test the idea, not on a real gravitational wave. So, getting complicated things in a new domain right on the first critical try does seem doable for humans, if we understand the subject matter to the level we understand things like general relativity and laser physics. That kind of understanding is what I aim to get for minds.
At present, it doesn’t seem to me like we’ll have time to finish that project. So, I think humanity should probably try to buy more time somehow.
[1] Like, say, politicians. Or natsec people.
I signed up just to comment on this:
“The LIGO gravitational wave detector probably had to work right on the first build, or they’d have wasted a billion dollars. It’s not like they could’ve built a smaller detector first to test the idea, not on a real gravitational wave.”
LIGO did not work right on the first build. The original LIGO ran from 2002 to 2010 and detected nothing. They hoped it would be sensitive enough to detect gravitational waves, but it wasn’t. Instead, they learned about the noise sources they would have to deal with, which helped them construct a better detector that was able to do the job. So this really isn’t a good example to support the point you’re making.
How much money would you guess was lost on this?
I think you’d be hard-pressed to get a scientist to admit that the money was lost. ;)
Honestly, it’s not obvious that it would have been possible to do Advanced LIGO without the experience from the initial run, which is kind of the point I was making: we don’t usually have tasks that humanity needs to get right on the first try. On the contrary, humanity usually needs to fail a few times first!
But the initial budget was around $400 million, and the upgrade took another $200 million. I don’t know how much was spent operating the experiment in its initial run, which I guess would be the cleanest proxy for money “wasted”, if you’re imagining a counterfactual where they got it right on the first try.
Nice, I was going to write more or less exactly this post. I agree with everything in it, and this is the primary reason I’m interested in mechinterp.
Basically “all” the concepts that are relevant to safely building an ASI are fuzzy in the way you described: what the AI “values”, corrigibility, deception, instrumental convergence, the degree to which the AI is doing world-modeling, and so on.
If we had a complete science of mechanistic interpretability, I think a lot of the problems would become very easy. “Locate the human flourishing concept in the AI’s world model and jack that into the desire circuit. Afterwards, find the deception feature and the power-seeking feature and turn them to zero just to be sure.” (this is an exaggeration)
The only thing I disagree with is the Outer Misalignment paragraph. Outer Misalignment seems like one of the issues that wouldn’t be solved, largely due to Goodhart’s curse type stuff. This article by Scott explains my hypothetical remaining worries well: https://slatestarcodex.com/2018/09/25/the-tails-coming-apart-as-metaphor-for-life/
Even if we understood the circuitry underlying the “values” of the AI quite well, that doesn’t automatically let us extrapolate the values of the AI super OOD.
Even if we find that, “Yes boss, the human flourishing thing is correctly plugged into the desire thing, it’s a good LLM, sir”, subtle differences in the human flourishing concept could really, really fuck us over as the AGI recursively self-improves into an ASI and optimizes the galaxy.
But, if we can use this to make the AI somewhat corrigible, which, idk, might be possible, I’m not 100% sure, maybe we could sidestep some of these issues.
Any thoughts about this?
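There is a reason that paragraph says
“I claim one reason we don’t know how to do that is that ‘strawberry’ and ‘not destroying something’ are fuzzy abstract concepts that live in our heads”
rather than
“I claim the reason we don’t know how to do that is that ‘strawberry’ and ‘not destroying something’ are fuzzy abstract concepts that live in our heads”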
My claim here is that good mech interp helps you be less confused about outer alignment[1], not that what I’ve sketched here suffices to solve outer alignment.
[1] Outer alignment in the wider sense of ‘the problem of figuring out what target to point the AI at’.
Well, my model is that the primary reason we’re unable to deal with deceptive alignment or goal misgeneralization is that we’re confused, but that the reason we don’t have a solution to Outer Alignment is that it’s just cursed and a hard problem.
I recall a solution to the outer alignment problem as ‘minimise the number of options you deny to other agents in the world’, which is a more tractable version of ‘minimise net long-term changes to the world’. There is an article explaining this somewhere.