Thanks for writing this! I’m quite excited to learn more about your meta-agenda and your research process, and reading this prompted me to reflect on my own research process.
But it feels to me like egregious misalignment is an extreme and somewhat strange failure mode and it should be possible to avoid it regardless of how the empirical facts shake out.
So you don’t think that we could have a result of the sort “with these empirical facts, egregious misalignment is either certain or very hard to defend against, and so we should push towards not building AIs that way”? Or is it more that, even with such arguments, you see incentives for people to build AIs that way anyway, and so we might as well consider that we have to solve the problem even in such problematic cases?
This is a much higher bar for an algorithm to meet, so it may just be an impossible task. But if it’s possible, there are several ways in which it could actually be easier:
We can potentially iterate much faster, since it’s often easier to think of a single story about how an algorithm can fail than it is to characterize its behavior in practice.
We can spend a lot of our time working with simple or extreme toy cases that are easier to reason about, since our algorithm is supposed to work even in these cases.
We can find algorithms that have a good chance of working in the future even if we don’t know what AI will look like or how quickly it will advance, since we’ve been thinking about a very wide range of possible failure cases.
Of these, only the last one looks to me like it’s making things simpler. The first seems misleading: what we need is a universal quantification over plausible stories, which I would guess requires understanding the behavior. Said differently, if you have to handle every plausible scenario, then simply testing a few stories doesn’t cut it. As for the second, my personal worry with work on toy models is that the solutions work on test cases but not on practical ones, not the other way around.
I’d guess there’s a 25–50% chance that we can find an alignment strategy that looks like it works, in the sense that we can’t come up with a plausible story about how it leads to egregious misalignment. That’s a high enough probability that I’m very excited to gamble on it. Moreover, if it fails I think we’re likely to identify some possible “hard cases” for alignment — simple situations where egregious misalignment feels inevitable.
Reading that paragraph, I feel like you addressed some of my questions from above. One thing that I only understood here is that you want a solution such that we can’t think of a plausible scenario where it leads to egregious misalignment, not a solution such that there isn’t any such plausible scenario. I guess your reasons here are basically the same as the ones for using ascription universality with regard to a human’s epistemic perspective.
What this looks like (3 examples)
Your rundown of examples from your research was really helpful, not only to get a grip on the process, but also because it clarified how your different proposals were successively refined. I think it might be worth making it its own post, maybe with more examples, to show how your “stable” evolved over the years.
My research basically involves alternating between “think of a plausible alignment algorithm” and “think of a plausible story about how it fails.”
This made me think of this famous paper in the theory of distributed computing, and especially what Nancy Lynch, the author, says about the process of working on impossibility results:
How does one go about working on an impossibility proof? [...]
Then it’s time to begin the game of playing the positive and negative directions of a proof against each other. My colleagues and I have often worked alternatively on one direction and on the other, in each case until we got stuck. It is not a good idea to work just on an impossibility result, because there is always the unfortunate possibility that the task you are trying to prove is impossible is in fact possible, and some algorithm may surface.
I’m always thinking about a stable of possible alignment strategies and possible stories about how each strategy can fail. Depending on the current state of play, there are a bunch of different things to do:
I expect this description of the process to be really helpful to many starting researchers who don’t know where to push when one direction or approach fails.
I think there’s a reasonable chance of empirical work turning up unknown unknowns that change how we think about alignment, or to find empirical facts that make alignment easier. We want to get those sooner rather than later.
This is the main reason I’m excited by empirical work.
For the objections and your response, I don’t have any specific comment, except that I pretty much agree with most of what you say. On the differences with traditional theoretical computer science, I feel like the biggest one right now is that most of the work here lies in “grasping towards the precise problem” instead of “solving a well-defined precise problem”. I would expect that this is because the problem is harder, because the field is younger and has less theoretical work behind it, and because we are not satisfied with simply working on a tractable and/or exciting precise problem: it has to be relevant to alignment.
The first seems misleading: what we need is a universal quantification over plausible stories, which I would guess requires understanding the behavior.
You get to iterate fast until you find an algorithm where it’s hard to think of failure stories. And you get to work on toy cases until you find an algorithm that actually works in all the toy cases. I think we’re a long way from meeting those bars, so we’ll get to iterate fast for a while. After we meet those bars, it’s an open question how close we’d be to something that actually works. My suspicion is that we’d have the right basic shape of an algorithm (especially if we are good at thinking of possible failures).
One thing that I only understood here is that you want a solution such that we can’t think of a plausible scenario where it leads to egregious misalignment, not a solution such that there isn’t any such plausible scenario. I guess your reasons here are basically the same as the ones for using ascription universality with regard to a human’s epistemic perspective.
I feel like these distinctions aren’t important until we get to an algorithm for which we can’t think of a failure story (which feels a long way off). At that point the game kind of flips around, and we try to come up with a good story for why it’s impossible to come up with a failure story. Maybe that gives you a strong security argument. If not, then you have to keep trying on one side or the other, though I think you should definitely be starting to prioritize applied work more.