Conversation with Paul Christiano

AI Impacts talked to AI safety researcher Paul Christiano about his views on AI risk. With his permission, we have transcribed this interview.

Participants

Paul Christiano – OpenAI safety team
Asya Bergal – AI Impacts
Ronny Fernandez – AI Impacts
Robert Long – AI Impacts

Summary

We spoke with Paul Christiano on August 13, 2019. Here is a brief summary of that conversation:

AI safety is worth working on because AI poses a large risk and AI safety is neglected and tractable.
Christiano is more optimistic about the likely social consequences of advanced AI than some others in AI safety, in particular researchers at the Machine Intelligence Research Institute (MIRI), for the following reasons:
- The prior on any given problem reducing the expected value of the future by 10% should be low.
- There are several ‘saving throws’–ways in which, even if one thing turns out badly, something else can turn out well, such that AI is not catastrophic.
- Many algorithmic problems are either solvable within 100 years, or provably impossible; this inclines Christiano to think that AI safety problems are reasonably likely to be easy.
- MIRI thinks success is guaranteeing that unaligned intelligences are never created, whereas Christiano just wants to leave the next generation of intelligences in at least as good of a place as humans were when building them.
- ‘Prosaic AI’ that looks like current AI systems will be less hard to align than MIRI thinks:
  - Christiano thinks there’s at least a one-in-three chance that we’ll be able to solve AI safety on paper in advance.
  - A common view within ML is that we’ll successfully solve problems as they come up.
- Christiano has relatively less confidence in several inside view arguments for high levels of risk:
  - Building safe AI requires hitting a small target in the space of programs, but building any AI also requires hitting a small target.
  - Because Christiano thinks that the state of evidence is less clear-cut than MIRI does, Christiano also has a higher probability that people will become more worried in the future.
  - Just because we haven’t solved many problems in AI safety yet doesn’t mean they’re intractably hard– many technical problems feel this way and then get solved in 10 years of effort.
  - Evolution is often used as an analogy to argue that general intelligence (humans with their own goals) becomes dangerously unaligned with the goals of the outer optimizer (evolution selecting for reproductive fitness). But this analogy doesn’t make Christiano feel so pessimistic, e.g. he thinks that if we tried, we could breed animals that are somewhat smarter than humans and are also friendly and docile.
  - Christiano is optimistic about verification, interpretability, and adversarial training for inner alignment, whereas MIRI is pessimistic.
  - MIRI thinks the outer alignment approaches Christiano proposes are just obscuring the core difficulties of alignment, while Christiano is not yet convinced there is a deep core difficulty.
Christiano thinks there are several things that could change his mind and optimism levels, including:
- Learning about institutions and observing how they solve problems analogous to AI safety.
- Seeing whether AIs become deceptive and how they respond to simple oversight.
- Seeing how much progress we make on AI alignment over the coming years.
Christiano is relatively optimistic about his iterated amplification approach:
- Christiano cares more about making aligned AIs that are competitive with unaligned AIs, whereas MIRI is more willing to settle for an AI with very narrow capabilities.
- Iterated amplification is largely based on learning-based AI systems, though it may work in other cases.
- Even if iterated amplification isn’t the answer to AI safety, it’s likely to have subproblems in common with problems that are important in the future.
There are still many disagreements between Christiano and the Machine Intelligence Research Institute (MIRI) that are messy and haven’t been made precise.

This transcript has been lightly edited for concision and clarity.

Transcript

Asya Bergal: Okay. We are recording. I’m going to ask you a bunch of questions related to something like AI optimism.

I guess the proposition that we’re looking at is something like ‘is it valuable for people to be spending significant effort doing work that purports to reduce the risk from advanced artificial intelligence’? The first question would be to give a short-ish version of the reasoning around that.

Paul Christiano: Around why it’s overall valuable?

Asya Bergal: Yeah. Or the extent to which you think it’s valuable.

Paul Christiano: I don’t know, this seems complicated. I’m acting from some longtermist perspective, I’m like, what can make the world irreversibly worse? There aren’t that many things, we go extinct. It’s hard to go extinct, doesn’t seem that likely.

Robert Long: We keep forgetting to say this, but we are focusing less on ethical considerations that might affect that. We’ll grant…yeah, with all that in the background….

Paul Christiano: Granting long-termism, but then it seems like it depends a lot on what’s the probability? What fraction of our expected future do we lose by virtue of messing up alignment * what’s the elasticity of that to effort / how much effort?

Robert Long: That’s the stuff we’re curious to see what people think about.

Asya Bergal: I also just read your 80K interview, which I think probably covered like a lot of the reasoning about this.

Paul Christiano: They probably did. I don’t remember exactly what’s in there, but it was a lot of words.

I don’t know. I’m like, it’s a lot of doom probability. Like maybe I think AI alignment per se is like 10% doominess. That’s a lot. Then it seems like if we understood everything in advance really well, or just having a bunch of people working on now understanding what’s up, could easily reduce that by a big chunk.

Ronny Fernandez: Sorry, what do you mean by 10% doominess?

Paul Christiano: I don’t know, the future is 10% worse than it would otherwise be in expectation by virtue of our failure to align AI. I made up 10%, it’s kind of a random number. I don’t know, it’s less than 50%. It’s more than 10% conditioned on AI soon I think.

Ronny Fernandez: And that’s change in expected value.

Paul Christiano: Yeah. Anyway, so 10% is a lot. Then I’m like, maybe if we sorted all our shit out and had a bunch of people who knew what was up, and had a good theoretical picture of what was up, and had more info available about whether it was a real problem. Maybe really nailing all that could cut that risk from 10% to 5% and maybe like, you know, there aren’t that many people who work on it, it seems like a marginal person can easily do a thousandth of that 5% change. Now you’re looking at one in 20,000 or something, which is a good deal.
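
The back-of-the-envelope estimate above can be checked with a short sketch. It uses only the figures stated in the conversation, which Christiano describes as made up; the variable names are illustrative and not from the interview.

```python
from fractions import Fraction

# Figures as stated in the conversation (explicitly rough, made-up numbers).
baseline_risk = Fraction(10, 100)   # expected value lost to failed alignment
reduced_risk = Fraction(5, 100)     # risk if the field really sorts things out
marginal_share = Fraction(1, 1000)  # one marginal person's share of that change

# Total reduction in risk from the field as a whole: 10% - 5% = 1/20.
total_reduction = baseline_risk - reduced_risk

# One person's contribution: a thousandth of the 5-point reduction.
marginal_reduction = total_reduction * marginal_share

print(f"Total reduction: {total_reduction}")
print(f"Marginal person: {marginal_reduction}")
```

Under these assumptions the marginal contribution comes out to 1/20,000 of the future's expected value, matching the "one in 20,000" figure in the conversation.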

Asya Bergal: I think my impression is that that 10% is lower than some large set of people. I don’t know if other people agree with that.

Paul Christiano: Certainly, 10% is lower than lots of people who care about AI risk. I mean it’s worth saying, that I have this slightly narrow conception of what is the alignment problem. I’m not including all AI risk in the 10%. I’m not including in some sense most of the things people normally worry about and just including the like ‘we tried to build an AI that was doing what we want but then it wasn’t even trying to do what we want’. I think it’s lower now or even after that caveat, than pessimistic people. It’s going to be lower than all the MIRI folks, it’s going to be higher than almost everyone in the world at large, especially after specializing in this problem, which is a problem almost no one cares about, which is precisely how a thousand full time people for 20 years can reduce the whole risk by half or something.

Asya Bergal: I’m curious for your statement as to why you think your number is slightly lower than other people.

Paul Christiano: Yeah, I don’t know if I have a particularly crisp answer. Seems like it’s a more reactive thing of like, what are the arguments that it’s very doomy? A priori you might’ve been like, well, if you’re going to build some AI, you’re probably going to build the AI so it’s trying to do what you want it to do. Probably that’s that. Plus, most things can’t destroy the expected value of the future by 10%. You just can’t have that many things, otherwise there’s not going to be any value left in the end. In particular, if you had 100 such things, then you’d be down to like 1/1000th of your values. ¹⁄₁₀ hundred thousandth? I don’t know, I’m not good at arithmetic.
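
As a rough check on the arithmetic Christiano gestures at here, the sketch below assumes that each of 100 problems independently removes 10% of whatever expected value remains, so the losses compound multiplicatively. That independence assumption and the variable names are illustrative, not something stated in the interview.

```python
# Assumption: each problem independently destroys 10% of the remaining value.
loss_per_problem = 0.10
num_problems = 100

# Fraction of the original expected value left after all problems compound.
remaining = (1 - loss_per_problem) ** num_problems

print(f"Remaining fraction of value: {remaining:.2e}")
print(f"Roughly 1 in {round(1 / remaining):,}")
```

Under this assumption the remaining value is on the order of a few hundred-thousandths of the original, which sits between the two figures floated in the conversation and supports the underlying point: there cannot be many independent problems that each cost 10% of the future.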

Anyway, that’s a priori, there just aren’t that many things that are that bad and it seems like people would try and make AI that’s trying to do what they want. Then you’re like, okay, we get to be pessimistic because of some other argument about like, well, we don’t currently know how to build an AI which will do what we want. We’re like, there’s some extrapolation of current techniques on which we’re concerned that we wouldn’t be able to. Or maybe some more conceptual or intuitive argument about why AI is a scary kind of thing, and AIs tend to want to do random shit.

Then like, I don’t know, now we get into, how strong is that argument for doominess? Then a major thing that drives it is I am like, reasonable chance there is no problem in fact. Reasonable chance, if there is a problem we can cope with it just by trying. Reasonable chance, even if it will be hard to cope with, we can sort shit out well enough on paper that we really nail it and understand how to resolve it. Reasonable chance, if we don’t solve it the people will just not build AIs that destroy everything they value.

It’s lots of saving throws, you know? And you multiply the saving throws together and things look better. And they interact better than that because– well, in one way worse because it’s correlated: If you’re incompetent, you’re more likely to fail to solve the problem and more likely to fail to coordinate not to destroy the world. In some other sense, it’s better than interacting multiplicatively because strength in one area compensates for weakness in the other. I think there are a bunch of saving throws that could independently make things good, but then in reality you have to have a little bit here and a little bit here and a little bit here, if that makes sense. We have some reasonable understanding on paper that makes the problem easier. The problem wasn’t that bad. We wing it reasonably well and we do a bunch of work and in fact people are just like, ‘Okay, we’re not going to destroy the world given the choice.’ I guess I have this somewhat distinctive last saving throw where I’m like, ‘Even if you have unaligned AI, it’s probably not that bad.’

That doesn’t do much of the work, but you know you add a bunch of shit like that together.

Asya Bergal: That’s a lot of probability mass on a lot of different things. I do feel like my impression is that, on the first step of whether by default things are likely to be okay or things are likely to be good, people make arguments of the form, ‘You have a thing with a goal and it’s so hard to specify. By default, you should assume that the space of possible goals to specify is big, and the one right goal is hard to specify, hard to find.’ Obviously, this is modeling the thing as an agent, which is already an assumption.

Paul Christiano: Yeah. I mean it’s hard to run or have much confidence in arguments of that form. I think it’s possible to run tight versions of that argument that are suggestive. It’s hard to have much confidence in part because you’re like, look, the space of all programs is very broad, and the space that do your taxes is quite small, and we in fact are doing a lot of selecting from the vast space of programs to find one that does your taxes– so like, you’ve already done a lot of that.

And then you have to be getting into more detailed arguments about exactly how hard is it to select. I think there’s two kinds of arguments you can make that are different, or which I separate. One is the inner alignment treacherous turney argument, where like, we can’t tell the difference between AIs that are doing the right and wrong thing, even if you know what’s right because blah blah blah. The other is well, you don’t have this test for ‘was it right’ and so you can’t be selecting for ‘does the right thing’.

This is a place where the concern is disjunctive, you have like two different things, they’re both sitting in your alignment problem. They can again interact badly. But like, I don’t know, I don’t think you’re going to get to high probabilities from this. I think I would kind of be at like, well I don’t know. Maybe I think it’s more likely than not that there’s a real problem but not like 90%, you know? Like maybe I’m like two to one that there exists a non-trivial problem or something like that. All of the numbers I’m going to give are very made up though. If you asked me a second time you’ll get all different numbers.

Asya Bergal: That’s good to know.

Paul Christiano: Sometimes I anchor on past things I’ve said though, unfortunately.

Asya Bergal: Okay. Maybe I should give you some fake past Paul numbers.

Paul Christiano: You could be like, ‘In that interview, you said that it was 85%’. I’d be like, ‘I think it’s really probably 82%’.

Asya Bergal: I guess a related question is, is there plausible concrete evidence that you think could be gotten that would update you in one direction or the other significantly?

Paul Christiano: Yeah. I mean certainly, evidence will roll in once we have more powerful AI systems.

One can learn… I don’t know very much about any of the relevant institutions, I may know a little bit. So you can imagine easily learning a bunch about them by observing how well they solve analogous problems or learning about their structure, or just learning better about the views of people. That’s the second category.

We’re going to learn a bunch of shit as we continue thinking about this problem on paper to see like, does it look like we’re going to solve it or not? That kind of thing. It seems like there’s lots of sorts of evidence on lots of fronts, my views are shifting all over the place. That said, the inconsistency between one day and the next is relatively large compared to the actual changes in views from one day to the next.

Robert Long: Could you say a little bit more about evidence from once more advanced AI starts coming in? Like what sort of things you’re looking for that would change your mind on things?

Paul Christiano: Well you get to see things like, on inner alignment you get to see to what extent do you have the kind of crazy shit that people are concerned about? The first time you observe some crazy shit where your AI is like, ‘I’m going to be nice in order to assure that you think I’m nice so I can stab you in the back later.’ You’re like, ‘Well, I guess that really does happen despite modest effort to prevent it.’ That’s a thing you get. You get to learn in general about how models generalize, like to what extent they tend to do– this is sort of similar to what I just said, but maybe a little bit broader– to what extent are they doing crazy-ish stuff as they generalize?

You get to learn about how reasonable simple oversight is and to what extent do ML systems acquire knowledge that simple overseers don’t have that then get exploited as they optimize in order to produce outcomes that are actually bad. I don’t have a really concise description, but sort of like, to the extent that all these arguments depend on some empirical claims about AI, you get to see those claims tested increasingly.

Ronny Fernandez: So the impression I get from talking to other people who know you, and from reading some of your blog posts, but mostly from others, is that you’re somewhat more optimistic than most people that work in AI alignment. It seems like some people who work on AI alignment think something like, ‘We’ve got to solve some really big problems that we don’t understand at all or there are a bunch of unknown unknowns that we need to figure out.’ Maybe that’s because they have a broader conception of what solving AI alignment is like than you do?

Paul Christiano: That seems like it’s likely to be part of it. It does seem like I’m more optimistic than people in general, than people who work in alignment in general. I don’t really know… I don’t understand others’ views that well and I don’t know if they’re that– like, my views aren’t that internally coherent. My suspicion is others’ views are even less internally coherent. Yeah, a lot of it is going to be done by having a narrower conception of the problem.

Then a lot of it is going to be done by me just being… in terms of do we need a lot of work to be done, a lot of it is going to be me being like, I don’t know man, maybe. I don’t really understand when people get off the like high probability of like, yeah. I don’t see the arguments that are like, definitely there’s a lot of crazy stuff to go down. It seems like we really just don’t know. I do also think problems tend to be easier. I have more of that prior, especially for problems that make sense on paper. I think they tend to either be kind of easy, or else– if they’re possible, they tend to be kind of easy. There aren’t that many really hard theorems.

Robert Long: Can you say a little bit more of what you mean by that? That’s not a very good follow-up question, I don’t really know what it would take for me to understand what you mean by that better.

Paul Christiano: Like most of the time, if I’m like, ‘here’s an algorithms problem’, you can like– if you just generate some random algorithms problems, a lot of them are going to be impossible. Then amongst the ones that are possible, a lot of them are going to be soluble in a year of effort and amongst the rest, a lot of them are going to be soluble in 10 or a hundred years of effort. It’s just kind of rare that you find a problem that’s soluble– by soluble, I don’t just mean soluble by human civilization, I mean like, they are not provably impossible– that takes a huge amount of effort.

It normally… it’s less likely to happen the cleaner the problem is. There just aren’t many very clean algorithmic problems where our society worked on it for 10 years and then we’re like, ‘Oh geez, this still seems really hard.’ Examples are kind of like… factoring is an example of a problem we’ve worked a really long time on. It kind of has the shape, and this is the tendency on these sorts of problems, where there’s just a whole bunch of solutions and we hack away and we’re a bit better and a bit better and a bit better. It’s a very messy landscape, rather than jumping from having no solution to having a solution. It’s even rarer to have things where going from no solution to some solution is really possible but incredibly hard. There were some examples.

Robert Long: And you think that the problems we face are sufficiently similar?

Paul Christiano: I mean, I think this is going more into the like, ‘I don’t know man’ but what I think when I say ‘I don’t know man’ isn’t like, ‘Therefore, there’s an 80% chance that it’s going to be an incredibly difficult problem’ because that’s not what my prior is like. I’m like, reasonable chance it’s not that hard. Some chance it’s really hard. Probably more chance that– if it’s really hard, I think it’s more likely to be because all the clean statements of the problem are impossible. I think as statements get messier it becomes more plausible that it just takes a lot of effort. The more messy a thing is, the less likely it is to be impossible sometimes, but also the more likely it’s just a bunch of stuff you have to do.

Ronny Fernandez: It seems like one disagreement that you have with MIRI folks is that you think prosaic AGI will be easier to align than they do. Does that perception seem right to you?

Paul Christiano: I think so. I think they’re probably just like, ‘that seems probably impossible’. That’s related to the previous point.

Ronny Fernandez: If you had found out that prosaic AGI is nearly impossible to align or is impossible to align, how much would that change your-

Paul Christiano: It depends exactly what you found out, exactly how you found it out, et cetera. One thing you could be told is that there’s no perfectly scalable mechanism where you can throw in your arbitrarily sophisticated AI and turn the crank and get out an arbitrarily sophisticated aligned AI. That’s a possible outcome. That’s not necessarily that damning because now you’re like okay, fine, you can almost do it basically all the time and whatever.

That’s a big class of worlds and that would definitely be a thing I would be interested in understanding– how large is that gap actually, if the nice problem was totally impossible? If at the other extreme you just told me, ‘Actually, nothing like this is at all going to work, and it’s definitely going to kill everyone if you build an AI using anything like an extrapolation of existing techniques’, then I’m like, ‘Sounds pretty bad.’ I’m still not as pessimistic as MIRI people.

I’m like, maybe people just won’t destroy the world, you know, it’s hard to say. It’s hard to say what they’ll do. It also depends on the nature of how you came to know this thing. If you came to know it in a way that’s convincing to a reasonably broad group of people, that’s better than if you came to know it and your epistemic state was similar to– I think MIRI people feel more like, it’s already known to be hard, and therefore you can tell if you can’t convince people it’s hard. Whereas I’m like, I’m not yet convinced it’s hard, so I’m not so surprised that you can’t convince people it’s hard.

Then there’s more probability, if it was known to be hard, that we can convince people, and therefore I’m optimistic about outcomes conditioned on knowing it to be hard. I might become almost as pessimistic as MIRI if I thought that the problem was insolubly hard, just going to take forever or whatever, huge gaps aligning prosaic AI, and there would be no better evidence of that than currently exists. Like there’s no way to explain it better to people than MIRI currently can. If you take those two things, I’m maybe getting closer to MIRI’s levels of doom probability. I might still not be quite as doomy as them.

Ronny Fernandez: Why does the ability to explain it matter so much?

Paul Christiano: Well, a big part of why you don’t expect people to build unaligned AI is they’re like, they don’t want to. The clearer it is and the stronger the case, the more people can potentially do something. In particular, you might get into a regime where you’re doing a bunch of shit by trial and error and trying to wing it. And if you have some really good argument that the winging it is not going to work, then that’s a very different state than if you’re like, ‘Well, winging it doesn’t seem that good. Maybe it’ll fail.’ It’s different to be like, ‘Oh no, here’s an argument. You just can’t… It’s just not going to work.’

I don’t think we’ll really be in that state, but there’s like a whole spectrum from where we’re at now to that state and I expect to be further along it, if in fact we’re doomed. For example, if I personally would be like, ‘Well, I at least tried the thing that seemed obvious to me to try and now we know that doesn’t work.’ I sort of expect very directly from trying that to learn something about why that failed and what parts of the problem seem difficult.

Ronny Fernandez: Do you have a sense of why MIRI thinks aligning prosaic AI is so hard?

Paul Christiano: We haven’t gotten a huge amount of traction on this when we’ve debated it. I think part of their position, especially on the winging it thing, is they’re like, ‘Man, doing things right generally seems a lot harder than doing them.’ I guess probably building an AI in a way that’s good, for some arbitrary notion of good, will be harder– a lot harder than just building an AI at all.

There’s a theme that comes up frequently trying to hash this out, and it’s not so much about a theoretical argument, it’s just like, look, the theoretical argument establishes that there’s something a little bit hard here. And once you have something a little bit hard and now you have some giant organization, people doing the random shit they’re going to do, and all that chaos, and like, getting things to work takes all these steps, and getting this harder thing to work is going to have some extra steps, and everyone’s going to be doing it. They’re more pessimistic based on those kinds of arguments.

That’s the thing that comes up a lot. I think probably most of the disagreement is still in the, you know, theoretically, how much– certainly we disagree about like, can this problem just be solved on paper in advance? Where I’m like, reasonable chance, you know? At least a third chance, they’ll just on paper be like, ‘We have nailed it.’ There’s really no tension, no additional engineering effort required. And they’re like, that’s like zero. I don’t know what they think it is. More than zero, but low.

Ronny Fernandez: Do you guys think you’re talking about the same problem exactly?

Paul Christiano: I think there we are probably. At that step we are. Just like, is your AI trying to destroy everything? Yes. No. The main place there’s some bleed over– the main thing that MIRI maybe considers in scope and I don’t is like, if you build an AI, it may someday have to build another AI. And what if the AI it builds wants to destroy everything? Is that our fault or is that the AI’s fault? And I’m more on like, that’s the AI’s fault. That’s not my job. MIRI’s maybe more like not distinguishing those super cleanly, but they would say that’s their job. The distinction is a little bit subtle in general, but-

Ronny Fernandez: I guess I’m not sure why you cashed out in terms of fault.

Paul Christiano: I think for me it’s mostly like: there’s a problem we can hope to resolve. I think there’s two big things. One is like, suppose you don’t resolve that problem. How likely is it that someone else will solve it? Saying it’s someone else’s fault is in part just saying like, ‘Look, there’s this other person who had a reasonable opportunity to solve it and it was a lot smarter than us.’ So the work we do is less likely to make the difference between it being soluble or not. Because there’s this other smarter person.

And then the other thing is like, what should you be aiming for? To the extent there’s a clean problem here which one could hope to solve, or one should bite off as a chunk, what fits in conceptually the same problem versus what’s like– you know, an analogy I sometimes make is, if you build an AI that’s doing important stuff, it might mess up in all sorts of ways. But when you’re asking, ‘Is my AI going to mess up when building a nuclear reactor?’ It’s a thing worth reasoning about as an AI person, but also like it’s worth splitting into like– part of that’s an AI problem, and part of that’s a problem about understanding managing nuclear waste. Part of that should be done by people reasoning about nuclear waste and part of it should be done by people reasoning about AI.

This is a little subtle because both of the problems have to do with AI. I would say my relationship with that is similar to like, suppose you told me that at some future point, some smart people might make an AI. There’s just a meta and object level on which you could hope to help with the problem.

I’m hoping to help with the problem on the object level in the sense that we are going to do research which helps people align AI, and in particular, will help the future AI align the next AI. Because it’s like people. It’s at that level, rather than being like, ‘We’re going to construct a constitution of that AI such that when it builds future AI it will always definitely work’. This is related to like– there’s this old argument about recursive self-improvement. It’s historically figured a lot in people’s discussion of why the problem is hard, but on a naive perspective it’s not obvious why it should, because you do only a small number of large modifications before your systems are sufficiently intelligent relative to you that it seems like your work should be obsolete. Plus like, them having a bunch of detailed knowledge on the ground about what’s going down.

It seems unclear to me how– yeah, this is related to our disagreement– how much you’re happy just deferring to the future people and being like, ‘Hope that they’ll cope’. Maybe they won’t even cope by solving the problem in the same way, they might cope by, the crazy AIs that we built reach the kind of agreement that allows them to not build even crazier AIs in the same way that we might do that. I think there’s some general frame of, I’m just taking responsibility for less, and more saying, can we leave the future people in a situation that is roughly as good as our situation? And by future people, I mean mostly AIs.

Ronny Fernandez: Right. The two things that you think might explain your relative optimism are something like: Maybe we can leave the problem to smarter agents that are humans. Maybe we can leave the problem to smarter agents that are not humans.

Paul Christiano: Also a lot of disagreement about the problem. Those are certainly two drivers. They’re not exhaustive in the sense that there’s also a huge amount of disagreement about like, ‘How hard is this problem?’ Which is some combination of like, ‘How much do we know about it?’ Where they’re more like, ‘Yeah, we’ve thought about it a bunch and have some views.’ And I’m like, ‘I don’t know, I don’t think I really know shit.’ Then part of it is concretely there’s a bunch of– on the object level, there’s a bunch of arguments about why it would be hard or easy so we don’t reach agreement. We consistently disagree on lots of those points.

Ronny Fernandez: Do you think the goal state for you guys is the same though? If I gave you guys a bunch of AGIs, would you guys agree about which ones are aligned and which ones are not? If you could know all of their behaviors?

Paul Christiano: I think at that level we’d probably agree. We don’t agree more broadly about what constitutes a win state or something. They have this more expansive conception– or I guess it’s narrower– that the win state is supposed to do more. They are imagining more that you’ve resolved this whole list of future challenges. I’m more not counting that.

We’ve had this… yeah, I guess I now mostly use intent alignment to refer to this problem where there’s risk of ambiguity… the problem that I used to call AI alignment. There was a long obnoxious back and forth about what the alignment problem should be called. MIRI does use aligned AI to be like, ‘an AI that produces good outcomes when you run it’. Which I really object to as a definition of aligned AI a lot. So if they’re using that as their definition of aligned AI, we would probably disagree.

Ronny Fernandez: Shifting terms or whatever… one thing that they’re trying to work on is making an AGI that has a property that is also the property you’re trying to make sure that AGI has.

Paul Christiano: Yeah, we’re all trying to build an AI that’s trying to do the right thing.

Ronny Fernandez: I guess I’m thinking more specifically, for instance, I’ve heard people at MIRI say something like, they want to build an AGI that I can tell it, ‘Hey, figure out how to copy a strawberry, and don’t mess anything else up too badly.’ Does that seem like the same problem that you’re working on?

Paul Christiano: I mean it seems like in particular, you should be able to do that. I think it’s not clear whether that captures all the complexity of the problem. That’s just sort of a question about what solutions end up looking like, whether that turns out to have the same difficulty.

The other things you might think are involved that are difficult are… well, I guess one problem is just how you capture competitiveness. Competitiveness for me is a key desideratum. And it’s maybe easy to elide in that setting, because it just makes a strawberry. Whereas I am like, if you make a strawberry literally as well as anyone else can make a strawberry, it’s just a little weird to talk about. And it’s a little weird to even formalize what competitiveness means in that setting. I think you probably can, but whether or not you do that’s not the most natural or salient aspect of the situation.

So I probably disagree with them about– I’m like, there are probably lots of ways to have agents that make strawberries and are very smart. That’s just another disagreement that’s another function of the same basic, ‘How hard is the problem’ disagreement. I would guess relative to me, in part because of being more pessimistic about the problem, MIRI is more willing to settle for an AI that does one thing. And I care more about competitiveness.

Asya Bergal: Say you just learn that prosaic AI is just not going to be the way we get to AGI. How does that make you feel about the IDA approach versus the MIRI approach?

Paul Christiano: So my overall stance when I think about alignment is, there’s a bunch of possible algorithms that you could use. And the game is understanding how to align those algorithms. And it’s kind of a different game. There’s a lot of common subproblems in between different algorithms you might want to align, it’s potentially a different game for different algorithms. That’s an important part of the answer. I’m mostly focusing on the ‘align this particular’– I’ll call it learning, but it’s a little bit more specific than learning– where you search over policies to find a policy that works well in practice. If we’re not doing that, then maybe that solution is totally useless, maybe it has common subproblems with the solution you actually need. That’s one part of the answer.

Another big difference is going to be, timelines views will shift a lot if you’re handed that information. So it will depend exactly on the nature of the update. I don’t have a strong view about whether it makes my timelines shorter or longer overall. Maybe you should bracket that though.

In terms of returning to the first one of trying to align particular algorithms, I don’t know. I think I probably share some of the MIRI persp– well, no. It feels to me like there’s a lot of common subproblems. Aligning expert systems seems like it would involve a lot of the same reasoning as aligning learners. To the extent that’s true, probably future stuff also will involve a lot of the same subproblems, but I doubt the algorithm will look the same. I also doubt the actual algorithm will look anything like a particular pseudocode we might write down for iterated amplification now.

Asya Bergal: Does iterated amplification in your mind rely on this thing that searches through policies for the best policy? The way I understand it, it doesn’t feel like it necessarily does.

Paul Christiano: So, you use this distillation step. And the reason you want to do amplification, or this short-hop, expensive amplification, is because you interleave it with this distillation step. And I normally imagine the distillation step as being, learn a thing which works well in practice on a reward function defined by the overseer. You could imagine other things that also needed to have this framework, but it’s not obvious whether you need this step if you didn’t somehow get granted something like the–

Asya Bergal: That you could do the distillation step somehow.

Paul Christiano: Yeah. It’s unclear what else would– so another example of a thing that could fit in, and this maybe makes it seem more general, is if you had an agent that was just incentivized to make lots of money. Then you could just have your distillation step be like, ‘I randomly check the work of this person, and compensate them based on the work I checked’. That’s a suggestion of how this framework could end up being more general.

But I mostly do think about it in the context of learning in particular. I think it’s relatively likely to change if you’re not in that setting. Well, I don’t know. I don’t have a strong view. I’m mostly just working in that setting, mostly because it seems reasonably likely, seems reasonably likely to have a bunch in common, learning is reasonably likely to appear even if other techniques appear. That is, learning is likely to play a part in powerful AI even if other techniques also play a part.
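
To make the random-check version of the distillation step mentioned above more concrete, here is a minimal Python sketch. It is an illustration only, not a scheme described in the interview: the function `audit_payment`, its parameters, and the toy squaring task are all hypothetical. The overseer verifies a random sample of a worker's outputs and scales the total payment by the fraction of checked items that pass.

```python
import random

def audit_payment(work, check, sample_size, pay_per_task, seed=0):
    """Pay for all tasks in proportion to the pass rate on a random audit.

    work:         list of completed items (hypothetical task outputs)
    check:        overseer's function returning True if an item is acceptable
    sample_size:  how many items the overseer can afford to verify
    pay_per_task: payment per task if every audited item passes
    """
    if not work:
        return 0.0
    rng = random.Random(seed)
    # Audit a uniformly random subset; the worker cannot know which items in advance.
    sample = rng.sample(work, min(sample_size, len(work)))
    passed = sum(1 for item in sample if check(item))
    return pay_per_task * len(work) * passed / len(sample)

# Toy task (an assumption for illustration): compute squares of integers.
def is_correct(item):
    n, answer = item
    return answer == n * n

honest = [(n, n * n) for n in range(100)]
sloppy = [(n, n * n + 1) for n in range(100)]

print(audit_payment(honest, is_correct, sample_size=10, pay_per_task=1.0))
print(audit_payment(sloppy, is_correct, sample_size=10, pay_per_task=1.0))
```

Because only a sample is checked, the overseer's cost stays small relative to the amount of work, while the worker's expected payment still tracks the quality of all of it.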

Asya Bergal: Are there other people or resources that you think would be good for us to look at if we were looking at the optimism view?

Paul Christiano: Before we get to resources or people, I think one of the basic questions is, there’s this perspective which is fairly common in ML, which is like, ‘We’re kind of just going to do a bunch of stuff, and it’ll probably work out’. That’s probably the basic thing to be getting at. How right is that?

This is the bad view of safety conditioned on– I feel like prosaic AI is in some sense the worst– seems like about as bad as things would have gotten in terms of alignment. Where, I don’t know, you try a bunch of shit, just a ton of stuff, a ton of trial and error seems pretty bad. Anyway, this is a random aside maybe more related to the previous point. But yeah, this is just with alignment. There’s this view in ML that’s relatively common that’s like, we’ll try a bunch of stuff to get the AI to do what we want, it’ll probably work out. Some problems will come up. We’ll probably solve them. I think that’s probably the most important thing in the optimism vs pessimism side.

And I don’t know, I mean this has been a project that like, it’s a hard project. I think the current state of affairs is like, the MIRI folk have strong intuitions about things being hard. Essentially no one in… very few people in ML agree with those, or even understand where they’re coming from. And even people in the EA community who have tried a bunch to understand where they’re coming from mostly don’t. Mostly people either end up understanding one side or the other and don’t really feel like they’re able to connect everything. So it’s an intimidating project in that sense. I think the MIRI people are the main proponents of the everything is doomed, the people to talk to on that side. And then in some sense there’s a lot of people on the other side who you can talk to, and the question is just, who can articulate the view most clearly? Or who has most engaged with the MIRI view such that they can speak to it?

Ronny Fernandez: Those are people I would be particularly interested in. If there are people that understand all the MIRI arguments but still have broadly the perspective you’re describing, like some problems will come up, probably we’ll fix them.

Paul Christiano: I don’t know good– I don’t have good examples of people for you. I think most people just find the MIRI view kind of incomprehensible, or like, it’s a really complicated thing, even if the MIRI view makes sense on its face. I don’t think people have gotten enough into the weeds. It really rests a lot right now on this fairly complicated cluster of intuitions. I guess on the object level, I think I’ve just engaged a lot more with the MIRI view than most people who are– who mostly take the ‘everything will be okay’ perspective. So happy to talk on the object level, and speaking more to arguments. I think it’s a hard thing to get into, but it’s going to be even harder to find other people in ML who have engaged with the view that much.

They might be able to make other general criticisms of like, here’s why I haven’t really… like it doesn’t seem like a promising kind of view to think about. I think you could find more people who have engaged at that level. I don’t know who I would recommend exactly, but I could think about it. Probably a big question will be who is excited to talk to you about it.

Asya Bergal: I am curious about your response to MIRI’s object level arguments. Is there a place that exists somewhere?

Paul Christiano: There’s some back and forth on the internet. I don’t know if it’s great. There’s some LessWrong posts. Eliezer for example wrote this post about why things were doomed, why I in particular was doomed. I don’t know if you read that post.

Asya Bergal: I can also ask you about it now, I just don’t want to take too much of your time if it’s a huge body of things.

Paul Christiano: The basic argument would be like, 1) On paper I don’t think we yet have a good reason to feel doomy. And I think there’s some basic research intuition about how much a problem– suppose you poke at a problem a few times, and you’re like ‘Agh, seems hard to make progress’. How much do you infer that the problem’s really hard? And I’m like, not much. As a person who’s poked at a bunch of problems, let me tell you, that often doesn’t work and then you solve it in like 10 years of effort.

So that’s one thing. That’s a point where I have relatively little sympathy for the MIRI way. That’s one set of arguments: is there a good way to get traction on this problem? Are there clever algorithms? I’m like, I don’t know, I don’t feel like the kind of evidence we’ve seen is the kind of evidence that should be persuasive. As some evidence in that direction, I’d be like, I have not been thinking about this that long. I feel like there have often been things that felt like, or that MIRI would have defended as like, here’s a hard obstruction. Then you think about it and you’re actually like, ‘Here are some things you can do.’ And it may still be an obstruction, but it’s no longer quite so obvious where it is, and there were avenues of attack.

That’s one thing. The second thing is like, a metaphor that makes me feel good– MIRI talks a lot about the evolution analogy. If I imagine the evolution problem– so if I’m a person, and I’m breeding some animals, I’m breeding some superintelligence. Suppose I wanted to breed an animal modestly smarter than humans that is really docile and friendly. I’m like, I don’t know man, that seems like it might work. That’s where I’m at. I think they are… it’s been a little bit hard to track down this disagreement, and I think this is maybe in a fresher, rawer state than the other stuff, where we haven’t had enough back and forth.

But I’m like, it doesn’t sound necessarily that hard. I just don’t know. I think their position, their position when they’ve written something has been a little bit more like, ‘But you couldn’t breed a thing, that after undergoing radical changes in intelligence or situation would remain friendly’. But then I’m normally like, but it’s not clear why that’s needed? I would really just like to create something slightly superhuman, and it’s going to work with me to breed something that’s slightly smarter still that is friendly.

We haven’t really been able to get traction on that. I think they have an intuition that maybe there’s some kind of invariance and things become gradually more unraveled as you go on. Whereas I have more intuition that it’s plausible. After this generation, there’s just smarter and smarter people thinking about how to keep everything on the rails. It’s very hard to know.

That’s the second thing. I have found that really… that feels like it gets to the heart of some intuitions that are very different, and I don’t understand what’s up there. There’s a third category which is like, on the object level, there’s a lot of directions that I’m enthusiastic about where they’re like, ‘That seems obviously doomed’. So you could divide those up into the two problems. There’s the family of problems that are more like the inner alignment problem, and then outer alignment stuff.

On the inner alignment stuff, I haven’t thought that much about it, but examples of things that I’m optimistic about that they’re super pessimistic about are like, stuff that looks more like verification, or maybe stepping back even for that, there’s this basic paradigm of adversarial training, where I’m like, it seems close to working. And you could imagine it being like, it’s just a research problem to fill in the gaps. Whereas they’re like, that’s so not the kind of thing that would work. I don’t really know where we’re at with that. I do see there are formal obstructions to adversarial training in particular working. I’m like, I see why this is not yet a solution. For example, you can have this case where there’s a predicate that the model checks, and it’s easy to check but hard to construct examples. And then in your adversarial training you can’t ever feed an example where it’ll fail. So we get into like, is it plausible that you can handle that problem with either 1) Doing something more like verification, where you say, you ask them not to perform well on real inputs but on pseudo inputs. Or like, you ask the attacker just to show how it’s conceivable that the model could do a bad thing in some sense.

That’s one possible approach, where the other would be something more like interpretability, where you say like, ‘Here’s what the model is doing. In addition to its behavior we get this other signal that the paper was depending on this fact, its predicate paths, which it shouldn’t have been dependent on.’ The question is, can either of those yield good behavior? I’m like, I don’t know, man. It seems plausible. And they’re like ‘Definitely not.’ And I’m like, ‘Why definitely not?’ And they’re like ‘Well, that’s not getting at the real essence of the problem.’ And I’m like ‘Okay, great, but how did you substantiate this notion of the real essence of the problem? Where is that coming from? Is that coming from a whole bunch of other solutions that look plausible that failed?’ And their take is kind of like, yes, and I’m like, ‘But none of those– there weren’t actually even any candidate solutions there really that failed yet. You’ve got maybe one thing, or like, you showed there exists a problem in some minimal sense.’ This comes back to the first of the three things I listed. But it’s a little bit different in that I think you can just stare at particular things and they’ll be like, ‘Here’s how that particular thing is going to fail.’ And I’m like ‘I don’t know, it seems plausible.’
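
The obstruction to adversarial training mentioned above, a predicate that is easy to check but hard to construct examples for, can be illustrated with a toy Python sketch. Everything here is an assumption made for illustration and does not come from the interview: the hash-prefix trigger, the names `is_trigger`, `model`, and `random_attacker`, and the choice of ten leading hex zeros. The toy model misbehaves only on inputs whose SHA-256 digest starts with the prefix, which takes one hash to check but roughly 16^10 guesses to find by sampling.

```python
import hashlib
import random

# Hypothetical trigger: digest begins with 10 hex zeros (about 1 input in 16**10).
TRIGGER_PREFIX = "0" * 10

def is_trigger(x: bytes) -> bool:
    """Cheap to evaluate: one hash and a prefix comparison."""
    return hashlib.sha256(x).hexdigest().startswith(TRIGGER_PREFIX)

def model(x: bytes) -> str:
    """Toy policy that behaves well except on trigger inputs."""
    return "bad" if is_trigger(x) else "good"

def random_attacker(tries: int, seed: int = 0):
    """Adversary that samples random inputs, looking for one where the model fails."""
    rng = random.Random(seed)
    for _ in range(tries):
        x = rng.getrandbits(128).to_bytes(16, "big")
        if model(x) == "bad":
            return x
    return None  # no failing input found within the budget

# With 10,000 tries the chance of hitting a trigger is on the order of 1e-8,
# so the attacker is overwhelmingly likely to come back empty-handed.
print(random_attacker(tries=10_000))
```

In this toy setting the training loop never sees a failing example even though one exists, which is the gap that the verification-style and interpretability-style approaches in the surrounding discussion are meant to address.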

That’s on inner alignment. And there’s maybe some on outer alignment. I feel like they’ve given a lot of ground in the last four years on how doomy things seem on outer alignment. I think they still have some– if we’re talking about amplification, I think the position would still be, ‘Man, why would that agent be aligned? It doesn’t at all seem like it would be aligned.’ That has also been a little bit surprisingly tricky to make progress on. I think it’s similar, where I’m like, yeah, I grant the existence of some problem or some thing which needs to be established, but I don’t grant– I think their position would be like, this hasn’t made progress or just like, pushed around the core difficulty. I’m like, I don’t grant the conception of the core difficulty in which this has just pushed around the core difficulty. I think that… substantially in that kind of thing, being like, here’s an approach that seems plausible, we don’t have a clear obstruction but I think that it is doomed for these deep reasons. I have maybe a higher bar for what kind of support the deep reasons need.

I also just think on the merits, they have not really engaged with– and this is partly my responsibility for not having articulated the arguments in a clear enough way– although I think they have not engaged with even the clearest articulation as of two years ago of what the hope was. But that’s probably on me for not having an even clearer articulation than that, and also definitely not up to them to engage with anything. To the extent it’s a moving target, not up to them to engage with the most recent version. Where, most recent version– the proposal doesn’t really change that much, or like, the case for optimism has changed a little bit. But it’s mostly just like, the state of argument concerning it, rather than the version of the scheme.