Conversation with Paul Christiano


AI Impacts talked to AI safety researcher Paul Christiano about his views on AI risk. With his permission, we have transcribed this interview.

Participants

  • Paul Christiano – AI safety researcher (interviewee)

  • Asya Bergal, Robert Long, and Ronny Fernandez – AI Impacts (interviewers)

Summary

We spoke with Paul Christiano on August 13, 2019. Here is a brief summary of that conversation:

  • AI safety is worth working on because AI poses a large risk, and AI safety is neglected and tractable.

  • Christiano is more optimistic about the likely social consequences of advanced AI than some others in AI safety, in particular researchers at the Machine Intelligence Research Institute (MIRI), for the following reasons:

    • The prior on any given problem reducing the expected value of the future by 10% should be low.

    • There are several ‘saving throws’ – ways in which, even if one thing turns out badly, something else can turn out well, such that AI is not catastrophic.

    • Many algorithmic problems are either solvable within 100 years, or provably impossible; this inclines Christiano to think that AI safety problems are reasonably likely to be easy.

    • MIRI thinks success is guaranteeing that unaligned intelligences are never created, whereas Christiano just wants to leave the next generation of intelligences in at least as good a place as humans were when building them.

    • ‘Prosaic AI’ that looks like current AI systems will be less hard to align than MIRI thinks:

      • Christiano thinks there’s at least a one-in-three chance that we’ll be able to solve AI safety on paper in advance.

      • A common view within ML is that we’ll successfully solve problems as they come up.

    • Christiano has relatively less confidence in several inside view arguments for high levels of risk:

      • Building safe AI requires hitting a small target in the space of programs, but building any AI also requires hitting a small target.

      • Because Christiano thinks that the state of evidence is less clear-cut than MIRI does, Christiano also has a higher probability that people will become more worried in the future.

      • Just because we haven’t solved many problems in AI safety yet doesn’t mean they’re intractably hard – many technical problems feel this way and then get solved in 10 years of effort.

      • Evolution is often used as an analogy to argue that general intelligence (humans with their own goals) becomes dangerously unaligned with the goals of the outer optimizer (evolution selecting for reproductive fitness). But this analogy doesn’t make Christiano feel so pessimistic, e.g. he thinks that if we tried, we could breed animals that are somewhat smarter than humans and are also friendly and docile.

      • Christiano is optimistic about verification, interpretability, and adversarial training for inner alignment, whereas MIRI is pessimistic.

      • MIRI thinks the outer alignment approaches Christiano proposes are just obscuring the core difficulties of alignment, while Christiano is not yet convinced there is a deep core difficulty.

  • Christiano thinks there are several things that could change his mind and optimism levels, including:

    • Learning about institutions and observing how they solve problems analogous to AI safety.

    • Seeing whether AIs become deceptive and how they respond to simple oversight.

    • Seeing how much progress we make on AI alignment over the coming years.

  • Christiano is relatively optimistic about his iterated amplification approach:

    • Christiano cares more about making aligned AIs that are competitive with unaligned AIs, whereas MIRI is more willing to settle for an AI with very narrow capabilities.

    • Iterated amplification is largely based on learning-based AI systems, though it may work in other cases.

    • Even if iterated amplification isn’t the answer to AI safety, it’s likely to have subproblems in common with problems that are important in the future.

  • There are still many disagreements between Christiano and the Machine Intelligence Research Institute (MIRI) that are messy and haven’t been made precise.

This transcript has been lightly edited for concision and clarity.

Transcript

Asya Bergal: Okay. We are recording. I’m going to ask you a bunch of questions related to something like AI optimism.

I guess the proposition that we’re looking at is something like ‘is it valuable for people to be spending significant effort doing work that purports to reduce the risk from advanced artificial intelligence’? The first question would be to give a short-ish version of the reasoning around that.

Paul Christiano: Around why it’s overall valuable?

Asya Bergal: Yeah. Or the extent to which you think it’s valuable.

Paul Christiano: I don’t know, this seems complicated. I’m acting from some long-termist perspective, I’m like, what can make the world irreversibly worse? There aren’t that many things: we go extinct. It’s hard to go extinct, doesn’t seem that likely.

Robert Long: We keep forgetting to say this, but we are focusing less on ethical considerations that might affect that. We’ll grant… yeah, with all that in the background….

Paul Christiano: Granting long-termism, but then it seems like it depends a lot on what’s the probability? What fraction of our expected future do we lose by virtue of messing up alignment * what’s the elasticity of that to effort / how much effort?

Robert Long: That’s the stuff we’re curious to see what people think about.

Asya Bergal: I also just read your 80K interview, which I think probably covered like a lot of the reasoning about this.

Paul Christiano: They probably did. I don’t remember exactly what’s in there, but it was a lot of words.

I don’t know. I’m like, it’s a lot of doom probability. Like maybe I think AI alignment per se is like 10% doominess. That’s a lot. Then it seems like if we understood everything in advance really well, or just had a bunch of people working now on understanding what’s up, we could easily reduce that by a big chunk.

Ronny Fernandez: Sorry, what do you mean by 10% doominess?

Paul Christiano: I don’t know, the future is 10% worse than it would otherwise be in expectation by virtue of our failure to align AI. I made up 10%, it’s kind of a random number. I don’t know, it’s less than 50%. It’s more than 10% conditioned on AI soon I think.

Ronny Fernandez: And that’s change in expected value.

Paul Christiano: Yeah. Anyway, so 10% is a lot. Then I’m like, maybe if we sorted all our shit out and had a bunch of people who knew what was up, and had a good theoretical picture of what was up, and had more info available about whether it was a real problem. Maybe really nailing all that could cut that risk from 10% to 5% and maybe like, you know, there aren’t that many people who work on it, it seems like a marginal person can easily do a thousandth of that 5% change. Now you’re looking at one in 20,000 or something, which is a good deal.
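
[Editor’s note: a rough back-of-envelope sketch of the arithmetic Christiano gestures at above, using his own illustrative numbers, which he stresses are made up; the variable names are ours.]

```python
# Christiano's illustrative numbers, not precise estimates.
risk_now = 0.10            # expected fraction of the future's value lost to misalignment
risk_if_nailed = 0.05      # residual risk if the field "really nails" the problem
marginal_share = 1 / 1000  # one marginal person's share of that reduction

expected_gain = (risk_now - risk_if_nailed) * marginal_share
print(expected_gain)       # ~5e-05, i.e. roughly 1 in 20,000 of the future's expected value
```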

Asya Bergal: I think my impression is that that 10% is lower than some large set of people. I don’t know if other people agree with that.

Paul Christiano: Certainly, 10% is lower than lots of people who care about AI risk. I mean it’s worth saying that I have this slightly narrow conception of what is the alignment problem. I’m not including all AI risk in the 10%. I’m not including in some sense most of the things people normally worry about and just including the like ‘we tried to build an AI that was doing what we want but then it wasn’t even trying to do what we want’. I think it’s lower, even after that caveat, than pessimistic people. It’s going to be lower than all the MIRI folks, it’s going to be higher than almost everyone in the world at large, especially after specializing in this problem, which is a problem almost no one cares about, which is precisely how a thousand full time people for 20 years can reduce the whole risk by half or something.

Asya Bergal: I’m curious for your statement as to why you think your number is slightly lower than other people.

Paul Christiano: Yeah, I don’t know if I have a particularly crisp answer. Seems like it’s a more reactive thing of like, what are the arguments that it’s very doomy? A priori you might’ve been like, well, if you’re going to build some AI, you’re probably going to build the AI so it’s trying to do what you want it to do. Probably that’s that. Plus, most things can’t destroy the expected value of the future by 10%. You just can’t have that many things, otherwise there’s not going to be any value left in the end. In particular, if you had 100 such things, then you’d be down to like 1/1000th of your values. 1/100,000th? I don’t know, I’m not good at arithmetic.
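
[Editor’s note: the exact figure isn’t given in the conversation, and Christiano flags that he is guessing; for what it’s worth, the compounding he is pointing at works out as follows.]

```python
# If 100 independent problems each destroyed 10% of the future's expected value,
# the remaining fraction would be:
remaining = 0.9 ** 100
print(remaining)  # ~2.66e-05, i.e. roughly 1/38,000 of the original expected value
```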

Anyway, that’s a priori: there just aren’t that many things that are that bad, and it seems like people would try and make AI that’s trying to do what they want. Then you’re like, okay, we get to be pessimistic because of some other argument about like, well, we don’t currently know how to build an AI which will do what we want. We’re like, there’s some extrapolation of current techniques on which we’re concerned that we wouldn’t be able to. Or maybe some more conceptual or intuitive argument about why AI is a scary kind of thing, and AIs tend to want to do random shit.

Then like, I don’t know, now we get into, how strong is that argument for doominess? Then a major thing that drives it is I am like, reasonable chance there is no problem in fact. Reasonable chance, if there is a problem we can cope with it just by trying. Reasonable chance, even if it will be hard to cope with, we can sort shit out well enough on paper that we really nail it and understand how to resolve it. Reasonable chance, if we don’t solve it the people will just not build AIs that destroy everything they value.

It’s lots of saving throws, you know? And you multiply the saving throws together and things look better. And they interact better than that because– well, in one way worse because it’s correlated: If you’re incompetent, you’re more likely to fail to solve the problem and more likely to fail to coordinate not to destroy the world. In some other sense, it’s better than interacting multiplicatively because strength in one area compensates for weakness in another. I think there are a bunch of saving throws that could independently make things good, but then in reality you have to have a little bit here and a little bit here and a little bit here, if that makes sense. We have some reasonable understanding on paper that makes the problem easier. The problem wasn’t that bad. We wing it reasonably well and we do a bunch of work and in fact people are just like, ‘Okay, we’re not going to destroy the world given the choice.’ I guess I have this somewhat distinctive last saving throw where I’m like, ‘Even if you have unaligned AI, it’s probably not that bad.’

That doesn’t do much of the work, but you know you add a bunch of shit like that together.

Asya Bergal: That’s a lot of probability mass on a lot of different things. I do feel like my impression is that, on the first step of whether by default things are likely to be okay or things are likely to be good, people make arguments of the form, ‘You have a thing with a goal and it’s so hard to specify. By default, you should assume that the space of possible goals to specify is big, and the one right goal is hard to specify, hard to find.’ Obviously, this is modeling the thing as an agent, which is already an assumption.

Paul Christiano: Yeah. I mean it’s hard to run or have much confidence in arguments of that form. I think it’s possible to run tight versions of that argument that are suggestive. It’s hard to have much confidence in part because you’re like, look, the space of all programs is very broad, and the space of programs that do your taxes is quite small, and we in fact are doing a lot of selecting from the vast space of programs to find one that does your taxes– so like, you’ve already done a lot of that.

And then you have to be getting into more detailed arguments about exactly how hard is it to select. I think there’s two kinds of arguments you can make that are different, or which I separate. One is the inner alignment treacherous turn argument, where like, we can’t tell the difference between AIs that are doing the right and wrong thing, even if you know what’s right because blah blah blah. The other is, well, you don’t have this test for ‘was it right’ and so you can’t be selecting for ‘does the right thing’.

This is a place where the concern is disjunctive, you have like two different things, they’re both sitting in your alignment problem. They can again interact badly. But like, I don’t know, I don’t think you’re going to get to high probabilities from this. I think I would kind of be at like, well I don’t know. Maybe I think it’s more likely than not that there’s a real problem but not like 90%, you know? Like maybe I’m like two to one that there exists a non-trivial problem or something like that. All of the numbers I’m going to give are very made up though. If you asked me a second time you’ll get all different numbers.

Asya Ber­gal: That’s good to know.

Paul Christiano: Sometimes I anchor on past things I’ve said though, unfortunately.

Asya Bergal: Okay. Maybe I should give you some fake past Paul numbers.

Paul Christiano: You could be like, ‘In that interview, you said that it was 85%’. I’d be like, ‘I think it’s really probably 82%’.

Asya Bergal: I guess a related question is, is there plausible concrete evidence that you think could be gotten that would update you in one direction or the other significantly?

Paul Christiano: Yeah. I mean certainly, evidence will roll in once we have more powerful AI systems.

One can learn… I don’t know very much about any of the relevant institutions, I may know a little bit. So you can imagine easily learning a bunch about them by observing how well they solve analogous problems or learning about their structure, or just learning better about the views of people. That’s the second category.

We’re going to learn a bunch of shit as we continue thinking about this problem on paper to see like, does it look like we’re going to solve it or not? That kind of thing. It seems like there’s lots of sorts of evidence on lots of fronts, my views are shifting all over the place. That said, the inconsistency between one day and the next is relatively large compared to the actual changes in views from one day to the next.

Robert Long: Could you say a little bit more about evidence once more advanced AI starts coming in? Like what sort of things you’re looking for that would change your mind on things?

Paul Christiano: Well you get to see things like, on inner alignment you get to see to what extent do you have the kind of crazy shit that people are concerned about? The first time you observe some crazy shit where your AI is like, ‘I’m going to be nice in order to assure that you think I’m nice so I can stab you in the back later.’ You’re like, ‘Well, I guess that really does happen despite modest effort to prevent it.’ That’s a thing you get. You get to learn in general about how models generalize, like to what extent they tend to do– this is sort of similar to what I just said, but maybe a little bit broader– to what extent are they doing crazy-ish stuff as they generalize?

You get to learn about how reasonable simple oversight is and to what extent ML systems acquire knowledge that simple overseers don’t have that then gets exploited as they optimize in order to produce outcomes that are actually bad. I don’t have a really concise description, but sort of like, to the extent that all these arguments depend on some empirical claims about AI, you get to see those claims tested increasingly.

Ronny Fernandez: So the impression I get from talking to other people who know you, and from reading some of your blog posts, but mostly from others, is that you’re somewhat more optimistic than most people that work in AI alignment. It seems like some people who work on AI alignment think something like, ‘We’ve got to solve some really big problems that we don’t understand at all or there are a bunch of unknown unknowns that we need to figure out.’ Maybe that’s because they have a broader conception of what solving AI alignment is like than you do?

Paul Christiano: That seems like it’s likely to be part of it. It does seem like I’m more optimistic than people in general, than people who work in alignment in general. I don’t really know… I don’t understand others’ views that well and I don’t know if they’re that– like, my views aren’t that internally coherent. My suspicion is others’ views are even less internally coherent. Yeah, a lot of it is going to be done by having a narrower conception of the problem.

Then a lot of it is going to be done by me just being… in terms of do we need a lot of work to be done, a lot of it is going to be me being like, I don’t know man, maybe. I don’t really understand when people get off the like high probability of like, yeah. I don’t see the arguments that are like, definitely there’s a lot of crazy stuff to go down. It seems like we really just don’t know. I do also think problems tend to be easier. I have more of that prior, especially for problems that make sense on paper. I think they tend to either be kind of easy, or else– if they’re possible, they tend to be kind of easy. There aren’t that many really hard theorems.

Robert Long: Can you say a little bit more of what you mean by that? That’s not a very good follow-up question, I don’t really know what it would take for me to understand what you mean by that better.

Paul Christiano: Like most of the time, if I’m like, ‘here’s an algorithms problem’, you can like– if you just generate some random algorithms problems, a lot of them are going to be impossible. Then amongst the ones that are possible, a lot of them are going to be soluble in a year of effort and amongst the rest, a lot of them are going to be soluble in 10 or a hundred years of effort. It’s just kind of rare that you find a problem that’s soluble– by soluble, I don’t just mean soluble by human civilization, I mean like, they are not provably impossible– that takes a huge amount of effort.

It normally… it’s less likely to happen the cleaner the problem is. There just aren’t many very clean algorithmic problems where our society worked on it for 10 years and then we’re like, ‘Oh geez, this still seems really hard.’ Examples are kind of like… factoring is an example of a problem we’ve worked a really long time on. It kind of has the shape, and this is the tendency on these sorts of problems, where there’s just a whole bunch of solutions and we hack away and we’re a bit better and a bit better and a bit better. It’s a very messy landscape, rather than jumping from having no solution to having a solution. It’s even rarer to have things where going from no solution to some solution is really possible but incredibly hard. There were some examples.

Robert Long: And you think that the problems we face are sufficiently similar?

Paul Christiano: I mean, I think this is going more into the like, ‘I don’t know man’, but what I think when I say ‘I don’t know man’ isn’t like, ‘Therefore, there’s an 80% chance that it’s going to be an incredibly difficult problem’ because that’s not what my prior is like. I’m like, reasonable chance it’s not that hard. Some chance it’s really hard. Probably more chance that– if it’s really hard, I think it’s more likely to be because all the clean statements of the problem are impossible. I think as statements get messier it becomes more plausible that it just takes a lot of effort. The more messy a thing is, the less likely it is to be impossible sometimes, but also the more likely it’s just a bunch of stuff you have to do.

Ronny Fernandez: It seems like one disagreement that you have with MIRI folks is that you think prosaic AGI will be easier to align than they do. Does that perception seem right to you?

Paul Christiano: I think so. I think they’re probably just like, ‘that seems probably impossible’. Was related to the previous point.

Ronny Fernandez: If you had found out that prosaic AGI is nearly impossible to align or is impossible to align, how much would that change your-

Paul Christiano: It depends exactly what you found out, exactly how you found it out, et cetera. One thing you could be told is that there’s no perfectly scalable mechanism where you can throw in your arbitrarily sophisticated AI and turn the crank and get out an arbitrarily sophisticated aligned AI. That’s a possible outcome. That’s not necessarily that damning because now you’re like okay, fine, you can almost do it basically all the time and whatever.

That’s a big class of worlds and that would definitely be a thing I would be interested in understanding– how large is that gap actually, if the nice problem was totally impossible? If at the other extreme you just told me, ‘Actually, nothing like this is at all going to work, and it’s definitely going to kill everyone if you build an AI using anything like an extrapolation of existing techniques’, then I’m like, ‘Sounds pretty bad.’ I’m still not as pessimistic as MIRI people.

I’m like, maybe people just won’t destroy the world, you know, it’s hard to say. It’s hard to say what they’ll do. It also depends on the nature of how you came to know this thing. If you came to know it in a way that’s convincing to a reasonably broad group of people, that’s better than if you came to know it and your epistemic state was similar to– I think MIRI people feel more like, it’s already known to be hard, and therefore you can tell it’s hard even if you can’t convince people it’s hard. Whereas I’m like, I’m not yet convinced it’s hard, so I’m not so surprised that you can’t convince people it’s hard.

Then there’s more probability, if it was known to be hard, that we can convince people, and therefore I’m optimistic about outcomes conditioned on knowing it to be hard. I might become almost as pessimistic as MIRI if I thought that the problem was insolubly hard, just going to take forever or whatever, huge gaps aligning prosaic AI, and there would be no better evidence of that than currently exists. Like there’s no way to explain it better to people than MIRI currently can. If you take those two things, I’m maybe getting closer to MIRI’s levels of doom probability. I might still not be quite as doomy as them.

Ronny Fernandez: Why does the ability to explain it matter so much?

Paul Christiano: Well, a big part of why you don’t expect people to build unaligned AI is they’re like, they don’t want to. The clearer it is and the stronger the case, the more people can potentially do something. In particular, you might get into a regime where you’re doing a bunch of shit by trial and error and trying to wing it. And if you have some really good argument that the winging it is not going to work, then that’s a very different state than if you’re like, ‘Well, winging it doesn’t seem that good. Maybe it’ll fail.’ It’s different to be like, ‘Oh no, here’s an argument. You just can’t… It’s just not going to work.’

I don’t think we’ll really be in that state, but there’s like a whole spectrum from where we’re at now to that state and I expect to be further along it, if in fact we’re doomed. For example, if I personally would be like, ‘Well, I at least tried the thing that seemed obvious to me to try and now we know that doesn’t work.’ I sort of expect very directly from trying that to learn something about why that failed and what parts of the problem seem difficult.

Ronny Fernandez: Do you have a sense of why MIRI thinks aligning prosaic AI is so hard?

Paul Christiano: We haven’t gotten a huge amount of traction on this when we’ve debated it. I think part of their position, especially on the winging it thing, is they’re like – Man, doing things right generally seems a lot harder than doing them. I guess building an AI that’s good, for some arbitrary notion of good, will probably be harder– a lot harder than just building an AI at all.

There’s a theme that comes up frequently trying to hash this out, and it’s not so much about a theoretical argument, it’s just like, look, the theoretical argument establishes that there’s something a little bit hard here. And once you have something a little bit hard and now you have some giant organization, people doing the random shit they’re going to do, and all that chaos, and like, getting things to work takes all these steps, and getting this harder thing to work is going to have some extra steps, and everyone’s going to be doing it. They’re more pessimistic based on those kinds of arguments.

That’s the thing that comes up a lot. I think probably most of the disagreement is still in the, you know, theoretically, how much– certainly we disagree about like, can this problem just be solved on paper in advance? Where I’m like, reasonable chance, you know? At least a third chance, they’ll just on paper be like, ‘We have nailed it.’ There’s really no tension, no additional engineering effort required. And they’re like, that’s like zero. I don’t know what they think it is. More than zero, but low.

Ronny Fernandez: Do you guys think you’re talking about the same problem exactly?

Paul Christiano: I think there we are probably. At that step we are. Just like, is your AI trying to destroy everything? Yes. No. The main place there’s some bleed over– the main thing that MIRI maybe considers in scope and I don’t is like, if you build an AI, it may someday have to build another AI. And what if the AI it builds wants to destroy everything? Is that our fault or is that the AI’s fault? And I’m more on like, that’s the AI’s fault. That’s not my job. MIRI’s maybe more like not distinguishing those super cleanly, but they would say that’s their job. The distinction is a little bit subtle in general, but-

Ronny Fernandez: I guess I’m not sure why you cashed that out in terms of fault.

Paul Christiano: I think for me it’s mostly like: there’s a problem we can hope to resolve. I think there’s two big things. One is like, suppose you don’t resolve that problem. How likely is it that someone else will solve it? Saying it’s someone else’s fault is in part just saying like, ‘Look, there’s this other person who had a reasonable opportunity to solve it and it was a lot smarter than us.’ So the work we do is less likely to make the difference between it being soluble or not. Because there’s this other smarter person.

And then the other thing is like, what should you be aiming for? To the extent there’s a clean problem here which one could hope to solve, or one should bite off as a chunk, what fits in conceptually the same problem versus what’s like– you know, an analogy I sometimes make is, if you build an AI that’s doing important stuff, it might mess up in all sorts of ways. But when you’re asking, ‘Is my AI going to mess up when building a nuclear reactor?’ It’s a thing worth reasoning about as an AI person, but also like it’s worth splitting into like– part of that’s an AI problem, and part of that’s a problem about understanding and managing nuclear waste. Part of that should be done by people reasoning about nuclear waste and part of it should be done by people reasoning about AI.

This is a little subtle because both of the problems have to do with AI. I would say my relationship with that is similar to like, suppose you told me that at some future point, some smart people might make an AI. There’s just a meta and object level on which you could hope to help with the problem.

I’m hoping to help with the problem on the object level in the sense that we are going to do research which helps people align AI, and in particular, will help the future AI align the next AI. Because it’s like people. It’s at that level, rather than being like, ‘We’re going to construct a constitution of that AI such that when it builds future AI it will always definitely work’. This is related to like– there’s this old argument about recursive self-improvement. It’s historically figured a lot in people’s discussion of why the problem is hard, but on a naive perspective it’s not obvious why it should, because you do only a small number of large modifications before your systems are sufficiently intelligent relative to you that it seems like your work should be obsolete. Plus like, them having a bunch of detailed knowledge on the ground about what’s going down.

It seems unclear to me how– yeah, this is related to our disagreement– how much you’re happy just deferring to the future people and being like, ‘Hope that they’ll cope’. Maybe they won’t even cope by solving the problem in the same way, they might cope by, the crazy AIs that we built reach the kind of agreement that allows them to not build even crazier AIs in the same way that we might do that. I think there’s some general frame of, I’m just taking responsibility for less, and more saying, can we leave the future people in a situation that is roughly as good as our situation? And by future people, I mean mostly AIs.

Ronny Fernandez: Right. The two things that you think might explain your relative optimism are something like: Maybe we can get the problem to smarter agents that are humans. Maybe we can leave the problem to smarter agents that are not humans.

Paul Christiano: Also a lot of disagreement about the problem. Those are certainly two drivers. They’re not exhaustive in the sense that there’s also a huge amount of disagreement about like, ‘How hard is this problem?’ Which is some combination of like, ‘How much do we know about it?’ Where they’re more like, ‘Yeah, we’ve thought about it a bunch and have some views.’ And I’m like, ‘I don’t know, I don’t think I really know shit.’ Then part of it is concretely there’s a bunch of– on the object level, there’s a bunch of arguments about why it would be hard or easy so we don’t reach agreement. We consistently disagree on lots of those points.

Ronny Fernandez: Do you think the goal state for you guys is the same though? If I gave you guys a bunch of AGIs, would you guys agree about which ones are aligned and which ones are not? If you could know all of their behaviors?

Paul Christiano: I think at that level we’d probably agree. We don’t agree more broadly about what constitutes a win state or something. They have this more expansive conception– or I guess it’s narrower– that the win state is supposed to do more. They are imagining more that you’ve resolved this whole list of future challenges. I’m more not counting that.

We’ve had this… yeah, I guess I now mostly use intent alignment to refer to this problem where there’s risk of ambiguity… the problem that I used to call AI alignment. There was a long obnoxious back and forth about what the alignment problem should be called. MIRI does use aligned AI to be like, ‘an AI that produces good outcomes when you run it’. Which I really object to as a definition of aligned AI a lot. So if they’re using that as their definition of aligned AI, we would probably disagree.

Ronny Fernandez: Shifting terms or whatever… one thing that they’re trying to work on is making an AGI that has a property that is also the property you’re trying to make sure that AGI has.

Paul Christiano: Yeah, we’re all trying to build an AI that’s trying to do the right thing.

Ronny Fernandez: I guess I’m thinking more specifically, for instance, I’ve heard people at MIRI say something like, they want to build an AGI that I can tell it, ‘Hey, figure out how to copy a strawberry, and don’t mess anything else up too badly.’ Does that seem like the same problem that you’re working on?

Paul Christiano: I mean it seems like in particular, you should be able to do that. I think it’s not clear whether that captures all the complexity of the problem. That’s just sort of a question about what solutions end up looking like, whether that turns out to have the same difficulty.

The other things you might think are involved that are difficult are… well, I guess one problem is just how you capture competitiveness. Competitiveness for me is a key desideratum. And it’s maybe easy to elide in that setting, because it just makes a strawberry. Whereas I am like, if you make a strawberry literally as well as anyone else can make a strawberry, it’s just a little weird to talk about. And it’s a little weird to even formalize what competitiveness means in that setting. I think you probably can, but whether or not you do that’s not the most natural or salient aspect of the situation.

So I probably disagree with them about– I’m like, there are probably lots of ways to have agents that make strawberries and are very smart. That’s just another disagreement that’s another function of the same basic, ‘How hard is the problem’ disagreement. I would guess relative to me, in part because of being more pessimistic about the problem, MIRI is more willing to settle for an AI that does one thing. And I care more about competitiveness.

Asya Bergal: Say you just learn that prosaic AI is just not going to be the way we get to AGI. How does that make you feel about the IDA approach versus the MIRI approach?

Paul Christiano: So my overall stance when I think about alignment is, there’s a bunch of possible algorithms that you could use. And the game is understanding how to align those algorithms. And it’s kind of a different game. There’s a lot of common subproblems in between different algorithms you might want to align, it’s potentially a different game for different algorithms. That’s an important part of the answer. I’m mostly focusing on the ‘align this particular’– I’ll call it learning, but it’s a little bit more specific than learning– where you search over policies to find a policy that works well in practice. If we’re not doing that, then maybe that solution is totally useless, maybe it has common subproblems with the solution you actually need. That’s one part of the answer.

Another big difference is going to be, timelines views will shift a lot if you’re handed that information. So it will depend exactly on the nature of the update. I don’t have a strong view about whether it makes my timelines shorter or longer overall. Maybe you should bracket that though.

In terms of returning to the first one of trying to align particular algorithms, I don’t know. I think I probably share some of the MIRI persp– well, no. It feels to me like there’s a lot of common subproblems. Aligning expert systems seems like it would involve a lot of the same reasoning as aligning learners. To the extent that’s true, probably future stuff also will involve a lot of the same subproblems, but I doubt the algorithm will look the same. I also doubt the actual algorithm will look anything like a particular pseudocode we might write down for iterated amplification now.

Asya Bergal: Does iterated amplification in your mind rely on this thing that searches through policies for the best policy? The way I understand it, it doesn’t feel like it necessarily does.

Paul Christiano: So, you use this distillation step. And the reason you want to do amplification, or this short-hop, expensive amplification, is because you interleave it with this distillation step. And I normally imagine the distillation step as being, learn a thing which works well in practice on a reward function defined by the overseer. You could imagine other things that also needed to have this framework, but it’s not obvious whether you need this step if you didn’t somehow get granted something like the–

Asya Bergal: That you could do the distillation step somehow.

Paul Christiano: Yeah. It’s unclear what else would– so another example of a thing that could fit in, and this maybe makes it seem more general, is if you had an agent that was just incentivized to make lots of money. Then you could just have your distillation step be like, ‘I randomly check the work of this person, and compensate them based on the work I checked’. That’s a suggestion of how this framework could end up being more general.

But I mostly do think about it in the context of learning in particular. I think it’s relatively likely to change if you’re not in that setting. Well, I don’t know. I don’t have a strong view. I’m mostly just working in that setting, mostly because it seems reasonably likely, seems reasonably likely to have a bunch in common, learning is reasonably likely to appear even if other techniques appear. That is, learning is likely to play a part in powerful AI even if other techniques also play a part.
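
[Editor’s note: for readers unfamiliar with iterated amplification, here is a toy sketch of the interleaved amplify/distill loop being discussed. It is purely illustrative and not Christiano’s scheme: the task, the decomposition, and the table-based ‘distillation’ are stand-ins for what would really be a learned model trained against a reward defined by the overseer.]

```python
def amplify(agent):
    # Expensive step: answer a question by decomposing it and delegating a
    # subquestion to the current (cheap) agent. Here the "question" is an
    # integer n and the target behavior is computing 1 + 2 + ... + n.
    def amplified(n):
        return 0 if n == 0 else n + agent(n - 1)
    return amplified

def distill(amplified_agent, questions):
    # Cheap step: in the real proposal this is a model trained to do well on a
    # reward defined by the (amplified) overseer; here we just record answers.
    answers = {n: amplified_agent(n) for n in questions}
    return lambda n: answers.get(n, 0)

agent = lambda n: 0              # weak initial agent
for _ in range(5):               # interleave amplification and distillation
    agent = distill(amplify(agent), questions=range(10))

print(agent(4))                  # 10: after enough rounds the distilled agent sums 1..4
```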

Asya Bergal: Are there other people or resources that you think would be good for us to look at if we were looking at the optimism view?

Paul Christiano: Before we get to resources or people, I think one of the basic questions is, there’s this perspective which is fairly common in ML, which is like, ‘We’re kind of just going to do a bunch of stuff, and it’ll probably work out’. That’s probably the basic thing to be getting at. How right is that?

This is the bad view of safety conditioned on– I feel like prosaic AI is in some sense the worst– seems like about as bad as things would have gotten in terms of alignment. Where, I don’t know, you try a bunch of shit, just a ton of stuff, a ton of trial and error seems pretty bad. Anyway, this is a random aside maybe more related to the previous point. But yeah, this is just with alignment. There’s this view in ML that’s relatively common that’s like, we’ll try a bunch of stuff to get the AI to do what we want, it’ll probably work out. Some problems will come up. We’ll probably solve them. I think that’s probably the most important thing in the optimism vs pessimism side.

And I don’t know, I mean this has been a project that like, it’s a hard project. I think the current state of affairs is like, the MIRI folk have strong intuitions about things being hard. Essentially no one in… very few people in ML agree with those, or even understand where they’re coming from. And even people in the EA community who have tried a bunch to understand where they’re coming from mostly don’t. Mostly people either end up understanding one side or the other and don’t really feel like they’re able to connect everything. So it’s an intimidating project in that sense. I think the MIRI people are the main proponents of the everything is doomed, the people to talk to on that side. And then in some sense there’s a lot of people on the other side who you can talk to, and the question is just, who can articulate the view most clearly? Or who has most engaged with the MIRI view such that they can speak to it?

Ronny Fernandez: Those are people I would be particularly interested in. If there are people that understand all the MIRI arguments but still have broadly the perspective you’re describing, like some problems will come up, probably we’ll fix them.

Paul Christiano: I don’t know good– I don’t have good examples of people for you. I think most people just find the MIRI view kind of incomprehensible, or like, it’s a really complicated thing, even if the MIRI view makes sense on its face. I don’t think people have gotten enough into the weeds. It really rests a lot right now on this fairly complicated cluster of intuitions. I guess on the object level, I think I’ve just engaged a lot more with the MIRI view than most people who are– who mostly take the ‘everything will be okay’ perspective. So happy to talk on the object level, and speaking more to arguments. I think it’s a hard thing to get into, but it’s going to be even harder to find other people in ML who have engaged with the view that much.

They might be able to make other general criticisms of like, here’s why I haven’t really… like it doesn’t seem like a promising kind of view to think about. I think you could find more people who have engaged at that level. I don’t know who I would recommend exactly, but I could think about it. Probably a big question will be who is excited to talk to you about it.

Asya Bergal: I am curious about your response to MIRI’s object level arguments. Is there a place that exists somewhere?

Paul Christiano: There’s some back and forth on the internet. I don’t know if it’s great. There’s some LessWrong posts. Eliezer for example wrote this post about why things were doomed, why I in particular was doomed. I don’t know if you read that post.

Asya Bergal: I can also ask you about it now, I just don’t want to take too much of your time if it’s a huge body of things.

Paul Christiano: The basic argument would be like, 1) On paper I don’t think we yet have a good reason to feel doomy. And I think there’s some basic research intuition about how much a problem– suppose you poke at a problem a few times, and you’re like ‘Agh, seems hard to make progress’. How much do you infer that the problem’s really hard? And I’m like, not much. As a person who’s poked at a bunch of problems, let me tell you, that often doesn’t work and then you solve it in like 10 years of effort.

So that’s one thing. That’s a point where I have relatively little sympathy for the MIRI way. That’s one set of arguments: is there a good way to get traction on this problem? Are there clever algorithms? I’m like, I don’t know, I don’t feel like the kind of evidence we’ve seen is the kind of evidence that should be persuasive. As some evidence in that direction, I’d be like, I have not been thinking about this that long. I feel like there have often been things that felt like, or that MIRI would have defended as like, here’s a hard obstruction. Then you think about it and you’re actually like, ‘Here are some things you can do.’ And it may still be an obstruction, but it’s no longer quite so obvious where it is, and there were avenues of attack.

That’s one thing. The second thing is like, a metaphor that makes me feel good– MIRI talks a lot about the evolution analogy. If I imagine the evolution problem– so if I’m a person, and I’m breeding some animals, I’m breeding some superintelligence. Suppose I wanted to breed an animal modestly smarter than humans that is really docile and friendly. I’m like, I don’t know man, that seems like it might work. That’s where I’m at. I think they are… it’s been a little bit hard to track down this disagreement, and I think this is maybe in a fresher, rawer state than the other stuff, where we haven’t had enough back and forth.

But I’m like, it doesn’t sound necessarily that hard. I just don’t know. I think their position, their position when they’ve written something has been a little bit more like, ‘But you couldn’t breed a thing that, after undergoing radical changes in intelligence or situation, would remain friendly’. But then I’m normally like, but it’s not clear why that’s needed? I would really just like to create something slightly superhuman, and it’s going to work with me to breed something that’s slightly smarter still that is friendly.

We haven’t really been able to get traction on that. I think they have an intuition that maybe there’s some kind of invariance and things become gradually more unraveled as you go on. Whereas I have more intuition that it’s plausible. After this generation, there’s just smarter and smarter people thinking about how to keep everything on the rails. It’s very hard to know.

That’s the second thing. I have found that really… that feels like it gets to the heart of some intuitions that are very different, and I don’t understand what’s up there. There’s a third category which is like, on the object level, there’s a lot of directions that I’m enthusiastic about where they’re like, ‘That seems obviously doomed’. So you could divide those up into the two problems. There’s the family of problems that are more like the inner alignment problem, and then outer alignment stuff.

On the inner alignment stuff, I haven’t thought that much about it, but examples of things that I’m optimistic about that they’re super pessimistic about are like, stuff that looks more like verification, or maybe stepping back even from that, there’s this basic paradigm of adversarial training, where I’m like, it seems close to working. And you could imagine it being like, it’s just a research problem to fill in the gaps. Whereas they’re like, that’s so not the kind of thing that would work. I don’t really know where we’re at with that. I do see there are formal obstructions to adversarial training in particular working. I’m like, I see why this is not yet a solution. For example, you can have this case where there’s a predicate that the model checks, and it’s easy to check but hard to construct examples of. And then in your adversarial training you can’t ever feed an example where it’ll fail. So we get into like, is it plausible that you can handle that problem with either 1) Doing something more like verification, where you ask the model to perform well not just on real inputs but on pseudo-inputs. Or like, you ask the attacker just to show how it’s conceivable that the model could do a bad thing in some sense.

That’s one possible approach, where the other would be something more like interpretability, where you say like, ‘Here’s what the model is doing. In addition to its behavior we get this other signal, that the behavior was depending on this fact, this predicate it checks, which it shouldn’t have been dependent on.’ The question is, can either of those yield good behavior? I’m like, I don’t know, man. It seems plausible. And they’re like ‘Definitely not.’ And I’m like, ‘Why definitely not?’ And they’re like ‘Well, that’s not getting at the real essence of the problem.’ And I’m like ‘Okay, great, but how did you substantiate this notion of the real essence of the problem? Where is that coming from? Is that coming from a whole bunch of other solutions that look plausible that failed?’ And their take is kind of like, yes, and I’m like, ‘But none of those– there weren’t actually even any candidate solutions there really that failed yet. You’ve got maybe one thing, or like, you showed there exists a problem in some minimal sense.’ This comes back to the first of the three things I listed. But it’s a little bit different in that I think you can just stare at particular things and they’ll be like, ‘Here’s how that particular thing is going to fail.’ And I’m like ‘I don’t know, it seems plausible.’
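
[Editor’s note: a toy illustration of the obstruction to adversarial training that Christiano describes above, namely a predicate that is easy to check but whose satisfying inputs are hard to construct, so an adversarial trainer essentially never surfaces the bad behavior. The code is ours and purely hypothetical.]

```python
import hashlib
import os

def trigger(x: bytes) -> bool:
    # Easy to verify for any given x, but finding an x that satisfies it
    # requires an astronomically large brute-force search (~16**12 tries).
    return hashlib.sha256(x).hexdigest().startswith("0" * 12)

def policy(x: bytes) -> str:
    # A model that behaves well except on the hard-to-construct trigger inputs.
    return "defect" if trigger(x) else "behave"

# An adversary sampling random inputs will almost surely never exhibit the failure:
print(any(policy(os.urandom(32)) == "defect" for _ in range(100_000)))  # False
```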

That’s on inner alignment. And there’s maybe some on outer alignment. I feel like they’ve given a lot of ground in the last four years on how doomy things seem on outer alignment. I think they still have some– if we’re talking about amplification, I think the position would still be, ‘Man, why would that agent be aligned? It doesn’t at all seem like it would be aligned.’ That has also been a little bit surprisingly tricky to make progress on. I think it’s similar, where I’m like, yeah, I grant the existence of some problem or some thing which needs to be established, but I don’t grant– I think their position would be like, this hasn’t made progress or just like, pushed around the core difficulty. I’m like, I don’t grant the conception of the core difficulty in which this has just pushed around the core difficulty. I think that… substantially in that kind of thing, being like, here’s an approach that seems plausible, we don’t have a clear obstruction but I think that it is doomed for these deep reasons. I have maybe a higher bar for what kind of support the deep reasons need.

I also just think on the merits, they have not really engaged with– and this is partly my responsibility for not having articulated the arguments in a clear enough way– although I think they have not engaged with even the clearest articulation as of two years ago of what the hope was. But that’s probably on me for not having an even clearer articulation than that, and also definitely not up to them to engage with anything. To the extent it’s a moving target, not up to them to engage with the most recent version. Where, most recent version– the proposal doesn’t really change that much, or like, the case for optimism has changed a little bit. But it’s mostly just like, the state of argument concerning it, rather than the version of the scheme.