# Clarifying “AI Alignment”

When I say an AI A is al­igned with an op­er­a­tor H, I mean:

A is try­ing to do what H wants it to do.

The “al­ign­ment prob­lem” is the prob­lem of build­ing pow­er­ful AI sys­tems that are al­igned with their op­er­a­tors.

This is sig­nifi­cantly nar­rower than some other defi­ni­tions of the al­ign­ment prob­lem, so it seems im­por­tant to clar­ify what I mean.

In par­tic­u­lar, this is the prob­lem of get­ting your AI to try to do the right thing, not the prob­lem of figur­ing out which thing is right. An al­igned AI would try to figure out which thing is right, and like a hu­man it may or may not suc­ceed.

## Analogy

Con­sider a hu­man as­sis­tant who is try­ing their hard­est to do what H wants.

I’d say this as­sis­tant is al­igned with H. If we build an AI that has an analo­gous re­la­tion­ship to H, then I’d say we’ve solved the al­ign­ment prob­lem.

“Aligned” doesn’t mean “perfect”:

• They could mi­s­un­der­stand an in­struc­tion, or be wrong about what H wants at a par­tic­u­lar mo­ment in time.

• They may not know ev­ery­thing about the world, and so fail to rec­og­nize that an ac­tion has a par­tic­u­lar bad side effect.

• They may not know ev­ery­thing about H’s prefer­ences, and so fail to rec­og­nize that a par­tic­u­lar side effect is bad.

• They may build an un­al­igned AI (while at­tempt­ing to build an al­igned AI).

I use al­ign­ment as a state­ment about the mo­tives of the as­sis­tant, not about their knowl­edge or abil­ity. Im­prov­ing their knowl­edge or abil­ity will make them a bet­ter as­sis­tant — for ex­am­ple, an as­sis­tant who knows ev­ery­thing there is to know about H is less likely to be mis­taken about what H wants — but it won’t make them more al­igned.

(For very low ca­pa­bil­ities it be­comes hard to talk about al­ign­ment. For ex­am­ple, if the as­sis­tant can’t rec­og­nize or com­mu­ni­cate with H, it may not be mean­ingful to ask whether they are al­igned with H.)

## Clarifications

• The defi­ni­tion is in­tended de dicto rather than de re. An al­igned A is try­ing to “do what H wants it to do.” Sup­pose A thinks that H likes ap­ples, and so goes to the store to buy some ap­ples, but H re­ally prefers or­anges. I’d call this be­hav­ior al­igned be­cause A is try­ing to do what H wants, even though the thing it is try­ing to do (“buy ap­ples”) turns out not to be what H wants: the de re in­ter­pre­ta­tion is false but the de dicto in­ter­pre­ta­tion is true.

• An al­igned AI can make er­rors, in­clud­ing moral or psy­cholog­i­cal er­rors, and fix­ing those er­rors isn’t part of my defi­ni­tion of al­ign­ment ex­cept in­so­far as it’s part of get­ting the AI to “try to do what H wants” de dicto. This is a crit­i­cal differ­ence be­tween my defi­ni­tion and some other com­mon defi­ni­tions. I think that us­ing a broader defi­ni­tion (or the de re read­ing) would also be defen­si­ble, but I like it less be­cause it in­cludes many sub­prob­lems that I think (a) are much less ur­gent, (b) are likely to in­volve to­tally differ­ent tech­niques than the ur­gent part of al­ign­ment.

• An al­igned AI would also be try­ing to do what H wants with re­spect to clar­ify­ing H’s prefer­ences. For ex­am­ple, it should de­cide whether to ask if H prefers ap­ples or or­anges, based on its best guesses about how im­por­tant the de­ci­sion is to H, how con­fi­dent it is in its cur­rent guess, how an­noy­ing it would be to ask, etc. Of course, it may also make a mis­take at the meta level — for ex­am­ple, it may not un­der­stand when it is OK to in­ter­rupt H, and there­fore avoid ask­ing ques­tions that it would have been bet­ter to ask.

• This defi­ni­tion of “al­ign­ment” is ex­tremely im­pre­cise. I ex­pect it to cor­re­spond to some more pre­cise con­cept that cleaves re­al­ity at the joints. But that might not be­come clear, one way or the other, un­til we’ve made sig­nifi­cant progress.

• One rea­son the defi­ni­tion is im­pre­cise is that it’s un­clear how to ap­ply the con­cepts of “in­ten­tion,” “in­cen­tive,” or “mo­tive” to an AI sys­tem. One naive ap­proach would be to equate the in­cen­tives of an ML sys­tem with the ob­jec­tive it was op­ti­mized for, but this seems to be a mis­take. For ex­am­ple, hu­mans are op­ti­mized for re­pro­duc­tive fit­ness, but it is wrong to say that a hu­man is in­cen­tivized to max­i­mize re­pro­duc­tive fit­ness.

• “What H wants” is even more prob­le­matic than “try­ing.” Clar­ify­ing what this ex­pres­sion means, and how to op­er­a­tional­ize it in a way that could be used to in­form an AI’s be­hav­ior, is part of the al­ign­ment prob­lem. Without ad­di­tional clar­ity on this con­cept, we will not be able to build an AI that tries to do what H wants it to do.
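The ask-or-guess trade-off described in the clarifications above (how important the decision is to H, how confident the assistant is, how annoying it would be to ask) can be illustrated with a toy expected-loss rule. This is only a sketch under strong assumptions: the names `Guess` and `should_ask` are hypothetical, the three quantities are assumed to be numbers on a common scale, and asking is assumed to resolve the uncertainty completely. Note that the inputs are themselves the assistant's estimates, so an aligned assistant can still get this meta-level decision wrong.

```python
from dataclasses import dataclass


@dataclass
class Guess:
    """The assistant's current belief about one of H's preferences (hypothetical fields)."""
    p_correct: float  # confidence that the current guess (e.g. "apples") is right
    stakes: float     # estimated loss to H if the assistant acts on a wrong guess
    ask_cost: float   # estimated annoyance to H of being interrupted with a question


def should_ask(guess: Guess) -> bool:
    """Ask only when the expected loss from acting on the guess exceeds the cost of asking."""
    expected_loss_if_act = (1.0 - guess.p_correct) * guess.stakes
    return expected_loss_if_act > guess.ask_cost


# Low stakes, fairly confident: just buy the apples.
low_stakes = should_ask(Guess(p_correct=0.8, stakes=1.0, ask_cost=0.5))
# Same confidence, much higher stakes: worth interrupting H.
high_stakes = should_ask(Guess(p_correct=0.8, stakes=10.0, ask_cost=0.5))
print(low_stakes, high_stakes)
```

An assistant that overestimates `ask_cost` would stay silent when it should have asked, which is the kind of meta-level mistake that is compatible with being aligned in the sense used here.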

## Postscript on ter­minolog­i­cal history

I origi­nally de­scribed this prob­lem as part of “the AI con­trol prob­lem,” fol­low­ing Nick Bostrom’s us­age in Su­per­in­tel­li­gence, and used “the al­ign­ment prob­lem” to mean “un­der­stand­ing how to build AI sys­tems that share hu­man prefer­ences/​val­ues” (which would in­clude efforts to clar­ify hu­man prefer­ences/​val­ues).

I adopted the new ter­minol­ogy af­ter some peo­ple ex­pressed con­cern with “the con­trol prob­lem.” There is also a slight differ­ence in mean­ing: the con­trol prob­lem is about cop­ing with the pos­si­bil­ity that an AI would have differ­ent prefer­ences from its op­er­a­tor. Align­ment is a par­tic­u­lar ap­proach to that prob­lem, namely avoid­ing the prefer­ence di­ver­gence al­to­gether (so ex­clud­ing tech­niques like “put the AI in a re­ally se­cure box so it can’t cause any trou­ble”). There cur­rently seems to be a ten­ta­tive con­sen­sus in fa­vor of this ap­proach to the con­trol prob­lem.

I don’t have a strong view about whether “al­ign­ment” should re­fer to this prob­lem or to some­thing differ­ent. I do think that some term needs to re­fer to this prob­lem, to sep­a­rate it from other prob­lems like “un­der­stand­ing what hu­mans want,” “solv­ing philos­o­phy,” etc.

• Crys­tal­lized my view of what the “core prob­lem” is (as I ex­plained in a com­ment on this post). I think I had in­tu­itions of this form be­fore, but at the very least this post clar­ified them.

• In this es­say Paul Chris­ti­ano pro­poses a defi­ni­tion of “AI al­ign­ment” which is more nar­row than other defi­ni­tions that are of­ten em­ployed. Speci­fi­cally, Paul sug­gests defin­ing al­ign­ment in terms of the mo­ti­va­tion of the agent (which should be, helping the user), rather than what the agent ac­tu­ally does. That is, as long as the agent “means well”, it is al­igned, even if er­rors in its as­sump­tions about the user’s prefer­ences or about the world at large lead it to ac­tions that are bad for the user.

Ro­hin Shah’s com­ment on the es­say (which I be­lieve is en­dorsed by Paul) re­frames it as a par­tic­u­lar way to de­com­pose the AI safety prob­lem. An of­ten used de­com­po­si­tion is “defi­ni­tion-op­ti­miza­tion”: first we define what it means for an AI to be safe, then we un­der­stand how to im­ple­ment a safe AI. In con­trast, Paul’s defi­ni­tion of al­ign­ment de­com­poses the AI safety prob­lem as “mo­ti­va­tion-com­pe­tence”: first we learn how to de­sign AIs with good mo­ti­va­tions, then we learn how to make them com­pe­tent. Both Paul and Ro­hin ar­gue that the “mo­ti­va­tion” is the ur­gent part of the prob­lem, the part on which tech­ni­cal AI safety re­search should fo­cus.

In con­trast, I will ar­gue that the “mo­ti­va­tion-com­pe­tence” de­com­po­si­tion is not as use­ful as Paul and Ro­hin be­lieve, and the “defi­ni­tion-op­ti­miza­tion” de­com­po­si­tion is more use­ful.

The thesis behind the “motivation-competence” decomposition implicitly assumes a linear, one-dimensional scale of competence. Agents with good motivations and subhuman competence might make silly mistakes but are not catastrophically dangerous (since they are subhuman). Agents with good motivations and superhuman competence will only make mistakes that are “forgivable” in the sense that our own mistakes would be as bad or worse. Ergo (the thesis concludes), good motivations are sufficient to solve AI safety.

However, in reality competence is multi-dimensional. AI systems can have subhuman skills in some domains and superhuman skills in other domains, as AI history has shown time and time again. This opens the possibility of agents that make “well intentioned” mistakes that take the form of sophisticated plans that are catastrophic for the user. Moreover, there might be limits to the agent’s knowledge about certain questions (such as the user’s preferences) that are inherent in the agent’s epistemology (more on this below). Given such limits, the agent’s competence becomes systematically lopsided. Furthermore, the elimination of such limits is a large part of the “definition” part in the “definition-optimization” framing that the thesis rejects.

As a consequence of the multi-dimensional nature of competence, the difference between “well intentioned mistake” and “malicious sabotage” is much less clear than naively assumed, and I’m not convinced there is a natural way to remove the ambiguity. For example, consider a superhuman AI Alpha subject to an acausal attack. In this scenario, some agent Beta in the “multiverse” (= prior) convinces Alpha that Alpha exists in a simulation controlled by Beta. The simulation is set up to look like the real Earth for a while, making it a plausible hypothesis. Then, a “treacherous turn” moment arrives in which the simulation diverges from Earth, in a way calculated to make Alpha take irreversible actions that are beneficial for Beta and disastrous for the user.

In the above scenario, is Alpha “motivation-aligned”? We could argue it is not, because it is running the malicious agent Beta. But we could also argue it is motivation-aligned and just makes the innocent mistake of falling for Beta’s trick. Perhaps it is possible to clarify the concept of “motivation” such that in this case, Alpha’s motivations are considered bad. But such a concept would depend in complicated ways on the agent’s internals. I think that this is a difficult and unnatural approach, compared to “definition-optimization” where the focus is not on the internals but on what the agent actually does (more on this later).

The possibility of acausal attacks is a symptom of the fact that environments with irreversible transitions are usually not learnable (this is the problem of traps in reinforcement learning, which I discussed for example here and here), i.e. it is impossible to guarantee convergence to optimal expected utility without further assumptions. When we add preference learning to the mix, the problem gets worse because now, even if there are no irreversible transitions, it is not clear the agent will converge to optimal utility. Indeed, depending on the value learning protocol, there might be uncertainties about the user’s preferences that the agent can never resolve (this is an example of what I meant by “inherent limits” before). For example, this happens in CIRL (even if the user is perfectly rational, this happens because the user and the AI have different action sets).
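To make the point about traps concrete, here is a minimal toy sketch (not taken from the linked posts). It assumes a hypothetical environment with two candidate worlds that look identical until the agent tries a "risky" action: in the "benign" world the action pays well forever, in the "trap" world it leads to an absorbing state with no reward. The function names and reward values are illustrative assumptions, and the policy class is deliberately restricted to a single initial choice.

```python
def total_reward(first_action: str, world: str, horizon: int) -> float:
    """Total reward over `horizon` steps in a two-world toy environment.

    Hypothetical setup: "safe" pays 0.5 per step in both worlds. "risky" pays
    1.0 per step in world "benign", but in world "trap" it moves the agent to
    an absorbing state that pays 0.0 forever.
    """
    if first_action == "safe":
        # Nothing is learned about "risky", so the agent keeps playing safe.
        return 0.5 * horizon
    if world == "benign":
        return 1.0 * horizon  # risky turned out fine, keep exploiting it
    return 0.0                # trapped: no later action can recover


def worst_case_regret(first_action: str, horizon: int) -> float:
    """Regret against the best choice in hindsight, maximised over both worlds."""
    regrets = []
    for world in ("benign", "trap"):
        best = max(total_reward(a, world, horizon) for a in ("safe", "risky"))
        regrets.append(best - total_reward(first_action, world, horizon))
    return max(regrets)


for action in ("safe", "risky"):
    print(action, worst_case_regret(action, horizon=100))
```

In this toy, both choices have worst-case regret that grows linearly with the horizon, so no amount of additional time lets the agent converge to optimal utility in both worlds. That is the sense in which such an environment class is not learnable without further assumptions.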

These difficulties with the “motivation-competence” framing are much more natural to handle in the “definition-optimization” framing. Moreover, the latter already produced viable directions for mathematical formalization, and the former has not (AFAIK). Specifically, the mathematical criteria of alignment I proposed are the “dynamic subjective regret bound” and the “dangerousness bound”. The former is a criterion which simultaneously guarantees motivation-alignment and competence (as evidence that this criterion can be satisfied, I have the Dialogic Reinforcement Learning proposal). The latter is a criterion that doesn’t guarantee competence in general, but specifically guarantees avoiding catastrophic mistakes. This makes it closer to motivation-alignment compared to subjective regret, but different in important ways: it refers to the actual things that the agent does, and the ways in which these things might have catastrophic consequences.

In summary, I am skeptical that “motivation” and “competence” can be cleanly separated in a way that is useful for AI safety, whereas “definition” and “optimization” can be so separated: for example the dynamic subjective regret bound is a “definition” whereas dialogic RL and putative more concrete implementations thereof are “optimizations”. My specific proposals might have fatal flaws that have not been discovered yet, but I believe that the general principle of “definition-optimization” is sound, while “motivation-competence” is not.

• This opens the possibility of agents that make “well intentioned” mistakes that take the form of sophisticated plans that are catastrophic for the user.

Agreed that this is in the­ory pos­si­ble, but it would be quite sur­pris­ing, es­pe­cially if we are speci­fi­cally aiming to train sys­tems that be­have cor­rigibly.

In the above sce­nario, is Alpha “mo­ti­va­tion-al­igned”

If Alpha can pre­dict that the user would say not to do the ir­re­versible ac­tion, then at the very least it isn’t cor­rigible, and it would be rather hard to ar­gue that it is in­tent al­igned.

But, such a con­cept would de­pend in com­pli­cated ways on the agent’s in­ter­nals.

That, or it could de­pend on the agent’s coun­ter­fac­tual be­hav­ior in other situ­a­tions. I agree it can’t be just the ac­tion cho­sen in the par­tic­u­lar state.

More­over, the lat­ter already pro­duced vi­able di­rec­tions for math­e­mat­i­cal for­mal­iza­tion, and the former has not (AFAIK).

I guess you wouldn’t count uni­ver­sal­ity. Over­all I agree. I’m rel­a­tively pes­simistic about math­e­mat­i­cal for­mal­iza­tion. (Prob­a­bly not worth de­bat­ing this point; feels like peo­ple have talked about it at length in Real­ism about ra­tio­nal­ity with­out mak­ing much progress.)

it refers to the ac­tual things that agent does, and the ways in which these things might have catas­trophic con­se­quences.

I do want to note that all of these re­quire you to make as­sump­tions of the form, “if there are traps, ei­ther the user or the agent already knows about them” and so on, in or­der to avoid no-free-lunch the­o­rems.

• This opens the possibility of agents that make “well intentioned” mistakes that take the form of sophisticated plans that are catastrophic for the user.

Agreed that this is in the­ory pos­si­ble, but it would be quite sur­pris­ing, es­pe­cially if we are speci­fi­cally aiming to train sys­tems that be­have cor­rigibly.

The acausal at­tack is an ex­am­ple of how it can hap­pen for sys­tem­atic rea­sons. As for the other part, that seems like con­ced­ing that in­tent-al­ign­ment is in­suffi­cient and you need “cor­rigi­bil­ity” as an­other con­di­tion (also it is not so clear to me what this con­di­tion means).

If Alpha can pre­dict that the user would say not to do the ir­re­versible ac­tion, then at the very least it isn’t cor­rigible, and it would be rather hard to ar­gue that it is in­tent al­igned.

It is pos­si­ble that Alpha can­not pre­dict it, be­cause in Beta-simu­la­tion-world the user would con­firm the ir­re­versible ac­tion. It is also pos­si­ble that the user would con­firm the ir­re­versible ac­tion in the real world be­cause the user is be­ing ma­nipu­lated, and what­ever defenses we put in place against ma­nipu­la­tion are thrown off by the simu­la­tion hy­poth­e­sis.

Now, I do be­lieve that if you set up the prior cor­rectly then it won’t hap­pen, thanks to a mechanism like: Alpha knows that in case of dan­ger­ous un­cer­tainty it is safe to fall back on some “neu­tral” course of ac­tion plus query the user (in spe­cific, safe, ways). But this ex­actly shows that in­tent-al­ign­ment is not enough and you need fur­ther as­sump­tions.

More­over, the lat­ter already pro­duced vi­able di­rec­tions for math­e­mat­i­cal for­mal­iza­tion, and the former has not (AFAIK).

I guess you wouldn’t count uni­ver­sal­ity. Over­all I agree.

Be­sides the fact as­crip­tion uni­ver­sal­ity is not for­mal­ized, why is it equiv­a­lent to in­tent-al­ign­ment? Maybe I’m miss­ing some­thing.

I’m rel­a­tively pes­simistic about math­e­mat­i­cal for­mal­iza­tion.

I am cu­ri­ous whether you can spec­ify, as con­cretely as pos­si­ble, what type of math­e­mat­i­cal re­sult would you have to see in or­der to sig­nifi­cantly up­date away from this opinion.

I do want to note that all of these re­quire you to make as­sump­tions of the form, “if there are traps, ei­ther the user or the agent already knows about them” and so on, in or­der to avoid no-free-lunch the­o­rems.

No, I make no such assumption. A bound on subjective regret ensures that running the AI is a nearly-optimal strategy from the user’s subjective perspective. It is neither needed nor possible to prove that the AI can never enter a trap. For example, the AI is immune to acausal attacks to the extent that the user believes that the AI is not inside Beta’s simulation. On the other hand, if the user believes that the simulation hypothesis needs to be taken into account, then the scenario amounts to legitimate acausal bargaining (which has its own complications to do with decision/game theory, but that’s mostly a separate concern).
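One way to read the claim "nearly-optimal from the user's subjective perspective" is as a bound of the following shape. This is a rough sketch, not the definition used in the proposals mentioned above: every symbol is an assumption introduced here for illustration, and the "dynamic" aspect (the user's beliefs changing over time) is left out.

```latex
% Sketch only; all symbols are illustrative assumptions.
% \zeta : the user's prior over environments \mu
% U     : the user's utility
% \pi   : a policy, \pi_{\mathrm{AI}} : the policy implemented by the AI
\[
\mathrm{Reg}(\pi_{\mathrm{AI}};\zeta)
  \;=\; \max_{\pi}\; \mathbb{E}_{\mu\sim\zeta}\,\mathbb{E}^{\pi}_{\mu}[U]
  \;-\; \mathbb{E}_{\mu\sim\zeta}\,\mathbb{E}^{\pi_{\mathrm{AI}}}_{\mu}[U]
  \;\le\; \epsilon
\]
```

Under this reading the guarantee is an expectation over the user's prior. It says nothing about any single environment taken alone, which is consistent with not proving that the AI can never enter a trap.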

• A bound on sub­jec­tive re­gret en­sures that run­ning the AI is a nearly-op­ti­mal strat­egy from the user’s sub­jec­tive per­spec­tive.

Sorry, that’s right. Fwiw, I do think sub­jec­tive re­gret bounds are sig­nifi­cantly bet­ter than the thing I meant by defi­ni­tion-op­ti­miza­tion.

It is pos­si­ble that Alpha can­not pre­dict it, be­cause in Beta-simu­la­tion-world the user would con­firm the ir­re­versible ac­tion. It is also pos­si­ble that the user would con­firm the ir­re­versible ac­tion in the real world be­cause the user is be­ing ma­nipu­lated, and what­ever defenses we put in place against ma­nipu­la­tion are thrown off by the simu­la­tion hy­poth­e­sis.

Why doesn’t this also ap­ply to sub­jec­tive re­gret bounds?

My guess at your an­swer is that Alpha wouldn’t take the ir­re­versible ac­tion as long as the user be­lieves that Alpha is not in Beta-simu­la­tion-world. I would amend that to say that Alpha has to know that [the user doesn’t be­lieve that Alpha is in Beta-simu­la­tion-world]. But if Alpha knows that, then surely Alpha can pre­dict that the user would not con­firm the ir­re­versible ac­tion?

It seems like for sub­jec­tive re­gret bounds, avoid­ing this sce­nario de­pends on your prior already “know­ing” that the user thinks that Alpha is not in Beta-simu­la­tion-world (per­haps by ex­clud­ing Beta-simu­la­tions). If that’s true, you could do the same thing with in­tent al­ign­ment /​ cor­rigi­bil­ity.

Be­sides the fact as­crip­tion uni­ver­sal­ity is not for­mal­ized, why is it equiv­a­lent to in­tent-al­ign­ment? Maybe I’m miss­ing some­thing.

It isn’t equiv­a­lent to in­tent al­ign­ment; but it is meant to be used as part of an ar­gu­ment for safety, though I guess it could be used in defi­ni­tion-op­ti­miza­tion too, so never mind.

I am cu­ri­ous whether you can spec­ify, as con­cretely as pos­si­ble, what type of math­e­mat­i­cal re­sult would you have to see in or­der to sig­nifi­cantly up­date away from this opinion.

That is hard to say. I would want to have the re­ac­tion “oh, if I built that sys­tem, I ex­pect it to be safe and com­pet­i­tive”. Most ex­ist­ing math­e­mat­i­cal re­sults do not seem to be com­pet­i­tive, as they get their guaran­tees by do­ing some­thing that in­volves a search over the en­tire hy­poth­e­sis space.

I could also imag­ine be­ing pretty in­ter­ested in a math­e­mat­i­cal defi­ni­tion of safety that I thought ac­tu­ally cap­tured “safety” with­out “pass­ing the buck”. I think sub­jec­tive re­gret bounds and CIRL both make some progress on this, but some­what “pass the buck” by re­quiring a well-speci­fied hy­poth­e­sis space for re­wards /​ be­liefs /​ ob­ser­va­tion mod­els.

Tbc, I also don’t think in­tent al­ign­ment will lead to a math­e­mat­i­cal for­mal­iza­tion I’m happy with—it “passes the buck” to the prob­lem of defin­ing what “try­ing” is, or what “cor­rigi­bil­ity” is.

• It is pos­si­ble that Alpha can­not pre­dict it, be­cause in Beta-simu­la­tion-world the user would con­firm the ir­re­versible ac­tion. It is also pos­si­ble that the user would con­firm the ir­re­versible ac­tion in the real world be­cause the user is be­ing ma­nipu­lated, and what­ever defenses we put in place against ma­nipu­la­tion are thrown off by the simu­la­tion hy­poth­e­sis.

Why doesn’t this also ap­ply to sub­jec­tive re­gret bounds?

In or­der to get a sub­jec­tive re­gret bound you need to con­sider an ap­pro­pri­ate prior. The way I ex­pect it to work is, the prior guaran­tees that some ac­tions are safe in the short-term: for ex­am­ple, do­ing noth­ing to the en­vi­ron­ment and ask­ing only suffi­ciently quan­tilized queries from the user (see this for one toy model of how “safe in the short-term” can be for­mal­ized). There­fore, Beta can­not at­tack with a hy­poth­e­sis that will force Alpha to act with­out con­sult­ing the user, since that hy­poth­e­sis would fall out­side the prior.
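For readers unfamiliar with quantilization, the idea can be sketched as follows: instead of picking the highest-scoring option, pick at random from the top q-fraction of options drawn from a trusted base distribution. The code below is a generic illustration under stated assumptions, not the mechanism of Dialogic RL; the function name `quantilize`, the candidate queries, and the scoring function are all hypothetical.

```python
import math
import random
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def quantilize(
    candidates: Sequence[T],
    utility: Callable[[T], float],
    q: float,
    rng: random.Random,
) -> T:
    """Pick uniformly at random from the top q-fraction of `candidates` by `utility`.

    `candidates` are assumed to be samples from a trusted base distribution
    (for instance, queries a human might plausibly ask). q = 1 reproduces the
    base samples; small q approaches plain maximisation.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    ranked = sorted(candidates, key=utility, reverse=True)
    top_k = max(1, math.ceil(q * len(ranked)))
    return rng.choice(ranked[:top_k])


# Hypothetical usage: eight candidate queries scored by their numeric suffix.
rng = random.Random(0)
queries = [f"query-{i}" for i in range(8)]
chosen = quantilize(queries, lambda s: float(s.split("-")[1]), q=0.25, rng=rng)
print(chosen)  # one of the two highest-scoring queries
```

The design choice is that a larger q keeps the selection closer to the base distribution, limiting how much optimization pressure is applied to the user, at the price of lower expected score.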

Now, you can say “with the right prior in­tent-al­ign­ment also works”. To which I an­swer, sure, but first it means that in­tent-al­ign­ment is in­suffi­cient in it­self, and sec­ond the as­sump­tions about the prior are do­ing all the work. In­deed, we can imag­ine that the on­tol­ogy on which the prior is defined in­cludes a “true re­ward” sym­bol s.t., by defi­ni­tion, the se­man­tics is what­ever the user truly wants. An agent that max­i­mizes ex­pected true re­ward then can be said to be in­tent-al­igned. If it’s do­ing some­thing bad from the user’s per­spec­tive, then it is just an “in­no­cent” mis­take. But, un­less we bake some spe­cific as­sump­tions about the true re­ward into the prior, such an agent can be any­thing at all.

Most ex­ist­ing math­e­mat­i­cal re­sults do not seem to be com­pet­i­tive, as they get their guaran­tees by do­ing some­thing that in­volves a search over the en­tire hy­poth­e­sis space.

This is re­lated to what I call the dis­tinc­tion be­tween “weak” and “strong fea­si­bil­ity”. Weak fea­si­bil­ity means al­gorithms that are polyno­mial time in the num­ber of states and ac­tions, or the num­ber of hy­pothe­ses. Strong fea­si­bil­ity is sup­posed to be some­thing like, polyno­mial time in the de­scrip­tion length of the hy­poth­e­sis.
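The gap between the two notions can be written schematically as below. This is only a sketch: the symbols are assumptions introduced here, with $T$ the running time, $\mathcal{H}$ the hypothesis class, $S$ and $A$ the state and action sets, and $\ell(h)$ the description length of a hypothesis in bits.

```latex
% Sketch only; notation is an illustrative assumption.
\begin{align*}
\text{weak feasibility:}   \quad & T = \mathrm{poly}\big(|S|,\,|A|\big)
                                   \ \text{ or } \ T = \mathrm{poly}\big(|\mathcal{H}|\big) \\
\text{strong feasibility:} \quad & T = \mathrm{poly}\big(\ell(h)\big), \qquad h \in \mathcal{H}
\end{align*}
% If \mathcal{H} contains every hypothesis describable in at most n bits,
% then |\mathcal{H}| can be as large as 2^{n+1}-1, so a running time that is
% polynomial in |\mathcal{H}| may still be exponential in n.
```

This is why an algorithm that enumerates the whole hypothesis space can be weakly feasible while being far from strongly feasible.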

It is true that cur­rently we only have strong fea­si­bil­ity re­sults for rel­a­tively sim­ple hy­poth­e­sis spaces (such as, sup­port vec­tor ma­chines). But, this seems to me just a symp­tom of ad­vances in heuris­tics out­pac­ing the the­ory. I don’t see any rea­son of prin­ci­ple that sig­nifi­cantly limits the strong fea­si­bil­ity re­sults we can ex­pect. In­deed, we already have some ad­vances in pro­vid­ing a the­o­ret­i­cal ba­sis for deep learn­ing.

However, I specifically don’t want to work on strong feasibility results, since there is a significant chance they would lead to breakthroughs in capability. Instead, I prefer studying safety on the weak feasibility level until we understand everything important on this level, and only then trying to extend it to strong feasibility. This creates somewhat of a conundrum where apparently the one thing that can convince you (and other people?) is the thing I don’t think should be done soon.

I could also imag­ine be­ing pretty in­ter­ested in a math­e­mat­i­cal defi­ni­tion of safety that I thought ac­tu­ally cap­tured “safety” with­out “pass­ing the buck”. I think sub­jec­tive re­gret bounds and CIRL both make some progress on this, but some­what “pass the buck” by re­quiring a well-speci­fied hy­poth­e­sis space for re­wards /​ be­liefs /​ ob­ser­va­tion mod­els.

Can you ex­plain what you mean here? I agree that just say­ing “sub­jec­tive re­gret bound” is not enough, we need to un­der­stand all the as­sump­tions the prior should satisfy, re­flect­ing con­sid­er­a­tions such as, what kind of queries can or can­not ma­nipu­late the user. Hence the use of quan­tiliza­tion and de­bate in Dialogic RL, for ex­am­ple.

• To which I an­swer, sure, but first it means that in­tent-al­ign­ment is in­suffi­cient in it­self, and sec­ond the as­sump­tions about the prior are do­ing all the work.

I com­pletely agree with this, but isn’t this also true of sub­jec­tive re­gret bounds /​ defi­ni­tion-op­ti­miza­tion? Like, when you write (em­pha­sis mine)

There­fore, Beta can­not at­tack with a hy­poth­e­sis that will force Alpha to act with­out con­sult­ing the user, since that hy­poth­e­sis would fall out­side the prior.

Isn’t the as­sump­tion about the prior “do­ing all the work”?

Maybe your point is that there are failure modes that aren’t cov­ered by in­tent al­ign­ment, in which case I agree, but also it seems like the OP very ex­plic­itly said this in many places. Just pick­ing one sen­tence (em­pha­sis mine):

An al­igned AI would try to figure out which thing is right, and like a hu­man it may or may not suc­ceed.

I don’t see any rea­son of prin­ci­ple that sig­nifi­cantly limits the strong fea­si­bil­ity re­sults we can ex­pect.

And mean­while I think very messy real world do­mains al­most always limit strong fea­si­bil­ity re­sults. To the ex­tent that you want your al­gorithms to do vi­sion or NLP, I think strong fea­si­bil­ity re­sults will have to talk about the en­vi­ron­ment; it seems quite in­fea­si­ble to do this with the real world.

That said, most of this be­lief comes from the fact that em­piri­cally it seems like the­ory of­ten breaks down when it hits the real world. The ab­stract ar­gu­ment is an at­tempt to ex­plain it; but I wouldn’t have much faith in the ab­stract ar­gu­ment by it­self (which is try­ing to quan­tify over all pos­si­ble ways of get­ting a strong fea­si­bil­ity re­sult).

How­ever, I speci­fi­cally don’t want to work on strong fea­si­bil­ity re­sults, since there is a sig­nifi­cant chance they would lead to break­throughs in ca­pa­bil­ity.

Idk, you could have a nondis­clo­sure-by-de­fault policy if you were wor­ried about this. Maybe this can’t work for you though. (As an aside, I hope this is what MIRI is do­ing, but they prob­a­bly aren’t.)

Can you ex­plain what you mean here?

Ba­si­cally what you said right af­ter:

I agree that just say­ing “sub­jec­tive re­gret bound” is not enough, we need to un­der­stand all the as­sump­tions the prior should satisfy, re­flect­ing con­sid­er­a­tions such as, what kind of queries can or can­not ma­nipu­late the user.
• ...first it means that in­tent-al­ign­ment is in­suffi­cient in it­self, and sec­ond the as­sump­tions about the prior are do­ing all the work.

I com­pletely agree with this, but isn’t this also true of sub­jec­tive re­gret bounds /​ defi­ni­tion-op­ti­miza­tion?

The idea is, we will solve the alignment problem by (i) formulating a suitable learning protocol, (ii) formalizing a set of assumptions about reality and (iii) proving that under these assumptions, this learning protocol has a reasonable subjective regret bound. So, the role of the subjective regret bound is making sure that what we came up with in i+ii is sufficient, and also guiding the search there. The subjective regret bound does not tell us whether particular assumptions are realistic: for this we need to use common sense and knowledge outside of theoretical computer science (such as: physics, cognitive science, experimental ML research, evolutionary biology...).

Maybe your point is that there are failure modes that aren’t cov­ered by in­tent al­ign­ment, in which case I agree, but also it seems like the OP very ex­plic­itly said this in many places.

I dis­agree with the OP that (em­pha­sis mine):

I think that us­ing a broader defi­ni­tion (or the de re read­ing) would also be defen­si­ble, but I like it less be­cause it in­cludes many sub­prob­lems that I think (a) are much less ur­gent, (b) are likely to in­volve to­tally differ­ent tech­niques than the ur­gent part of al­ign­ment.

I think that in­tent al­ign­ment is too ill-defined, and to the ex­tent it is well-defined it is a very weak con­di­tion, that is not suffi­cient to ad­dress the ur­gent core of the prob­lem.

And mean­while I think very messy real world do­mains al­most always limit strong fea­si­bil­ity re­sults. To the ex­tent that you want your al­gorithms to do vi­sion or NLP, I think strong fea­si­bil­ity re­sults will have to talk about the en­vi­ron­ment; it seems quite in­fea­si­ble to do this with the real world.

I don’t think strong feasibility results will have to talk about the environment, or rather, they will have to talk about it on a very high level of abstraction. For example, imagine that we prove that stochastic gradient descent on a neural network with a particular architecture efficiently agnostically learns any function in some space, such that as the number of neurons grows, this space efficiently approximates any function satisfying some kind of simple and natural “smoothness” condition (an example motivated by already known results). This is a strong feasibility result. We can then debate whether using such a smooth approximation is sufficient for superhuman performance, but establishing this requires different tools, like I said above.

The way I imag­ine it, AGI the­ory should ul­ti­mately ar­rive at some class of pri­ors that are on the one hand rich enough to de­serve to be called “gen­eral” (or, prac­ti­cally speak­ing, rich enough to pro­duce su­per­hu­man agents) and on the other hand nar­row enough to al­low for effi­cient al­gorithms. For ex­am­ple the Solomonoff prior is too rich, whereas a prior that (say) de­scribes ev­ery­thing in terms of an MDP with a small num­ber of states is too nar­row. Find­ing the golden path in be­tween is one of the big open prob­lems.

That said, most of this be­lief comes from the fact that em­piri­cally it seems like the­ory of­ten breaks down when it hits the real world.

Does it? I am not sure why you have this impression. Certainly there are phenomena in the real world that we don’t yet have enough theory to understand, and certainly a given theory will fail in domains where its assumptions are not justified (where “fail” and “justified” can be a matter of degree). And yet, theory obviously played and plays a central role in science, so I don’t understand whence the fatalism.

How­ever, I speci­fi­cally don’t want to work on strong fea­si­bil­ity re­sults, since there is a sig­nifi­cant chance they would lead to break­throughs in ca­pa­bil­ity.

Idk, you could have a nondis­clo­sure-by-de­fault policy if you were wor­ried about this. Maybe this can’t work for you though.

That seems like it would be an ex­tremely not cost-effec­tive way of mak­ing progress. I would in­vest a lot of time and effort into some­thing that would only be dis­closed to the se­lect few, for the sole pur­pose of con­vinc­ing them of some­thing (as­sum­ing they are even in­ter­ested to un­der­stand it). I imag­ine that solv­ing AI risk will re­quire col­lab­o­ra­tion among many peo­ple, in­clud­ing shar­ing ideas and build­ing on other peo­ple’s ideas, and that’s not re­al­is­tic with­out pub­lish­ing. Cer­tainly I am not go­ing to write a Friendly AI on my home lap­top :)

• I think that in­tent al­ign­ment is too ill-defined, and to the ex­tent it is well-defined it is a very weak con­di­tion, that is not suffi­cient to ad­dress the ur­gent core of the prob­lem.

Okay, so there seem to be two dis­agree­ments:

• How bad is it that intent alignment is ill-defined?

• Is work on intent alignment urgent?

The first one seems pri­mar­ily about our dis­agree­ments on the util­ity of the­ory, which I’ll get to later.

For the sec­ond one, I don’t know what your ar­gu­ment is that the non-in­tent-al­ign­ment work is ur­gent. I agree that the simu­la­tion ex­am­ple you give is an ex­am­ple of how flawed episte­mol­ogy can sys­tem­at­i­cally lead to x-risk. I don’t see the ar­gu­ment that it is very likely (maybe the first few AGIs don’t think about simu­la­tions; maybe it’s im­pos­si­ble to con­struct such a con­vinc­ing hy­poth­e­sis). I es­pe­cially don’t see the ar­gu­ment that it is more likely than the failure mode in which a goal-di­rected AGI is op­ti­miz­ing for some­thing differ­ent from what hu­mans want.

(You might re­spond that in­tent al­ign­ment brings risk down from say 10% to 3%, whereas your agenda brings risk down from 10% to 1%. My re­sponse would be that once we have suc­cess­fully figured out in­tent al­ign­ment to bring risk from 10% to 3%, we can then fo­cus on build­ing a good prior to bring the risk down from 3% to 1%. All num­bers here are very made up.)

For ex­am­ple, imag­ine that we prove that stochas­tic gra­di­ent de­scent on a neu­ral net­work with par­tic­u­lar ar­chi­tec­ture effi­ciently ag­nos­ti­cally learns any func­tion in some space, such that as the num­ber of neu­rons grows, this space effi­ciently ap­prox­i­mates any func­tion satis­fy­ing some kind of sim­ple and nat­u­ral “smooth­ness” con­di­tion (an ex­am­ple mo­ti­vated by already known re­sults). This is a strong fea­si­bil­ity re­sult.

My guess is that any such re­sult will ei­ther re­quire sam­ples ex­po­nen­tial in the di­men­sion­al­ity of the in­put space (pro­hibitively ex­pen­sive) or the sim­ple and nat­u­ral con­di­tion won’t hold for the vast ma­jor­ity of cases that neu­ral net­works have been ap­plied to to­day.

I don’t find smooth­ness con­di­tions in par­tic­u­lar very com­pel­ling, be­cause many im­por­tant func­tions are not smooth (e.g. most things in­volv­ing an if con­di­tion).

I am not sure why you have this im­pres­sion.

Con­sider this ex­am­ple:

You are a bridge de­signer. You make the as­sump­tion that forces on the bridge will never ex­ceed some value K (nec­es­sary be­cause you can’t be ro­bust against un­bounded forces). You prove your de­sign will never col­lapse given this as­sump­tion. Your bridge col­lapses any­way be­cause of res­o­nance.

The broader point is that when the en­vi­ron­ment has lots of com­pli­cated in­ter­ac­tion effects, and you must make as­sump­tions, it is very hard to find as­sump­tions that ac­tu­ally hold.

And yet, the­ory ob­vi­ously played and plays a cen­tral role in sci­ence, so I don’t un­der­stand whence the fatal­ism.

The ar­eas of sci­ence in which the­ory is most cen­tral (e.g. physics) don’t re­quire as­sump­tions about some com­pli­cated stuff; they sim­ply aim to de­scribe ob­ser­va­tions. It’s re­ally the as­sump­tions that make me pes­simistic, which is why it would be a sig­nifi­cant up­date if I saw:

a math­e­mat­i­cal defi­ni­tion of safety that I thought ac­tu­ally cap­tured “safety” with­out “pass­ing the buck”

It would similarly up­date me if you had a piece of code that (per­haps with ar­bi­trary amounts of com­pute) could take in an AI sys­tem and out­put “safe” or “un­safe”, and I would trust that out­put. (I’d ex­pect that a math­e­mat­i­cal defi­ni­tion could be turned into such a piece of code if it doesn’t “pass the buck”.)

You might re­spond that in­tent al­ign­ment re­quires as­sump­tions too, which I agree with, but the the­ory-based ap­proach re­quires you to limit your as­sump­tions to things that can be writ­ten down in math (e.g. this func­tion is K-Lip­s­chitz) whereas a non-the­ory-based ap­proach can use “hand­wavy” as­sump­tions (e.g. a hu­man think­ing for a day is safe), which dras­ti­cally opens up the space of op­tions and makes it more likely that you can find an as­sump­tion that is ac­tu­ally mostly true.

That seems like it would be an ex­tremely not cost-effec­tive way of mak­ing progress.

Yeah, I broadly agree; I mostly don’t un­der­stand MIRI’s po­si­tion and thought you might share it, but it seems you don’t. I agree that over­all it’s a tough prob­lem. My per­sonal po­si­tion would be to do it pub­li­cly any­way; it seems way bet­ter to have an ap­proach to AI that we un­der­stand than the cur­rent ap­proach, even if it short­ens timelines. (Con­sider the unilat­er­al­ist curse; but also con­sider that other peo­ple do agree with me, if not the peo­ple at MIRI /​ LessWrong.)

• For the sec­ond one, I don’t know what your ar­gu­ment is that the non-in­tent-al­ign­ment work is ur­gent. I agree that the simu­la­tion ex­am­ple you give is an ex­am­ple of how flawed episte­mol­ogy can sys­tem­at­i­cally lead to x-risk. I don’t see the ar­gu­ment that it is very likely.

First, even working on unlikely risks can be urgent, if the risk is great and the time needed to solve it might be long enough compared to the timeline until the risk. Second, I think this example shows that it is far from straightforward to even informally define what intent-alignment is. Hence, I am skeptical about the usefulness of intent-alignment.

For a more “mun­dane” ex­am­ple, take IRL. Is IRL in­tent al­igned? What if its as­sump­tions about hu­man be­hav­ior are in­ad­e­quate and it ends up in­fer­ring an en­tirely wrong re­ward func­tion? Is it still in­tent-al­igned since it is try­ing to do what the user wants, it is just wrong about what the user wants? Where is the line be­tween “be­ing wrong about what the user wants” and op­ti­miz­ing some­thing com­pletely un­re­lated to what the user wants?

It seems like in­tent-al­ign­ment de­pends on our in­ter­pre­ta­tion of what the al­gorithm does, rather than only on the al­gorithm it­self. But ac­tual safety is not a mat­ter of in­ter­pre­ta­tion, at least not in this sense.

For ex­am­ple, imag­ine that we prove that stochas­tic gra­di­ent de­scent on a neu­ral net­work with par­tic­u­lar ar­chi­tec­ture effi­ciently ag­nos­ti­cally learns any func­tion in some space, such that as the num­ber of neu­rons grows, this space effi­ciently ap­prox­i­mates any func­tion satis­fy­ing some kind of sim­ple and nat­u­ral “smooth­ness” con­di­tion (an ex­am­ple mo­ti­vated by already known re­sults). This is a strong fea­si­bil­ity re­sult.

My guess is that any such re­sult will ei­ther re­quire sam­ples ex­po­nen­tial in the di­men­sion­al­ity of the in­put space (pro­hibitively ex­pen­sive) or the sim­ple and nat­u­ral con­di­tion won’t hold for the vast ma­jor­ity of cases that neu­ral net­works have been ap­plied to to­day.

I don’t know why you think so, but at least this is a good crux since it seems entirely falsifiable. In any case, exponential sample complexity definitely doesn’t count as “strong feasibility”.

I don’t find smooth­ness con­di­tions in par­tic­u­lar very com­pel­ling, be­cause many im­por­tant func­tions are not smooth (e.g. most things in­volv­ing an if con­di­tion).

Smooth­ness is just an ex­am­ple, it is not nec­es­sar­ily the fi­nal an­swer. But also, in clas­sifi­ca­tion prob­lems smooth­ness usu­ally trans­lates to a mar­gin re­quire­ment (the classes have to be sep­a­rated with suffi­cient dis­tance). So, in some sense smooth­ness al­lows for “if con­di­tions” as long as you’re not too sen­si­tive to the thresh­old.
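To make the margin point concrete, here is a minimal sketch (illustrative only; the function names, the threshold `t`, and the margin value are assumptions, not taken from any specific result). It compares a hard threshold rule with a piecewise-linear ramp whose slope is bounded by 1 / margin, and checks that the two agree on every sample point that stays at least margin / 2 away from the threshold.

```python
def threshold_rule(x: float, t: float) -> float:
    """The hard 'if condition': 1.0 at or above the threshold t, else 0.0."""
    return 1.0 if x >= t else 0.0


def smooth_rule(x: float, t: float, margin: float) -> float:
    """Piecewise-linear ramp of width `margin` centred on t.

    Its slope never exceeds 1 / margin, so it is (1 / margin)-Lipschitz.
    """
    return min(1.0, max(0.0, (x - t) / margin + 0.5))


t, margin = 0.0, 0.2  # hypothetical values, chosen only for illustration

# Sample points that respect the margin: none closer than margin / 2 to t.
points = [i / 100 for i in range(-100, 101) if abs(i) >= 10]

# On such points the smooth rule reproduces the hard threshold exactly.
agrees = all(smooth_rule(x, t, margin) == threshold_rule(x, t) for x in points)
```

The smaller the margin, the larger the Lipschitz constant needed, which is the sense in which the rule must not be too sensitive to the threshold.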

You are a bridge de­signer. You make the as­sump­tion that forces on the bridge will never ex­ceed some value K (nec­es­sary be­cause you can’t be ro­bust against un­bounded forces). You prove your de­sign will never col­lapse given this as­sump­tion. Your bridge col­lapses any­way be­cause of res­o­nance.

I don’t un­der­stand this ex­am­ple. If the bridge can never col­lapse as long as the out­side forces don’t ex­ceed K, then res­o­nance is cov­ered as well (as long as it is pro­duced by forces be­low K). Maybe you meant that the out­side forces are also as­sumed to be sta­tion­ary.

The broader point is that when the en­vi­ron­ment has lots of com­pli­cated in­ter­ac­tion effects, and you must make as­sump­tions, it is very hard to find as­sump­tions that ac­tu­ally hold.

Nev­er­the­less most en­g­ineer­ing pro­jects make heavy use of the­ory. I don’t un­der­stand why you think that AGI must be differ­ent?

The issue of assumptions in strong feasibility is equivalent to the question of whether powerful agents require highly informed priors. If you need complex assumptions then effectively you have a highly informed prior, whereas if your prior is uninformed then it corresponds to simple assumptions. I think that Hanson (for example) believes that it is indeed necessary to have a highly informed prior, which is why powerful AI algorithms will be complex (since they have to encode this prior) and progress in AI will be slow (since the prior needs to be manually constructed brick by brick). I find this scenario unlikely (for example because humans successfully solve tasks far outside the ancestral environment, so they can’t be relying on genetically built-in priors that much), but not ruled out.

However, I assumed that your position is not Hansonian: correct me if I’m wrong, but I assumed that you believed deep learning or something similar is likely to lead to AGI relatively soon. Even if not, you were skeptical about strong feasibility results even for deep learning, regardless of hypothetical future AI technology. But it doesn’t look like deep learning relies on highly informed priors. What we have are relatively simple algorithms that can, with relatively small (or even no) adaptations, solve problems in completely different domains (image processing, audio processing, NLP, playing many very different games, protein folding...). So, how is it possible that all of these domains have some highly complex property that they share, and that is somehow encoded in the deep learning algorithm?

It’s re­ally the as­sump­tions that make me pes­simistic, which is why it would be a sig­nifi­cant up­date if I saw a math­e­mat­i­cal defi­ni­tion of safety that I thought ac­tu­ally cap­tured “safety” with­out “pass­ing the buck”

I’m cu­ri­ous whether prov­ing a weakly fea­si­ble sub­jec­tive re­gret bound un­der as­sump­tions that you agree are oth­er­wise re­al­is­tic qual­ifies or not?

...but the the­ory-based ap­proach re­quires you to limit your as­sump­tions to things that can be writ­ten down in math (e.g. this func­tion is K-Lip­s­chitz) whereas a non-the­ory-based ap­proach can use “hand­wavy” as­sump­tions (e.g. a hu­man think­ing for a day is safe), which dras­ti­cally opens up the space of op­tions and makes it more likely that you can find an as­sump­tion that is ac­tu­ally mostly true.

I can quite eas­ily imag­ine how “hu­man think­ing for a day is safe” can be a math­e­mat­i­cal as­sump­tion. In gen­eral, which as­sump­tions are for­mal­iz­able de­pends on the on­tol­ogy of your math­e­mat­i­cal model (that is, which real-world con­cepts cor­re­spond to the “atomic” in­gre­di­ents of your model). The choice of on­tol­ogy is part of draw­ing the line be­tween what you want your math­e­mat­i­cal the­ory to prove and what you want to bring in as out­side as­sump­tions. Like I said be­fore, this line definitely has to be drawn some­where, but it doesn’t at all fol­low that the en­tire ap­proach is use­less.

• First, even work­ing on un­likely risks can be ur­gent, if the risk is great and the time needed to solve it might be long enough com­pared to the timeline un­til the risk.

Okay. What’s the ar­gu­ment that the risk is great (I as­sume this means “very bad” and not “very likely” since by hy­poth­e­sis it is un­likely), or that we need a lot of time to solve it?

Second, I think this example shows that it is far from straightforward to even informally define what intent-alignment is.

I agree with this; I don’t think this is one of our cruxes. (I do think that in most cases, if we have all the in­for­ma­tion about the situ­a­tion, it will be fairly clear whether some­thing is in­tent al­igned or not, but cer­tainly there are situ­a­tions in which it’s am­bigu­ous. I think cor­rigi­bil­ity is bet­ter-in­for­mally-defined, though still there will be am­bigu­ous situ­a­tions.)

Is IRL in­tent al­igned?

Depends on the de­tails, but the way you de­scribe it, no, it isn’t. (Though I can see the fuzzi­ness here.) I think it is es­pe­cially clear that it is not cor­rigible.

It seems like in­tent-al­ign­ment de­pends on our in­ter­pre­ta­tion of what the al­gorithm does, rather than only on the al­gorithm it­self. But ac­tual safety is not a mat­ter of in­ter­pre­ta­tion, at least not in this sense.

Yup, I agree (with the caveat that it doesn’t have to be a hu­man’s in­ter­pre­ta­tion). Nonethe­less, an in­ter­pre­ta­tion of what the al­gorithm does can give you a lot of ev­i­dence about whether or not some­thing is ac­tu­ally safe.

If the bridge can never col­lapse as long as the out­side forces don’t ex­ceed K, then res­o­nance is cov­ered as well (as long as it is pro­duced by forces be­low K).

I meant that K was set con­sid­er­ing wind forces, cars, etc. and was set too low to ac­count for res­o­nance, be­cause you didn’t think about res­o­nance be­fore­hand.

(I guess res­o­nance doesn’t in­volve large forces, it in­volves co­or­di­nated forces. The point is just that it seems very plau­si­ble that some­one might de­sign a the­o­ret­i­cal model of the en­vi­ron­ment in which the bridge is safe, but that model ne­glects to in­clude res­o­nance be­cause the de­signer didn’t think of it.)

Nev­er­the­less most en­g­ineer­ing pro­jects make heavy use of the­ory.

I’m not deny­ing that? I’m not ar­gu­ing against the­ory in gen­eral; I’m ar­gu­ing against the­o­ret­i­cal safety guaran­tees. I think in prac­tice our con­fi­dence in safety of­ten comes from em­piri­cal tests.

I’m cu­ri­ous whether prov­ing a weakly fea­si­ble sub­jec­tive re­gret bound un­der as­sump­tions that you agree are oth­er­wise re­al­is­tic qual­ifies or not?

Probably? Honestly, I don’t think you even need to prove the subjective regret bound; if you wrote down assumptions that I agree are realistic and capture safety (such that you could write code that determines whether or not an AI system is safe) that alone would qualify. It would be fine if it sometimes said things are unsafe when they are safe, as long as it isn’t too conservative; a weak feasibility result would help show that it isn’t too conservative.

I can quite eas­ily imag­ine how “hu­man think­ing for a day is safe” can be a math­e­mat­i­cal as­sump­tion.

Agreed, but if you want to even­tu­ally talk about neu­ral nets so that you are talk­ing about the AI sys­tem you are ac­tu­ally build­ing, you need to use the neu­ral net on­tol­ogy, and then “hu­man think­ing for a day” is not some­thing you can ex­press.

• Okay. What’s the ar­gu­ment that the risk is great (I as­sume this means “very bad” and not “very likely” since by hy­poth­e­sis it is un­likely), or that we need a lot of time to solve it?

The reasons the risk is great are standard arguments, so I am a little confused why you ask about this. The setup effectively allows a superintelligent malicious agent (Beta) access to our universe, which can result in extreme optimization of our universe towards inhuman values and tremendous loss of value-according-to-humans. The reason we need a lot of time to solve it is simply that (i) it doesn’t seem to be an instance of some standard problem type which we have standard tools to solve and (ii) some people have been thinking about these questions for a while now and have not come up with an easy solution.

It seems like in­tent-al­ign­ment de­pends on our in­ter­pre­ta­tion of what the al­gorithm does, rather than only on the al­gorithm it­self. But ac­tual safety is not a mat­ter of in­ter­pre­ta­tion, at least not in this sense.

Yup, I agree (with the caveat that it doesn’t have to be a hu­man’s in­ter­pre­ta­tion). Nonethe­less, an in­ter­pre­ta­tion of what the al­gorithm does can give you a lot of ev­i­dence about whether or not some­thing is ac­tu­ally safe.

Then, I don’t un­der­stand why you be­lieve that work on any­thing other than in­tent-al­ign­ment is much less ur­gent?

The point is just that it seems very plau­si­ble that some­one might de­sign a the­o­ret­i­cal model of the en­vi­ron­ment in which the bridge is safe, but that model ne­glects to in­clude res­o­nance be­cause the de­signer didn’t think of it.

“Res­o­nance” is not some­thing you need to ex­plic­itly in­clude in your model, it is just a con­se­quence of the equa­tions of mo­tion for an os­cilla­tor. This is ac­tu­ally an im­por­tant les­son about why we need the­ory: to con­struct a use­ful the­o­ret­i­cal model you don’t need to know all pos­si­ble failure modes, you only need a rea­son­able set of as­sump­tions.

I think in prac­tice our con­fi­dence in safety of­ten comes from em­piri­cal tests.

I think that in prac­tice our con­fi­dence in safety comes from a com­bi­na­tion of the­ory and em­piri­cal tests. And, the higher the stakes and the more un­usual the en­deavor, the more the­ory you need. If you’re do­ing some­thing low stakes or some­thing very similar to things that have been tried many times be­fore, you can rely on trial and er­ror. But if you’re send­ing a space­ship to Mars (or mak­ing a su­per­in­tel­li­gent AI), trial and er­ror is too ex­pen­sive. Yes, you will test the mod­ules on Earth in con­di­tions as similar to the real en­vi­ron­ment as you can (re­spec­tively, you will do ex­per­i­ments with nar­row AI). But ul­ti­mately, you need the­o­ret­i­cal knowl­edge to know what can be safely in­ferred from these ex­per­i­ments. Without the­ory you can­not ex­trap­o­late.

I can quite eas­ily imag­ine how “hu­man think­ing for a day is safe” can be a math­e­mat­i­cal as­sump­tion.

Agreed, but if you want to even­tu­ally talk about neu­ral nets so that you are talk­ing about the AI sys­tem you are ac­tu­ally build­ing, you need to use the neu­ral net on­tol­ogy, and then “hu­man think­ing for a day” is not some­thing you can ex­press.

I disagree. For example, suppose that we have a theorem saying that an ANN with particular architecture and learning algorithm can learn any function inside some space F with given accuracy. And, suppose that “human thinking for a day” is represented by a mathematical function that we assume to be inside F and that we assume to be “safe” in some formal sense (for example, it computes an action that doesn’t lose much long-term value). Then, your model can prove that imitation learning applied to human thinking for a day is safe. Of course, this example is trivial (modulo the theorem about ANNs), but for more complex settings we can get results that are non-trivial.

• The reasons the risk is great are standard arguments, so I am a little confused why you ask about this.

Sorry, I meant what are the reasons that the risk is greater than the risk from a failure of intent alignment? The question was meant to be compared to the counterfactual of work on intent alignment, since the underlying disagreement is about comparing work on intent alignment to other AI safety work. Similarly for the question about why it might take a long time to solve.

Then, I don’t un­der­stand why you be­lieve that work on any­thing other than in­tent-al­ign­ment is much less ur­gent?

I’m claiming that in­tent al­ign­ment cap­tures a large pro­por­tion of pos­si­ble failure modes, that seem par­tic­u­larly amenable to a solu­tion.

Imag­ine that a fair coin was go­ing to be flipped 21 times, and you need to say whether there were more heads than tails. By de­fault you see noth­ing, but you could try to build two ma­chines:

1. Ma­chine A is easy to build but not very ro­bust; it re­ports the out­come of each coin flip but has a 1% chance of er­ror for each coin flip.

2. Ma­chine B is hard to build but very ro­bust; it re­ports the out­come of each coin flip perfectly. How­ever, you only have a 50% chance of build­ing it by the time you need it.

In this situ­a­tion, ma­chine A is a much bet­ter plan.

(The ex­am­ple is meant to illus­trate the phe­nomenon by which you might want to choose a riskier but eas­ier-to-cre­ate op­tion; it’s not meant to prop­erly model in­tent al­ign­ment vs. other stuff on other axes.)
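For what it’s worth, the comparison can be worked out exactly. The sketch below assumes that machine A’s errors are independent across flips, that you answer with the majority of A’s reports, and that if machine B is not built in time you are left guessing with a 50% success rate; none of these details is spelled out in the example itself, and the function names are just for illustration.

```python
from math import comb

N_FLIPS = 21
ERROR_RATE = 0.01   # machine A misreports each flip independently (assumption)
P_BUILD_B = 0.5     # chance that machine B exists by the time it is needed


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def p_machine_a_correct(n: int = N_FLIPS, eps: float = ERROR_RATE) -> float:
    """Probability that the majority of A's reports matches the true majority."""
    total = 0.0
    for heads in range(n + 1):
        p_heads = comb(n, heads) / 2**n
        tails = n - heads
        for lost in range(heads + 1):        # true heads reported as tails
            for gained in range(tails + 1):  # true tails reported as heads
                reported = heads - lost + gained
                if (reported > n // 2) == (heads > n // 2):
                    total += (p_heads
                              * binom_pmf(lost, heads, eps)
                              * binom_pmf(gained, tails, eps))
    return total


def p_machine_b_correct(p_build: float = P_BUILD_B) -> float:
    """B is perfect if built; otherwise we see nothing and guess (assumption)."""
    return p_build * 1.0 + (1 - p_build) * 0.5


p_a = p_machine_a_correct()
p_b = p_machine_b_correct()
```

Under these assumptions machine B gives a success probability of exactly 0.75, and the exact sum for machine A can be compared against that directly.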

This is ac­tu­ally an im­por­tant les­son about why we need the­ory: to con­struct a use­ful the­o­ret­i­cal model you don’t need to know all pos­si­ble failure modes, you only need a rea­son­able set of as­sump­tions.

I cer­tainly agree with that. My mo­ti­va­tion in choos­ing this ex­am­ple is that em­piri­cally we should not be able to prove that bridges are safe w.r.t res­o­nance, be­cause in fact they are not safe and do fall when res­o­nance oc­curs. (Maybe to­day bridge-build­ing tech­nol­ogy has ad­vanced such that we are able to do such proofs, I don’t know, but at least in the past that would not have been the case.)

In this case, we ei­ther fail to prove any­thing, or we make un­re­al­is­tic as­sump­tions that do not hold in re­al­ity and get a proof of safety. Similarly, I think in many cases in­volv­ing prop­er­ties about a com­plex real en­vi­ron­ment, your two op­tions are 1. don’t prove things or 2. prove things with un­re­al­is­tic as­sump­tions that don’t hold.

But if you’re send­ing a space­ship to Mars (or mak­ing a su­per­in­tel­li­gent AI), trial and er­ror is too ex­pen­sive. [...] Without the­ory you can­not ex­trap­o­late.

I am not sug­gest­ing that we throw away all logic and make ran­dom ed­its to lines of code and try them out un­til we find a safe AI. I am sim­ply say­ing that our things-that-al­low-us-to-ex­trap­o­late need not be ex­pressed in math with the­o­rems. I don’t build math­e­mat­i­cal the­o­ries of how to write code, and usu­ally don’t prove my code cor­rect; nonethe­less I seem to ex­trap­o­late quite well to new cod­ing prob­lems.

It also sounds like you’re mak­ing a nor­ma­tive claim for proofs; I’m more in­ter­ested in the em­piri­cal claim. (But I might be mis­read­ing you here.)

I dis­agree. For ex­am­ple, [...]

Cer­tainly you can come up with bridg­ing as­sump­tions to bridge be­tween lev­els of ab­strac­tion (in this case the as­sump­tion that “hu­man think­ing for a day” is within F). I would ex­pect that I would find some bridg­ing as­sump­tion im­plau­si­ble in these set­tings.

• I’m claiming that in­tent al­ign­ment cap­tures a large pro­por­tion of pos­si­ble failure modes, that seem par­tic­u­larly amenable to a solu­tion.

Imag­ine that a fair coin was go­ing to be flipped 21 times, and you need to say whether there were more heads than tails. By de­fault you see noth­ing, but you could try to build two ma­chines:

1. Ma­chine A is easy to build but not very ro­bust; it re­ports the out­come of each coin flip but has a 1% chance of er­ror for each coin flip.

2. Machine B is hard to build but very robust; it reports the outcome of each coin flip perfectly. However, you only have a 50% chance of building it by the time you need it.

In this situ­a­tion, ma­chine A is a much bet­ter plan.

I am struggling to understand how this works in practice. For example, consider dialogic RL. It is a scheme intended to solve AI alignment in the strong sense. The intent-alignment thesis seems to say that I should be able to find some proper subset of the features in the scheme which is sufficient for alignment in practice. I can approximately list the set of features as:

1. Ba­sic ques­tion-an­swer protocol

2. Nat­u­ral lan­guage annotation

3. Quan­tiliza­tion of questions

4. De­bate over annotations

5. Deal­ing with no user answer

6. Deal­ing with in­con­sis­tent user answers

7. Deal­ing with chang­ing user beliefs

8. Deal­ing with chang­ing user preferences

9. Self-refer­ence in user beliefs

10. Quan­tiliza­tion of com­pu­ta­tions (to com­bat non-Carte­sian dae­mons, this is not in the origi­nal pro­posal)

11. Re­v­erse questions

12. Trans­la­tion of coun­ter­fac­tu­als from user frame to AI frame

13. User be­liefs about computations

EDIT: 14. Con­fi­dence thresh­old for risky actions

Which of these fea­tures are nec­es­sary for in­tent-al­ign­ment and which are only nec­es­sary for strong al­ign­ment? I can’t tell.

I cer­tainly agree with that. My mo­ti­va­tion in choos­ing this ex­am­ple is that em­piri­cally we should not be able to prove that bridges are safe w.r.t res­o­nance, be­cause in fact they are not safe and do fall when res­o­nance oc­curs.

I am not an ex­pert but I ex­pect that bridges are con­structed so that they don’t en­ter high-am­pli­tude res­o­nance in the rele­vant range of fre­quen­cies (which is an ex­am­ple of us­ing as­sump­tions in our mod­els that need in­de­pen­dent val­i­da­tion). We want bridges that don’t fall, don’t we?

I don’t build math­e­mat­i­cal the­o­ries of how to write code, and usu­ally don’t prove my code correct

On the other hand, I use math­e­mat­i­cal mod­els to write code for ap­pli­ca­tions all the time, with some suc­cess I dare­say. I guess that differ­ent ex­pe­rience pro­duces differ­ent in­tu­itions.

It also sounds like you’re mak­ing a nor­ma­tive claim for proofs; I’m more in­ter­ested in the em­piri­cal claim.

I am mak­ing both claims to some de­gree. I can imag­ine a uni­verse in which the em­piri­cal claim is true, and I con­sider it plau­si­ble (but far from cer­tain) that we live in such a uni­verse. But, even just un­der­stand­ing whether we live in such a uni­verse re­quires build­ing a math­e­mat­i­cal the­ory.

• Which of these fea­tures are nec­es­sary for in­tent-al­ign­ment and which are only nec­es­sary for strong al­ign­ment?

As far as I can tell, 2, 3, 4, and 10 are pro­posed im­ple­men­ta­tions, not fea­tures. (E.g. the fea­ture cor­re­spond­ing to 3 is “doesn’t ma­nipu­late the user” or some­thing like that.) I’m not sure what 9, 11 and 13 are about. For the oth­ers, I’d say they’re all fea­tures that an in­tent-al­igned AI should have; just not in liter­ally all pos­si­ble situ­a­tions. But the im­ple­men­ta­tion you want is some­thing that aims for in­tent al­ign­ment; then be­cause the AI is in­tent al­igned it should have fea­tures 1, 5, 6, 7, 8. Maybe fea­ture 12 is one I think is not cov­ered by in­tent al­ign­ment, but is im­por­tant to have.

I am not an ex­pert but I ex­pect that bridges are con­structed so that they don’t en­ter high-am­pli­tude res­o­nance in the rele­vant range of fre­quen­cies (which is an ex­am­ple of us­ing as­sump­tions in our mod­els that need in­de­pen­dent val­i­da­tion).

This is prob­a­bly true now that we know about res­o­nance (be­cause bridges have fallen down due to res­o­nance); I was ask­ing you to take the per­spec­tive where you haven’t yet seen a bridge fall down from res­o­nance, and so you don’t think about it.

On the other hand, I use math­e­mat­i­cal mod­els to write code for ap­pli­ca­tions all the time, with some suc­cess I dare­say. I guess that differ­ent ex­pe­rience pro­duces differ­ent in­tu­itions.

Maybe I’m fal­ling prey to the typ­i­cal mind fal­lacy, but I re­ally doubt that you use math­e­mat­i­cal mod­els to write code in the way that I mean, and I sus­pect you in­stead mi­s­un­der­stood what I meant.

Like, if I asked you to write code to check if an el­e­ment is pre­sent in an ar­ray, do you prove the­o­rems? I cer­tainly ex­pect that you have an in­tu­itive model of how your pro­gram­ming lan­guage of choice works, and that model in­forms the code that you write, but it seems wrong to me to de­scribe what I do, what all of my stu­dents do, and what I ex­pect you do as us­ing a “math­e­mat­i­cal the­ory of how to write code”.

But, even just un­der­stand­ing whether we live in such a uni­verse re­quires build­ing a math­e­mat­i­cal the­ory.

I’m cu­ri­ous what you think doesn’t re­quire build­ing a math­e­mat­i­cal the­ory? It seems to me that pre­dict­ing whether or not we are doomed if we don’t have a proof of safety is the sort of thing the AI safety com­mu­nity has done a lot of with­out a math­e­mat­i­cal the­ory. (Like, that’s how I in­ter­pret the rocket al­ign­ment and se­cu­rity mind­set posts.)

• As far as I can tell, 2, 3, 4, and 10 are pro­posed im­ple­men­ta­tions, not fea­tures. (E.g. the fea­ture cor­re­spond­ing to 3 is “doesn’t ma­nipu­late the user” or some­thing like that.) I’m not sure what 9, 11 and 13 are about. For the oth­ers, I’d say they’re all fea­tures that an in­tent-al­igned AI should have; just not in liter­ally all pos­si­ble situ­a­tions. But the im­ple­men­ta­tion you want is some­thing that aims for in­tent al­ign­ment; then be­cause the AI is in­tent al­igned it should have fea­tures 1, 5, 6, 7, 8. Maybe fea­ture 12 is one I think is not cov­ered by in­tent al­ign­ment, but is im­por­tant to have.

Hmm. I ap­pre­ci­ate the effort, but I don’t un­der­stand this an­swer. Maybe dis­cussing this point fur­ther is not pro­duc­tive in this for­mat.

I am not an ex­pert but I ex­pect that bridges are con­structed so that they don’t en­ter high-am­pli­tude res­o­nance in the rele­vant range of fre­quen­cies (which is an ex­am­ple of us­ing as­sump­tions in our mod­els that need in­de­pen­dent val­i­da­tion).

This is prob­a­bly true now that we know about res­o­nance (be­cause bridges have fallen down due to res­o­nance); I was ask­ing you to take the per­spec­tive where you haven’t yet seen a bridge fall down from res­o­nance, and so you don’t think about it.

Yes, and in that perspective, the mathematical model can tell me about resonance. It’s actually incredibly easy: resonance appears already in simple harmonic oscillators. Moreover, even if I did not explicitly understand resonance, if I proved that the bridge is stable under certain assumptions about the magnitudes and spacetime spectrum of external forces, that proof automatically guarantees that resonance will not crash the bridge (as long as the assumptions are realistic). Obviously people have not been so cautious over history, but that doesn’t mean we should be careless about AGI as well.
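As a rough illustration of the harmonic oscillator point, the sketch below uses the textbook steady-state amplitude of a damped oscillator driven by a sinusoidal force. The parameter values (unit mass, unit natural frequency, damping 0.01) are arbitrary assumptions chosen only to make the effect visible; this is not a bridge model.

```python
import math


def steady_state_amplitude(force: float, omega: float, mass: float = 1.0,
                           omega0: float = 1.0, damping: float = 0.01) -> float:
    """Steady-state amplitude of x'' + damping*x' + omega0**2 * x = (force/mass)*cos(omega*t)."""
    return (force / mass) / math.sqrt(
        (omega0**2 - omega**2) ** 2 + (damping * omega) ** 2
    )


FORCE = 1.0  # the same force magnitude in every case below

static_response = steady_state_amplitude(FORCE, omega=0.0)    # constant push
resonant_response = steady_state_amplitude(FORCE, omega=1.0)  # driven at omega0
fast_response = steady_state_amplitude(FORCE, omega=3.0)      # driven well above omega0
```

With the force magnitude held fixed, the response at the natural frequency is about a hundred times the static response for these parameters, which is why a bound on magnitude alone is not enough and the spectrum has to be part of the assumptions.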

I understand the argument that sometimes creating and analyzing a realistic mathematical model is difficult. I agree that under time pressure it might be better to compromise on a combination of unrealistic mathematical models, empirical data and informal reasoning. But I don’t understand why we should give up so soon. We can work towards realistic mathematical models and prepare fallbacks, and even if we don’t arrive at a realistic mathematical model it is likely that the effort will produce valuable insights.

Maybe I’m fal­ling prey to the typ­i­cal mind fal­lacy, but I re­ally doubt that you use math­e­mat­i­cal mod­els to write code in the way that I mean, and I sus­pect you in­stead mi­s­un­der­stood what I meant.

Like, if I asked you to write code to check if an el­e­ment is pre­sent in an ar­ray, do you prove the­o­rems? I cer­tainly ex­pect that you have an in­tu­itive model of how your pro­gram­ming lan­guage of choice works, and that model in­forms the code that you write, but it seems wrong to me to de­scribe what I do, what all of my stu­dents do, and what I ex­pect you do as us­ing a “math­e­mat­i­cal the­ory of how to write code”.

First, if I am asked to check whether an el­e­ment is in an ar­ray, or some other easy ma­nipu­la­tion of data struc­tures, I ob­vi­ously don’t liter­ally start prov­ing a the­o­rem with pen­cil and pa­per. How­ever, my not-fully-for­mal rea­son­ing is such that I could prove a the­o­rem if I wanted to. My model is not ex­actly “in­tu­itive”: I could ex­plic­itly ex­plain ev­ery step. And, this is ex­actly how all of math­e­mat­ics works! Math­e­mat­i­ci­ans don’t write proofs that are ma­chine ver­ifi­able (some peo­ple do that to­day, but it’s a novel and tiny frac­tion of math­e­mat­ics). They write proofs that are good enough so that all the in­for­mal steps can be eas­ily made for­mal by any­one with rea­son­able back­ground in the field (but ac­tu­ally do­ing that would be very la­bor in­ten­sive).
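As an illustration of reasoning that is informal but could be made formal, here is a minimal sketch of the array-membership example with the loop invariant written out as comments. The function name `contains` is hypothetical, and the comments are an informal correctness argument, not a machine-checked proof.

```python
def contains(items, target):
    """Return True if target occurs in items (linear search)."""
    for i, item in enumerate(items):
        # Invariant: target does not occur in items[:i].
        if item == target:
            return True  # Witness found at index i.
    # The loop ended, so the invariant holds for the whole list: target is absent.
    return False

print(contains([3, 1, 4], 4), contains([3, 1, 4], 2))
```

Each comment corresponds to a step that could be turned into a formal proof obligation (the invariant holds initially, is preserved by each iteration, and implies the result on exit), which is the sense of "I could explicitly explain every step" above.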

Se­cond, what I ac­tu­ally meant is ex­am­ples like, I am us­ing an al­gorithm to solve a sys­tem of lin­ear equa­tions, or find the max­i­mal match­ing in a graph, or find a ro­ta­tion ma­trix that min­i­mizes the sum of square dis­tances be­tween two sets, be­cause I have a proof that this al­gorithm works (or, in some cases, a proof that it at least pro­duces the right an­swer when it con­verges). More­over, this ap­plies to prob­lems that ex­plic­itly in­volve the phys­i­cal world as well, such as Kal­man filters or con­trol loops.

Of course, in the lat­ter case we need to make some as­sump­tions about the phys­i­cal world in or­der to prove any­thing. It’s true that in ap­pli­ca­tions the as­sump­tions are of­ten false, and we merely hope that they are good enough ap­prox­i­ma­tions. But, when the ex­tra effort is jus­tified, we can do bet­ter: we can perform a math­e­mat­i­cal anal­y­sis of how much the vi­o­la­tion of these as­sump­tions af­fects the re­sult. Then, we can use out­side knowl­edge to ver­ify that the vi­o­la­tions are within the per­mis­si­ble mar­gin.

Third, we could also liter­ally prove ma­chine-ver­ifi­able the­o­rems about the code. This is called for­mal ver­ifi­ca­tion, and peo­ple do that some­times when the stakes are high (as they definitely are with AGI), al­though in this case I have no per­sonal ex­pe­rience. But, this is just a “side benefit” of what I was talk­ing about. We need the math­e­mat­i­cal the­ory to know that our al­gorithms are safe. For­mal ver­ifi­ca­tion “merely” tells us that the im­ple­men­ta­tion doesn’t have bugs (which is some­thing we should definitely worry about too, when it be­comes rele­vant).

I’m cu­ri­ous what you think doesn’t re­quire build­ing a math­e­mat­i­cal the­ory? It seems to me that pre­dict­ing whether or not we are doomed if we don’t have a proof of safety is the sort of thing the AI safety com­mu­nity has done a lot of with­out a math­e­mat­i­cal the­ory. (Like, that’s how I in­ter­pret the rocket al­ign­ment and se­cu­rity mind­set posts.)

I’m not sure about the scope of your ques­tion? I made a sand­wich this morn­ing with­out build­ing math­e­mat­i­cal the­ory :) I think that the AI safety com­mu­nity definitely pro­duced some im­por­tant ar­gu­ments about AI risk, and these ar­gu­ments are valid ev­i­dence. But, I con­sider most of the big ques­tions to be far from set­tled, and I don’t see how they could be set­tled only with this kind of rea­son­ing.

• You made a claim a few com­ments above:

But ul­ti­mately, you need the­o­ret­i­cal knowl­edge to know what can be safely in­ferred from these ex­per­i­ments. Without the­ory you can­not ex­trap­o­late.

I’m strug­gling to un­der­stand what you mean by “the­ory” here, and the pro­gram­ming ex­am­ple was try­ing to get at that, but not very suc­cess­fully. So let’s take the sand­wich ex­am­ple:

I made a sand­wich this morn­ing with­out build­ing math­e­mat­i­cal the­ory :)

Pre­sum­ably the in­gre­di­ents were in a slightly differ­ent con­figu­ra­tion than you had ever seen them be­fore, but you were still able to “ex­trap­o­late” to figure out how to make a sand­wich any­way. Why didn’t you need the­ory for that ex­trap­o­la­tion?

Obviously this is a silly example, but I don’t currently see any qualitative difference between sandwich-making-extrapolation, and the sort of extrapolation we do when we make qualitative arguments about AI risk. Why trust the former but not the latter? One answer is that the latter is more complex, but you seem to be arguing something else.

• I hadn’t re­al­ized this post was nom­i­nated, par­tially be­cause of my com­ment, so here’s a late re­view. I ba­si­cally con­tinue to agree with ev­ery­thing I wrote then, and I con­tinue to like this post for those rea­sons, and so I sup­port in­clud­ing it in the LW Re­view.

Since writing the comment, I’ve come across another argument for thinking about intent alignment—it seems like a “generalization” of assistance games / CIRL, which itself seems like a formalization of an aligned agent in a toy setting. In assistance games, the agent explicitly maintains a distribution over possible human reward functions, and instrumentally gathers information about human preferences by interacting with the human. With intent alignment, since the agent is trying to help the human, we expect the agent to instrumentally maintain a belief over what the human cares about, and gather information to refine this belief. We might hope that there are ways to achieve intent alignment that instrumentally incentivize all the nice behaviors of assistance games, without requiring the modeling assumptions that CIRL does (e.g. that the human has a fixed known reward function).
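For readers unfamiliar with the assistance-game picture, the following is a toy sketch of what "maintains a distribution over possible human reward functions" and "gathers information by interacting with the human" can look like. Everything specific in it is an assumption made for illustration: the two candidate reward functions, the Boltzmann-rational model of human choice, and the value of `beta`. It is not taken from the CIRL formalism itself.

```python
import math

# Hypothetical candidate reward functions: each maps an option to a reward.
CANDIDATE_REWARDS = {
    "likes_tea": {"tea": 1.0, "coffee": 0.0},
    "likes_coffee": {"tea": 0.0, "coffee": 1.0},
}

def choice_likelihood(reward, choice, beta=2.0):
    """P(human picks `choice`) under a Boltzmann-rational model (a modeling assumption)."""
    weights = {option: math.exp(beta * value) for option, value in reward.items()}
    return weights[choice] / sum(weights.values())

def update_belief(belief, choice):
    """Bayes update of the belief over reward hypotheses after observing one human choice."""
    unnormalized = {
        name: prior * choice_likelihood(CANDIDATE_REWARDS[name], choice)
        for name, prior in belief.items()
    }
    total = sum(unnormalized.values())
    return {name: weight / total for name, weight in unnormalized.items()}

# Start from a uniform prior, then observe the human choosing tea once.
belief = {"likes_tea": 0.5, "likes_coffee": 0.5}
belief = update_belief(belief, "tea")
print(belief)
```

The fixed hypothesis set and the explicit choice model are exactly the kind of modeling assumptions that the paragraph above hopes an intent-aligned agent could do without, while still ending up with similar belief-refining behavior.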

Changes I’d make to my com­ment:

It iso­lates the ma­jor, ur­gent difficulty in a sin­gle sub­prob­lem. If we make an AI sys­tem that tries to do what we want, it could cer­tainly make mis­takes, but it seems much less likely to cause eg. hu­man ex­tinc­tion.

I still think that the in­tent al­ign­ment /​ mo­ti­va­tion prob­lem is the most ur­gent, but there are cer­tainly other prob­lems that mat­ter as well, so I would prob­a­bly re­move or clar­ify that point.

• Ul­ti­mately, our goal is to build AI sys­tems that do what we want them to do. One way of de­com­pos­ing this is first to define the be­hav­ior that we want from an AI sys­tem, and then to figure out how to ob­tain that be­hav­ior, which we might call the defi­ni­tion-op­ti­miza­tion de­com­po­si­tion. Am­bi­tious value learn­ing aims to solve the defi­ni­tion sub­prob­lem. I in­ter­pret this post as propos­ing a differ­ent de­com­po­si­tion of the over­all prob­lem. One sub­prob­lem is how to build an AI sys­tem that is try­ing to do what we want, and the sec­ond sub­prob­lem is how to make the AI com­pe­tent enough that it ac­tu­ally does what we want. I like this mo­ti­va­tion-com­pe­tence de­com­po­si­tion for a few rea­sons:

• It iso­lates the ma­jor, ur­gent difficulty in a sin­gle sub­prob­lem. If we make an AI sys­tem that tries to do what we want, it could cer­tainly make mis­takes, but it seems much less likely to cause eg. hu­man ex­tinc­tion. (Though it is cer­tainly pos­si­ble, for ex­am­ple by build­ing an un­al­igned suc­ces­sor AI sys­tem, as men­tioned in the post.) In con­trast, with the defi­ni­tion-op­ti­miza­tion de­com­po­si­tion, we need to solve both speci­fi­ca­tion prob­lems with the defi­ni­tion and ro­bust­ness prob­lems with the op­ti­miza­tion.

• Hu­mans seem to solve the mo­ti­va­tion sub­prob­lem, whereas hu­mans don’t seem to solve ei­ther the defi­ni­tion or the op­ti­miza­tion sub­prob­lems. I can definitely imag­ine a hu­man le­gi­t­i­mately try­ing to help me, whereas I can’t re­ally imag­ine a hu­man know­ing how to de­rive op­ti­mal be­hav­ior for my goals, nor can I imag­ine a hu­man that can ac­tu­ally perform the op­ti­mal be­hav­ior to achieve some ar­bi­trary goal.

• It is eas­ier to ap­ply to sys­tems with­out much ca­pa­bil­ity, though as the post notes, it prob­a­bly still does need to have some level of ca­pa­bil­ity. While a digit recog­ni­tion sys­tem is use­ful, it doesn’t seem mean­ingful to talk about whether it is “try­ing” to help us.

• Re­lat­edly, the safety guaran­tees seem to de­grade more slowly and smoothly. With defi­ni­tion-op­ti­miza­tion, if you get the defi­ni­tion even slightly wrong, Good­hart’s Law sug­gests that you can get very bad out­comes. With mo­ti­va­tion-com­pe­tence, I’ve already ar­gued that in­com­pe­tence prob­a­bly leads to small prob­lems, not big ones, and slightly worse mo­ti­va­tion might not make a huge differ­ence be­cause of some­thing analo­gous to the basin of at­trac­tion around cor­rigi­bil­ity. This de­pends a lot on what “slightly worse” means for mo­ti­va­tion, but I’m op­ti­mistic.

• We’ve been work­ing with the defi­ni­tion-op­ti­miza­tion de­com­po­si­tion for quite some time now by mod­el­ing AI sys­tems as ex­pected util­ity max­i­miz­ers, and we’ve found a lot of nega­tive re­sults and not very many pos­i­tive ones.

• The mo­ti­va­tion-com­pe­tence de­com­po­si­tion ac­com­mo­dates in­ter­ac­tion be­tween the AI sys­tem and hu­mans, which defi­ni­tion-op­ti­miza­tion does not al­low (or at least, it makes it awk­ward to in­clude such in­ter­ac­tion).

The cons are:

• It is im­pre­cise and in­for­mal, whereas we can use the for­mal­ism of ex­pected util­ity max­i­miz­ers for the defi­ni­tion-op­ti­miza­tion de­com­po­si­tion.

• There hasn’t been much work done in this paradigm, so it is not ob­vi­ous that there is progress to make.

• I sus­pect many re­searchers would ar­gue that any suffi­ciently in­tel­li­gent sys­tem will be well-mod­eled as an ex­pected util­ity max­i­mizer and will have goals and prefer­ences it is op­ti­miz­ing for, and as a re­sult we need to deal with the prob­lems of ex­pected util­ity max­i­miz­ers any­way. Per­son­ally, I do not find this ar­gu­ment com­pel­ling, and hope to write about why in the near fu­ture. ETA: Writ­ten up in the chap­ter on Goals vs Utility Func­tions in the Value Learn­ing se­quence, par­tic­u­larly in Co­her­ence ar­gu­ments do not im­ply goal-di­rected be­hav­ior.

• This is a great com­ment, and maybe it should even be its own post. It clar­ified a bunch of things for me, and I think was the best con­cise ar­gu­ment for “we should try to build some­thing that doesn’t look like an ex­pected util­ity max­i­mizer” that I’ve read so far.

• I agree with habryka that this is a re­ally good ex­pla­na­tion. I also agree with most of your pros and cons, but for me an­other ma­jor con is that this de­com­po­si­tion moves some prob­lems that I think are cru­cial and ur­gent out of “AI al­ign­ment” and into the “com­pe­tence” part, with the im­plicit or ex­plicit im­pli­ca­tion that they are not as im­por­tant, for ex­am­ple the prob­lem of ob­tain­ing or helping hu­mans to ob­tain a bet­ter un­der­stand­ing of their val­ues and defend­ing their val­ues against ma­nipu­la­tion from other AIs.

In other words, the mo­ti­va­tion-com­pe­tence de­com­po­si­tion seems po­ten­tially very use­ful to me as a way to break down a larger prob­lem into smaller parts so it can be solved more eas­ily, but I don’t agree that the ur­gent/​not-ur­gent di­vide lines up neatly with the mo­ti­va­tion/​com­pe­tence di­vide.

Aside from the prac­ti­cal is­sue of con­fu­sion be­tween differ­ent us­ages of “AI al­ign­ment” (I think oth­ers like MIRI had been us­ing “AI al­ign­ment” in a broader sense be­fore Paul came up with his nar­rower defi­ni­tion), even us­ing “AI al­ign­ment” in a con­text where it’s clear that I’m us­ing Paul’s defi­ni­tion gives me the feel­ing that I’m im­plic­itly agree­ing to his un­der­stand­ing of how var­i­ous sub­prob­lems should be pri­ori­tized.

• Aside from the prac­ti­cal is­sue of con­fu­sion be­tween differ­ent us­ages of “AI al­ign­ment” (I think oth­ers like MIRI had been us­ing “AI al­ign­ment” in a broader sense be­fore Paul came up with his nar­rower defi­ni­tion)

I switched to this usage of AI alignment in 2017, after an email thread involving many MIRI people where Rob suggested using “AI alignment” to refer to what Bostrom calls the “second principal-agent problem” (he objected to my use of “control”). I think I misunderstood what Rob intended in that discussion, but my definition is meant to be in line with that—if the agent is trying to do what the principal wants, it seems like you’ve solved the principal-agent problem. I think the main way this definition is narrower than what was discussed in that email thread is by excluding things like boxing.

In prac­tice, es­sen­tially all of MIRI’s work seems to fit within this nar­rower defi­ni­tion, so I’m not too con­cerned at the mo­ment with this prac­ti­cal is­sue (I don’t know of any work MIRI feels strongly about that doesn’t fit in this defi­ni­tion). We had a thread about this af­ter it came up on LW in April, where we kind of de­cided to stick with some­thing like “ei­ther make the AI try­ing to do the right thing, or some­how cope with the prob­lems in­tro­duced by it try­ing to do the wrong thing” (so in­clud­ing things like box­ing), but to mostly not worry too much since in prac­tice ba­si­cally the same prob­lems are un­der both cat­e­gories.

I should have up­dated this post be­fore it got re­run as part of the se­quence.

• Note that Ar­bital defines “AI al­ign­ment” as:

The “al­ign­ment prob­lem for ad­vanced agents” or “AI al­ign­ment” is the over­ar­ch­ing re­search topic of how to de­velop suffi­ciently ad­vanced ma­chine in­tel­li­gences such that run­ning them pro­duces good out­comes in the real world.

and “to­tal al­ign­ment” as:

An ad­vanced agent can be said to be “to­tally al­igned” when it can as­sess the ex­act value of well-de­scribed out­comes and hence the ex­act sub­jec­tive value of ac­tions, poli­cies, and plans; where value has its over­rid­den mean­ing of a meta­syn­tac­tic vari­able stand­ing in for “what­ever we re­ally do or re­ally should value in the world or want from an Ar­tifi­cial In­tel­li­gence” (this is the same as “nor­ma­tive” if the speaker be­lieves in nor­ma­tivity).

I think this clearly in­cludes the kinds of prob­lems I’m talk­ing about in this thread. Do you agree? Also sup­port­ing my view is the his­tory of “Friendli­ness” be­ing a term that in­cluded the prob­lem of bet­ter un­der­stand­ing the user’s val­ues (as in CEV) and then MIRI giv­ing up that term in fa­vor of “al­ign­ment” as an ap­par­ently ex­act syn­onym. See this MIRI post which talks about “full al­ign­ment prob­lem for fully au­tonomous AGI sys­tems” and links to Ar­bital.

In prac­tice, es­sen­tially all of MIRI’s work seems to fit within this nar­rower defi­ni­tion, so I’m not too con­cerned at the mo­ment with this prac­ti­cal issue

I think you may have misunderstood what I meant by “practical issue”. My point was that if you say something like “I think AI alignment is the most urgent problem to work on” the listener could easily misinterpret you as meaning “alignment” in the MIRI/Arbital sense. Or if I say “AI alignment is the most urgent problem to work on” in the MIRI/Arbital sense of alignment, the listener could easily misinterpret me as meaning “alignment” in your sense.

Again my feel­ing is that MIRI started us­ing al­ign­ment in the broader sense first and there­fore that defi­ni­tion ought to have pri­or­ity. If you dis­agree with this, I could try to do some more his­tor­i­cal re­search to show this. (For ex­am­ple by figur­ing out when those Ar­bital ar­ti­cles were writ­ten, which I cur­rently don’t know how to do.)

• Again my feel­ing is that MIRI started us­ing al­ign­ment in the broader sense first and there­fore that defi­ni­tion ought to have pri­or­ity. If you dis­agree with this, I could try to do some more his­tor­i­cal re­search to show this. (For ex­am­ple by figur­ing out when those Ar­bital ar­ti­cles were writ­ten, which I cur­rently don’t know how to do.)

I think MIRI’s first use of this term was here where they said “We call a smarter-than-human system that reliably pursues beneficial goals ‘aligned with human interests’ or simply ‘aligned.’” which is basically the same as my definition. (Perhaps slightly weaker, since “do what the user wants you to do” is just one beneficial goal.) This talk never defines alignment, but the slide introducing the big picture says “Take-home message: We’re afraid it’s going to be technically difficult to point AIs in an intuitively intended direction” which also really suggests it’s about trying to point your AI in the right direction.

The actual discussion on that Arbital page strongly suggests that alignment is about pointing an AI in a direction, though I suppose that may merely be an instance of suggestively naming the field “alignment” and then defining it to be “whatever is important” as a way of smuggling in the connotation that pointing your AI in the right direction is the important thing. All of the topics in the “AI alignment” domain (except for mindcrime, which is borderline) fit under the narrower definition; the people on the list of alignment researchers are all working on the narrower problem.

So I think the way this term is used in prac­tice ba­si­cally matches this nar­rower defi­ni­tion.

As I men­tioned, I was pre­vi­ously hap­pily us­ing the term “AI con­trol.” Rob Bens­inger sug­gested that I stop us­ing that term and in­stead use AI al­ign­ment, propos­ing a defi­ni­tion of al­ign­ment that seemed fine to me.

I don’t think the very broad definition is what almost anyone has in mind when they talk about alignment. It doesn’t seem to be matching up with reality in any particular way, except insofar as it’s capturing the problems that a certain group of people work on. I don’t really see any argument in favor except the historical precedent, which I think is dubious in light of all of the conflicting definitions, the actual usage, and the explicit move to standardize on “alignment” where an alternative definition was proposed.

(In the dis­cus­sion, the com­pro­mise defi­ni­tion sug­gested was “cope with the fact that the AI is not try­ing to do what we want it to do, ei­ther by al­ign­ing in­cen­tives or by miti­gat­ing the effects of mis­al­ign­ment.”)

The “al­ign­ment prob­lem for ad­vanced agents” or “AI al­ign­ment” is the over­ar­ch­ing re­search topic of how to de­velop suffi­ciently ad­vanced ma­chine in­tel­li­gences such that run­ning them pro­duces good out­comes in the real world.

Is this in­tended (/​ do you un­der­stand this) to in­clude things like “make your AI bet­ter at pre­dict­ing the world,” since we ex­pect that agents who can make bet­ter pre­dic­tions will achieve bet­ter out­comes?

If this isn’t in­cluded, is that be­cause “suffi­ciently ad­vanced” in­cludes mak­ing good pre­dic­tions? Or be­cause of the em­piri­cal view that abil­ity to pre­dict the world isn’t an im­por­tant in­put into pro­duc­ing good out­comes? Or some­thing else?

If this defi­ni­tion doesn’t dis­t­in­guish al­ign­ment from ca­pa­bil­ities, then that seems like a non-starter to me which is nei­ther use­ful nor cap­tures the typ­i­cal us­age.

If this ex­cludes mak­ing bet­ter pre­dic­tion be­cause that’s as­sumed by “suffi­ciently ad­vanced agent,” then I have all sorts of other ques­tions (does “suffi­ciently ad­vanced” in­clude all par­tic­u­lar em­piri­cal knowl­edge rele­vant to mak­ing the world bet­ter? does it in­clude some ar­bi­trary cat­e­gory not ex­plic­itly carved out in the defi­ni­tion?)

In gen­eral, the al­ter­na­tive broader us­age of AI al­ign­ment is broad enough to cap­ture lots of prob­lems that would ex­ist whether or not we built AI. That’s not so differ­ent from us­ing the term to cap­ture (say) physics prob­lems that would ex­ist whether or not we built AI, both feel bad to me.

In­de­pen­dently of this is­sue, it seems like “the kinds of prob­lems you are talk­ing about in this thread” need bet­ter de­scrip­tions whether or not they are part of al­ign­ment (since even if they are part of al­ign­ment, they will cer­tainly in­volve to­tally differ­ent tech­niques/​skills/​im­pact eval­u­a­tions/​out­comes/​etc.).

• The ac­tual dis­cus­sion on that Ar­bital page strongly sug­gests that al­ign­ment is about point­ing an AI in a direction

But the page in­cludes:

“AI al­ign­ment the­ory” is meant as an over­ar­ch­ing term to cover the whole re­search field as­so­ci­ated with this prob­lem, in­clud­ing, e.g., the much-de­bated at­tempt to es­ti­mate how rapidly an AI might gain in ca­pa­bil­ity once it goes over var­i­ous par­tic­u­lar thresh­olds.

which seems to be out­side of just “point­ing an AI in a di­rec­tion”

Is this in­tended (/​​ do you un­der­stand this) to in­clude things like “make your AI bet­ter at pre­dict­ing the world,” since we ex­pect that agents who can make bet­ter pre­dic­tions will achieve bet­ter out­comes?

I think so, at least for cer­tain kinds of pre­dic­tions that seem es­pe­cially im­por­tant (i.e., may lead to x-risk if done badly), see this Ar­bital page which is un­der AI Align­ment:

Vingean re­flec­tion is rea­son­ing about cog­ni­tive sys­tems, es­pe­cially cog­ni­tive sys­tems very similar to your­self (in­clud­ing your ac­tual self), un­der the con­straint that you can’t pre­dict the ex­act fu­ture out­puts. We need to make pre­dic­tions about the con­se­quence of op­er­at­ing an agent in an en­vi­ron­ment via rea­son­ing on some more ab­stract level, some­how.

If this defi­ni­tion doesn’t dis­t­in­guish al­ign­ment from ca­pa­bil­ities, then that seems like a non-starter to me which is nei­ther use­ful nor cap­tures the typ­i­cal us­age.

It seems to me that Ro­hin’s pro­posal of dis­t­in­guish­ing be­tween “mo­ti­va­tion” and “ca­pa­bil­ities” is a good one, and then we can keep us­ing “al­ign­ment” for the set of broader prob­lems that are in line with the MIRI/​Ar­bital defi­ni­tion and ex­am­ples.

In gen­eral, the al­ter­na­tive broader us­age of AI al­ign­ment is broad enough to cap­ture lots of prob­lems that would ex­ist whether or not we built AI. That’s not so differ­ent from us­ing the term to cap­ture (say) physics prob­lems that would ex­ist whether or not we built AI, both feel bad to me.

It seems fine to me to in­clude 1) prob­lems that are greatly ex­ac­er­bated by AI and 2) prob­lems that aren’t caused by AI but may be best solved/​ame­lio­rated by some el­e­ment of AI de­sign, since these are prob­lems that AI re­searchers have a re­spon­si­bil­ity over and/​or can po­ten­tially con­tribute to. If there’s a prob­lem that isn’t ex­ac­er­bated by AI and does not seem likely to have a solu­tion within AI de­sign then I’d not in­clude that.

In­de­pen­dently of this is­sue, it seems like “the kinds of prob­lems you are talk­ing about in this thread” need bet­ter de­scrip­tions whether or not they are part of al­ign­ment (since even if they are part of al­ign­ment, they will cer­tainly in­volve to­tally differ­ent tech­niques/​​skills/​​im­pact eval­u­a­tions/​​out­comes/​​etc.).

Sure, agreed.

• for me an­other ma­jor con is that this de­com­po­si­tion moves some prob­lems that I think are cru­cial and ur­gent out of “AI al­ign­ment” and into the “com­pe­tence” part, with the im­plicit or ex­plicit im­pli­ca­tion that they are not as im­por­tant, for ex­am­ple the prob­lem of ob­tain­ing or helping hu­mans to ob­tain a bet­ter un­der­stand­ing of their val­ues and defend­ing their val­ues against ma­nipu­la­tion from other AIs.

I think it’s bad to use a defi­ni­tional move to try to im­plic­itly pri­ori­tize or de­pri­ori­tize re­search. I think I shouldn’t have writ­ten: “I like it less be­cause it in­cludes many sub­prob­lems that I think (a) are much less ur­gent, (b) are likely to in­volve to­tally differ­ent tech­niques than the ur­gent part of al­ign­ment.”

That said, I do think it’s im­por­tant that these seem like con­cep­tu­ally differ­ent prob­lems and that differ­ent peo­ple can have differ­ent views about their rel­a­tive im­por­tance—I re­ally want to dis­cuss them sep­a­rately, try to solve them sep­a­rately, com­pare their rel­a­tive val­ues (and sep­a­rate that from at­tempts to work on ei­ther).

I don’t think it’s ob­vi­ous that al­ign­ment is higher pri­or­ity than these prob­lems, or than other as­pects of safety. I mostly think it’s a use­ful cat­e­gory to be able to talk about sep­a­rately. In gen­eral I think that it’s good to be able to sep­a­rate con­cep­tu­ally sep­a­rate cat­e­gories, and I care about that par­tic­u­larly much in this case be­cause I care par­tic­u­larly much about this prob­lem. But I also grant that the term has in­er­tia be­hind it and so choos­ing its defi­ni­tion is a bit loaded and so some­one could ob­ject on those grounds even if they bought that it was a use­ful sep­a­ra­tion.

(I think that “defending their values against manipulation from other AIs” wasn’t included under any of the definitions of “alignment” proposed by Rob in our email discussion about possible definitions, so it doesn’t seem totally correct to refer to this as “moving” those subproblems, so much as there already existing a mess of imprecise definitions some of which included and some of which excluded those subproblems.)

• Yeah, that seems right. I would prob­a­bly defend the claim that mo­ti­va­tion con­tains the most ur­gent part in the same way that Paul has done in the past—it seems likely to be easy to get a well mo­ti­vated AI sys­tem to re­al­ize that it should help us un­der­stand our val­ues, and that it should not do ir­re­versible high-im­pact ac­tions un­til then. I’m less op­ti­mistic about defend­ing val­ues against ma­nipu­la­tion, be­cause you prob­a­bly need to be very com­pe­tent for that, and you can’t take your time to be­come more com­pe­tent, but that seems like a fur­ther-away prob­lem to me and so less ur­gent.

(I don’t think I have much to add over the dis­cus­sions you and Paul have had in the past, but I’m happy to clar­ify my opinion if it seems use­ful to you—per­haps my way of stat­ing things will click where Paul’s way didn’t, idk. Or I might have differ­ent opinions and not re­al­ize it.)

I would sup­port the idea of hav­ing this idea sim­ply as a de­com­po­si­tion and not also pack in the im­pli­ca­tion that mo­ti­va­tion/​com­pe­tence cor­re­sponds to ur­gent/​not-ur­gent, though I sus­pect it is quite hard to do that now.

• I’m happy to clar­ify my opinion if it seems use­ful to you—per­haps my way of stat­ing things will click where Paul’s way didn’t

I would highly wel­come that. BTW if you see me ar­gue with Paul in the fu­ture (or in the past) and I seem to be not get­ting some­thing, please feel free to jump in and ex­plain it a differ­ent way. I of­ten find it eas­ier to un­der­stand one of Paul’s ideas from some­one else’s ex­pla­na­tion.

it seems likely to be easy to get a well mo­ti­vated AI sys­tem to re­al­ize that it should help us un­der­stand our values

Yes, that seems easy, but ac­tu­ally helping seems much harder.

and that it should not do ir­re­versible high-im­pact ac­tions un­til then

How do you de­ter­mine what is “high-im­pact” be­fore you have a util­ity func­tion? Even “re­versible” is rel­a­tive to a util­ity func­tion, right? It doesn’t mean that you liter­ally can re­verse all the con­se­quences of an ac­tion, but rather that you can re­verse the im­pact of that ac­tion on your util­ity?

It seems to me that “avoid irreversible high-impact actions” would only work if one had a small amount of uncertainty over one’s utility function, in which case you could just avoid actions that are considered “irreversible high-impact” by any of the utility functions that you have significant probability mass on. But if you had a large amount of uncertainty, or just have very little idea what your utility function looks like, that doesn’t work because almost any action could be “irreversible high-impact”. For example if I were a negative utilitarian I perhaps ought to spend all my resources trying to stop technological progress leading to space colonization, so anything that I do besides that would be “irreversible high-impact” unless I could go back in time and change my resource allocation.
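The filtering rule described above can be sketched in a few lines. This is a toy model only: the action names, the hypotheses, their probabilities, the sets of flagged actions, and the `mass_threshold` cutoff are all invented for illustration, and `permitted_actions` is a hypothetical helper.

```python
def permitted_actions(actions, hypotheses, mass_threshold=0.05):
    """Return the actions that no significant hypothesis flags as irreversible and high-impact."""
    significant = [h for h in hypotheses if h["probability"] >= mass_threshold]
    return [
        action for action in actions
        if not any(action in h["flagged"] for h in significant)
    ]

ACTIONS = ["fund_space_program", "halt_technology", "wait_and_learn"]

# Low uncertainty: two similar utility hypotheses that flag the same single action.
narrow = [
    {"probability": 0.6, "flagged": {"halt_technology"}},
    {"probability": 0.4, "flagged": {"halt_technology"}},
]

# High uncertainty: the hypotheses disagree so much that every action is flagged by one of them.
broad = [
    {"probability": 0.4, "flagged": {"halt_technology"}},
    {"probability": 0.3, "flagged": {"fund_space_program", "wait_and_learn"}},
    {"probability": 0.3, "flagged": {"halt_technology", "wait_and_learn"}},
]

narrow_result = permitted_actions(ACTIONS, narrow)
broad_result = permitted_actions(ACTIONS, broad)
print(narrow_result, broad_result)
```

Under the narrow belief the rule still leaves options available, while under the broad belief it permits nothing, which is the failure mode the paragraph above is pointing at.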

BTW, here is a sec­tion from a draft post that I’m work­ing on. Do you think it would be easy to solve or avoid all of these prob­lems? (This post isn’t speci­fi­cally ad­dress­ing Paul’s ap­proach so some of them may be easy to solve un­der his ap­proach but I don’t think all of them are.)

How to pre­vent “al­igned” AIs from un­in­ten­tion­ally cor­rupt­ing hu­man val­ues? We know that ML sys­tems tend to have prob­lems with ad­ver­sar­ial ex­am­ples and dis­tri­bu­tional shifts in gen­eral. There seems to be no rea­son not to ex­pect that hu­man value func­tions have similar prob­lems, which even “al­igned” AIs could trig­ger un­less they are some­how de­signed not to. For ex­am­ple, such AIs could give hu­mans so much power so quickly or put them in such novel situ­a­tions that their moral de­vel­op­ment can’t keep up, so their value sys­tems no longer give sen­si­ble an­swers. (Sort of the AI as­sisted ver­sion of the clas­sic “power cor­rupts” prob­lem.) AIs could give us new op­tions that are ir­re­sistible to some parts of our mo­ti­va­tional sys­tems, like more pow­er­ful ver­sions of video game and so­cial me­dia ad­dic­tion. Even in the course of try­ing to figure out how the world could be made bet­ter for us, they could in effect be search­ing for ad­ver­sar­ial ex­am­ples on our value func­tions. Fi­nally, at our own re­quest or in a sincere at­tempt to help us, they could gen­er­ate philo­soph­i­cal or moral ar­gu­ments that are wrong but ex­tremely per­sua­sive.

(Some of these is­sues, like the in­ven­tion of new ad­dic­tions and new tech­nolo­gies in gen­eral, would hap­pen even with­out AI, but I think AIs would likely, by de­fault, strongly ex­ac­er­bate the prob­lem by differ­en­tially ac­cel­er­at­ing such tech­nolo­gies faster than progress in un­der­stand­ing how to avoid or safely han­dle them.)

I’m less op­ti­mistic about defend­ing val­ues against ma­nipu­la­tion, be­cause you prob­a­bly need to be very com­pe­tent for that, and you can’t take your time to be­come more com­pe­tent, but that seems like a fur­ther-away prob­lem to me and so less ur­gent.

Why is that a fur­ther-away prob­lem? Even if it is, we still need peo­ple to work on them now, if only to gen­er­ate per­sua­sive ev­i­dence in case they re­ally are very hard prob­lems so we can pur­sue some other strat­egy to avoid them like stop­ping or de­lay­ing the de­vel­op­ment of ad­vanced AI as much as pos­si­ble.

• How to pre­vent “al­igned” AIs from un­in­ten­tion­ally cor­rupt­ing hu­man val­ues? We know that ML sys­tems tend to have prob­lems with ad­ver­sar­ial ex­am­ples and dis­tri­bu­tional shifts in gen­eral. There seems to be no rea­son not to ex­pect that hu­man value func­tions have similar prob­lems, which even “al­igned” AIs could trig­ger un­less they are some­how de­signed not to. For ex­am­ple, such AIs could give hu­mans so much power so quickly or put them in such novel situ­a­tions that their moral de­vel­op­ment can’t keep up, so their value sys­tems no longer give sen­si­ble an­swers. (Sort of the AI as­sisted ver­sion of the clas­sic “power cor­rupts” prob­lem.) AIs could give us new op­tions that are ir­re­sistible to some parts of our mo­ti­va­tional sys­tems, like more pow­er­ful ver­sions of video game and so­cial me­dia ad­dic­tion. Even in the course of try­ing to figure out how the world could be made bet­ter for us, they could in effect be search­ing for ad­ver­sar­ial ex­am­ples on our value func­tions. Fi­nally, at our own re­quest or in a sincere at­tempt to help us, they could gen­er­ate philo­soph­i­cal or moral ar­gu­ments that are wrong but ex­tremely per­sua­sive.

My po­si­tion on this (that might be clear from pre­vi­ous dis­cus­sions):

• I agree this is a real prob­lem.

• From a technical perspective, I think this is even further from the alignment problem (than other AI safety problems), so I definitely think it should be studied separately and deserves a separate name. (Though the last bullet point in this comment implicitly gives an argument in the other direction.)

• I’d nor­mally frame this prob­lem as “so­ciety’s val­ues will evolve over time, and we have prefer­ences about how they evolve.” New tech­nol­ogy might change things in ways we don’t en­dorse. Nat­u­ral pres­sures like death may lead to changes we don’t en­dorse (though that’s a tricky val­ues call). The con­straint of re­main­ing eco­nom­i­cally/​mil­i­tar­ily com­pet­i­tive could also force our val­ues to evolve in a bad way (al­ign­ment is an in­stance of that prob­lem, and even­tu­ally AI+al­ign­ment would ad­dress the other nat­u­ral in­stance by de­cou­pling hu­man val­ues from the com­pe­tence needed to re­main com­pet­i­tive). And of course there is a hard prob­lem in that we don’t know how to de­liber­ate/​re­flect. The “figure out how to de­liber­ate” prob­lem seems like it is rel­a­tively eas­ily post­poned, since you don’t have to solve it un­til you are do­ing de­liber­a­tion, but the “help peo­ple avoid er­rors in de­liber­a­tion” may be more ur­gent.

• The reason I consider alignment more urgent is entirely quantitative and very empirically contingent; I don’t think there is any simple argument against. I think there is a >1/3 chance that AI will be solidly superhuman within 20 subjective years, and that in those scenarios alignment failure destroys maybe 20% of the total value of the future, leading to roughly 0.3%/year of losses from alignment failure, and right now it looks reasonably tractable. Influencing the trajectory of society’s values in other ways seems significantly worse than that to me (maybe 10x less cost-effective?). I think it would be useful to do some back-of-the-envelope calculations for the severity of value drift and the case for working on it.

• I don’t think I’m likely to work on this problem unless I become much more pessimistic about working on alignment (e.g. because the problem is much harder or easier than I currently believe); I feel like I’ve already poked at it enough that the value of information (VOI) from more poking is lower than that of just charging ahead on alignment. But that is a stronger judgment than the last section, and I think it is largely due to comparative advantage considerations, and I would certainly be supportive of work on this topic (e.g. would be happy to fund, would engage with it, etc.).

• This is a lead­ing con­tender for what I would do if al­ign­ment seemed un­ap­peal­ing, though I think that broader in­sti­tu­tional im­prove­ment /​ ca­pa­bil­ity en­hance­ment /​ etc. seems more ap­peal­ing. I’d definitely spend more time think­ing about it.

• I think that important versions of these problems really do exist with or without AI, although I agree that AI will accelerate the point at which they become critical while it’s not obvious whether it will accelerate solutions. I don’t think this is particularly important, but it does make me feel even more comfortable with the naming issue: this isn’t really a problem about AI at all, it’s just one of many issues that are modulated by AI.

• I think the main way AI is rele­vant to the cost-effec­tive­ness anal­y­sis of shap­ing-the-evolu­tion-of-val­ues is that it may de­crease the amount of work that can be done on these prob­lems be­tween now and when they be­come se­ri­ous (if AI is effec­tively ac­cel­er­at­ing the timeline for catas­trophic value change with­out ac­cel­er­at­ing work on mak­ing val­ues evolve in a way we’d en­dorse).

• To the extent that the value of working on these problems is dominated by that scenario (“AI has a large comparative disadvantage at helping us solve philosophical problems / thinking about long-term trajectory / etc.”), I think that one of the most promising interventions on this problem is improving the relative capability of AI at problems of this form. My current view is that working on factored cognition (and similarly on debate) is a reasonable approach to that. This isn’t a super important consideration, but it overall makes me (a) a bit more excited about factored cognition (especially in worlds where the broader iterated amplification program breaks down), (b) a bit less concerned about figuring out whether relative capabilities are more or less important than alignment.

• I would like to have clearer ways of talking and thinking about these problems, but (a) I think the next step is probably developing a better understanding (or, if someone already has a much better understanding, developing a better shared understanding), (b) I really want a word other than “alignment,” and probably multiple words. I guess the one that feels most urgently unnamed right now is something like: understanding how values evolve and what features may influence that evolution in a way we don’t endorse, including social dynamics, environmental factors, the need to remain competitive, and the dynamics of deliberation and argumentation.
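As a sanity check, the back-of-the-envelope figure in the urgency bullet above can be reproduced in a few lines. This is only a sketch: the function name `expected_annual_loss` is hypothetical, the inputs are the rough guesses stated in that bullet (a >1/3 chance, ~20% of value lost, 20 subjective years), and spreading the loss evenly over the years is a simplifying assumption.

```python
def expected_annual_loss(p_scenario: float, loss_fraction: float, years: float) -> float:
    """Expected fraction of the future's value lost per year.

    Assumes (for illustration only) that the expected loss is spread
    evenly over the time horizon.
    """
    if years <= 0:
        raise ValueError("years must be positive")
    return p_scenario * loss_fraction / years


# Rough inputs quoted in the bullet above; these are guesses, not measurements.
alignment_loss = expected_annual_loss(p_scenario=1 / 3, loss_fraction=0.20, years=20)

print(f"expected loss from alignment failure: {alignment_loss:.2%} per year")
```

This comes out at about a third of a percent per year, consistent with the "0.3%/year" figure. Changing any one input changes the result proportionally, which is why the estimate is so sensitive to the probability and loss guesses.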

• I’d nor­mally frame this prob­lem as “so­ciety’s val­ues will evolve over time, and we have prefer­ences about how they evolve.”

This state­ment of the prob­lem seems to as­sume a sub­jec­tivist or anti-re­al­ist view of metaethics (items 4 or 5 on this list). Con­sider the analo­gous state­ment, “math­e­mat­i­ci­ans’ be­liefs about math­e­mat­i­cal state­ments will evolve over time, and they have prefer­ences about how their be­liefs evolve”. I think a lot of math­e­mat­i­ci­ans would ob­ject to that and in­stead say that they pre­fer to have true be­liefs about math­e­mat­ics, and their “prefer­ences about how their be­liefs evolve” are just their best guesses about how to ar­rive at true be­liefs.

As­sum­ing you agree that we can’t be cer­tain about which metaeth­i­cal po­si­tion is cor­rect yet, I think by im­plic­itly adopt­ing a sub­jec­tivist/​anti-re­al­ist fram­ing, you make the prob­lem seem eas­ier than we should ex­pect it to be. It im­plies that in­stead of the AI (and in­di­rectly the AI de­signer) po­ten­tially hav­ing (if a re­al­ist or rel­a­tivist metaeth­i­cal po­si­tion is cor­rect) an obli­ga­tion/​op­por­tu­nity to help the user figure out what their true or nor­ma­tive val­ues are, which may in­volve solv­ing difficult metaeth­i­cal and other philo­soph­i­cal ques­tions, the AI can just fol­low the user’s prefer­ences about how their val­ues evolve.

Additionally, this framing makes the potential consequences of failing to solve the problem sound less serious than they could be. I.e., if there is such a thing as someone’s true or normative values, then failing to optimize the universe for those values is really bad, but if they just have preferences about how their values evolve, then even if their values fail to evolve in that way, at least whatever values the universe ends up being optimized for are still their values, so not all is lost.

I think I would pre­fer to frame the prob­lem as “How can we de­sign/​use AI to pre­vent the cor­rup­tion of hu­man val­ues, es­pe­cially cor­rup­tion caused/​ex­ac­er­bated by the de­vel­op­ment of AI?” and would con­sider this an in­stance of the more gen­eral prob­lem “When con­sid­er­ing AI safety, it’s not safe to as­sume that the hu­man user/​op­er­a­tor/​su­per­vi­sor is a gen­er­ally safe agent.”

In­fluenc­ing the tra­jec­tory of so­ciety’s val­ues in other ways seems sig­nifi­cantly worse than that to me (maybe 10x less cost-effec­tive?). I think it would be use­ful to do some back-of-the-en­velope calcu­la­tions for the sever­ity of value drift and the case for work­ing on it.

To me the x-risk of corrupting human values by well-motivated AI is comparable to the x-risk caused by badly-motivated AI (and both higher than 20% conditional on superhuman AI within 20 subjective years), but I’m not sure how to argue this with you. Even if the total risk of “value corruption” is 10x smaller, it seems like the marginal impact of an additional researcher on “value corruption” could be higher given that there are now about 20(?) full-time researchers working mostly on AI motivation but zero on this problem (as far as I know), and then we also have to consider the effect of a marginal researcher on the future growth of each field, and future effects on public opinion and policy makers. Unfortunately, I don’t know how to calculate these things even in a back-of-the-envelope way. As a rule of thumb, “if one x-risk seems X times bigger than another, it should have about X times as many people working on it” is intuitively appealing to me, and suggests we should have at least 2 people working on “value corruption” even if you think that risk is 10x smaller, but I’m not sure if that makes sense to you.
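The rule of thumb in the paragraph above is plain proportionality, which can be sketched as follows. The function name `proportional_headcount` is made up for illustration, and the inputs (about 20 researchers on motivation, a risk judged 10x smaller) are the rough figures from that paragraph rather than data.

```python
def proportional_headcount(reference_headcount: float, risk_ratio: float) -> float:
    """People for the smaller risk if headcount scales linearly with risk size.

    risk_ratio is how many times bigger the reference risk is judged to be.
    """
    if risk_ratio <= 0:
        raise ValueError("risk_ratio must be positive")
    return reference_headcount / risk_ratio


motivation_researchers = 20  # "about 20(?)" in the text; an estimate

# If "value corruption" is judged 10x smaller than the motivation risk:
print(proportional_headcount(motivation_researchers, risk_ratio=10))  # 2.0

# If the two risks are judged comparable:
print(proportional_headcount(motivation_researchers, risk_ratio=1))  # 20.0
```

The sketch ignores tractability and marginal-returns effects, so it should be read as a lower-effort baseline rather than a full allocation model.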

I don’t think I’m likely to work on this prob­lem un­less I ei­ther be­come much more pes­simistic about work­ing on alignment

I see no rea­son to con­vince you per­son­ally to work on “value cor­rup­tion” since your in­tu­ition on the rel­a­tive sever­ity of the risks is so differ­ent from mine, and un­der ei­ther of our views we ob­vi­ously still need peo­ple to work on mo­ti­va­tion /​ al­ign­ment-in-your-sense. I’m just hop­ing that you won’t (in­ten­tion­ally or un­in­ten­tion­ally) dis­cour­age peo­ple from work­ing on “value cor­rup­tion” so strongly that they don’t even con­sider look­ing into that prob­lem and form­ing their own con­clu­sions based on their own in­tu­itions/​pri­ors.

To the extent that the value of working on these problems is dominated by that scenario (“AI has a large comparative disadvantage at helping us solve philosophical problems / thinking about long-term trajectory / etc.”), I think that one of the most promising interventions on this problem is improving the relative capability of AI at problems of this form. My current view is that working on factored cognition (and similarly on debate) is a reasonable approach to that. This isn’t a super important consideration, but it overall makes me (a) a bit more excited about factored cognition (especially in worlds where the broader iterated amplification program breaks down), (b) a bit less concerned about figuring out whether relative capabilities are more or less important than alignment.

This seems to­tally rea­son­able to me, but 1) oth­ers may have other ideas about how to in­ter­vene on this prob­lem, and 2) even within fac­tored cog­ni­tion or de­bate there are prob­a­bly re­search di­rec­tions that skew to­wards be­ing more ap­pli­ca­ble to mo­ti­va­tion and re­search di­rec­tions that skew to­wards be­ing more ap­pli­ca­ble to “value cor­rup­tion” and I don’t want peo­ple to be ex­ces­sively dis­cour­aged from work­ing on the lat­ter by state­ments like “mo­ti­va­tion con­tains the most ur­gent part”.

• To me the x-risk of cor­rupt­ing hu­man val­ues by well-mo­ti­vated AI is com­pa­rable to the x-risk caused by badly-mo­ti­vated AI (and both higher than 20% con­di­tional on su­per­hu­man AI within 20 sub­jec­tive years), but I’m not sure how to ar­gue this with you.

If you think this risk is very large, pre­sum­ably there is some pos­i­tive ar­gu­ment for why it’s so large? That seems like the most nat­u­ral way to run the ar­gu­ment. I agree it’s not clear what ex­actly the norms of ar­gu­ment here are, but the very ba­sic one seems to be shar­ing the rea­son for great con­cern.

In the case of al­ign­ment there are a few lines of ar­gu­ment that we can flesh out pretty far. The ba­sic struc­ture is some­thing like: “(a) if we built AI with our cur­rent un­der­stand­ing there is a good chance it would not be try­ing to do what we wanted or have enough over­lap to give the fu­ture sub­stan­tial value, (b) if we built suffi­ciently com­pe­tent AI, the fu­ture would prob­a­bly be shaped by its in­ten­tions, (c) we have a sig­nifi­cant risk of not de­vel­op­ing suffi­ciently bet­ter un­der­stand­ing prior to hav­ing the ca­pa­bil­ity to build suffi­ciently com­pe­tent AI, (d) we have a sig­nifi­cant risk of build­ing suffi­ciently com­pe­tent AI even if we don’t have suffi­ciently good un­der­stand­ing.” (Each of those claims ob­vi­ously re­quires more ar­gu­ment, etc.)

One ver­sion of the case for wor­ry­ing about value cor­rup­tion would be:

• It seems plau­si­ble that the val­ues pur­sued by hu­mans are very sen­si­tive to changes in their en­vi­ron­ment.

• It may be that his­tor­i­cal vari­a­tion is it­self prob­le­matic, and we care mostly about our par­tic­u­lar val­ues.

• Or it may be that val­ues are “hard­ened” against cer­tain kinds of en­vi­ron­ment shift that oc­cur in na­ture, and that they will go to some lower “de­fault” level of ro­bust­ness un­der new kinds of shifts.

• Or it may be that nor­mal vari­a­tion is OK for de­ci­sion-the­o­retic rea­sons (since we are the benefi­cia­ries of past shifts) but new kinds of vari­a­tion are not OK.

• If so, the rate of change in sub­jec­tive time could be rea­son­ably high—per­haps the change that oc­curs within one gen­er­a­tion could shift value far enough to re­duce value by 50% (if that change wasn’t en­dorsed for de­ci­sion-the­o­retic rea­sons /​ hard­ened against).

• It’s plau­si­ble, per­haps 50%, that AI will ac­cel­er­ate kinds of change that lead to value drift rad­i­cally more than it ac­cel­er­ates an un­der­stand­ing of how to pre­vent such drift.

• A good un­der­stand­ing of how to pre­vent value drift might be used /​ be a ma­jor driver of how well we pre­vent such drift. (Or maybe some other fore­see­able in­sti­tu­tional char­ac­ter­is­tics could have a big effect on how much drift oc­curs.)

• If so, then it mat­ters a lot how well we un­der­stand how to pre­vent such drift at the time when we de­velop AI. Per­haps there will be sev­eral gen­er­a­tions worth of sub­jec­tive time /​ drift-driv­ing change be­fore we are able to do enough ad­di­tional la­bor to ob­so­lete our cur­rent un­der­stand­ing (since AI is ac­cel­er­at­ing change but not the rele­vant kind of la­bor).

• Our cur­rent un­der­stand­ing may not be good, and there may be a re­al­is­tic prospect of hav­ing a much bet­ter un­der­stand­ing.

This kind of story is kind of con­junc­tive, so I’d ex­pect to ex­plore a few lines of ar­gu­ment like this, and then try to figure out what are the most im­por­tant un­der­ly­ing un­cer­tain­ties (e.g. steps that ap­pear in most ar­gu­ments of this form, or a more fun­da­men­tal un­der­ly­ing cause for con­cern that gen­er­ates many differ­ent ar­gu­ments).

My most ba­sic con­cerns with this story are things like:

• In “well-con­trol­led” situ­a­tions, with prin­ci­pals who care about this is­sue, it feels like we already have an OK un­der­stand­ing of how to avert drift (con­di­tioned on solv­ing al­ign­ment). It seems like the ba­sic idea is to de­cou­ple evolv­ing val­ues from the events in the world that are ac­tu­ally driv­ing com­pet­i­tive­ness /​ in­ter­act­ing with the nat­u­ral world /​ re­al­iz­ing peo­ple’s con­sump­tion /​ etc., which is di­rectly fa­cil­i­tated by al­ign­ment. The ex­treme form of this is hav­ing some hu­man in a box some­where (or maybe in cold stor­age) who will re­flect and grow on their own sched­ule, and who will ul­ti­mately as­sume con­trol of their re­sources once reach­ing ma­tu­rity. We’ve talked a lit­tle bit about this, and you’ve pointed out some rea­sons this kind of scheme isn’t to­tally satis­fac­tory even if it works as in­tended, but quan­ti­ta­tively the rea­sons you’ve pointed to don’t seem to be prob­a­ble enough (per eco­nomic dou­bling, say) to make the cost-benefit anal­y­sis work out.

• In most prac­ti­cal situ­a­tions, it doesn’t seem like “un­der­stand­ing of how to avert drift” is the key bot­tle­neck to avert­ing drift—it seems like the ba­sic prob­lem is that most peo­ple just don’t care about avert­ing drift at all, or have any in­cli­na­tion to be thought­ful about how their own prefer­ences evolve. That’s still some­thing you can in­ter­vene on, but it feels like a huge morass where you are com­pet­ing with many other forces.

In the end I’m do­ing a pretty rough calcu­la­tion that de­pends on a whole bunch of stuff, but those feel like they are maybe the most likely differ­ences in view /​ places where I have some­thing to say. Over­all I still think this prob­lem is rel­a­tively im­por­tant, but that’s how I get to the in­tu­itive view that it’s maybe ~10x lower im­pact. I would grant the ex­is­tence of (plenty of) peo­ple for whom it’s higher im­pact though.

As a rule of thumb, “if one x-risk seems X times bigger than another, it should have about X times as many people working on it” is intuitively appealing to me, and suggests we should have at least 2 people working on “value corruption” even if you think that risk is 10x smaller, but I’m not sure if that makes sense to you.

I think that seems roughly right, probably modulated by some O(1) factor reflecting tractability or other factors not captured in the total quantity of risk; maybe I’d expect us to have 2-10x more resources per unit risk devoted to more tractable risks.

In this case I’d be happy with the recommendation of ~10x more people working on motivation than on value drift; that feels like the right ballpark for basically the same reason that motivation feels ~10x more impactful.

I’m just hop­ing that you won’t (in­ten­tion­ally or un­in­ten­tion­ally) dis­cour­age peo­ple from work­ing on “value cor­rup­tion” so strongly that they don’t even con­sider look­ing into that prob­lem and form­ing their own con­clu­sions based on their own in­tu­itions/​pri­ors. [...] I don’t want peo­ple to be ex­ces­sively dis­cour­aged from work­ing on the lat­ter by state­ments like “mo­ti­va­tion con­tains the most ur­gent part”.

I agree I should be more care­ful about this.

I do think that mo­ti­va­tion con­tains the most ur­gent/​im­por­tant part and feel pretty com­fortable ex­press­ing that view (for the same rea­sons I’m gen­er­ally in­clined to ex­press my views), but could hedge more when mak­ing state­ments like this.

(I think say­ing “X is more ur­gent than Y” is ba­si­cally com­pat­i­ble with the view “There should be 10 peo­ple work­ing on X for each per­son work­ing on Y,” even if one also be­lieves “but ac­tu­ally on the cur­rent mar­gin in­vest­ment in Y might be a bet­ter deal.” Will edit the post to be a bit softer here though.

ETA: ac­tu­ally I think the lan­guage in the post ba­si­cally re­flects what I meant, the broader defi­ni­tion seems worse be­cause it con­tains tons of stuff that is lower pri­or­ity. The nar­rower defi­ni­tion doesn’t con­tain ev­ery prob­lem that is high pri­or­ity, it just con­tains a sin­gle high pri­or­ity prob­lem, which is bet­ter than a re­ally broad bas­ket con­tain­ing a mix of im­por­tant and not-that-im­por­tant stuff. But I will likely write a sep­a­rate post or two at some point about value drift and other im­por­tant prob­lems other than mo­ti­va­tion.)

• If you think this risk is very large, pre­sum­ably there is some pos­i­tive ar­gu­ment for why it’s so large?

Yeah, I didn’t liter­ally mean that I don’t have any ar­gu­ments, but rather that we’ve dis­cussed it in the past and it seems like we didn’t get close to re­solv­ing our dis­agree­ment. I tend to think that Au­mann Agree­ment doesn’t ap­ply to hu­mans, and it’s fine to dis­agree on these kinds of things. Even if agree­ment ought to be pos­si­ble in prin­ci­ple (which again I don’t think is nec­es­sar­ily true for hu­mans), if you think that even from your per­spec­tive the value drift/​cor­rup­tion prob­lem is cur­rently overly ne­glected, then we can come back and re­visit this at an­other time (e.g., when you think there’s too many peo­ple work­ing on this prob­lem, which might never ac­tu­ally hap­pen).

it seems like the ba­sic prob­lem is that most peo­ple just don’t care about avert­ing drift at all, or have any in­cli­na­tion to be thought­ful about how their own prefer­ences evolve

I don’t un­der­stand how this is com­pat­i­ble with only 2% loss from value drift/​cor­rup­tion. Do you per­haps think the ac­tual loss is much big­ger, but al­most cer­tainly we just can’t do any­thing about it, so 2% is how much you ex­pect we can po­ten­tially “save” from value drift/​cor­rup­tion? Or are you tak­ing an anti-re­al­ist po­si­tion and say­ing some­thing like, if some­one doesn’t care about avert­ing drift/​cor­rup­tion, then how­ever their val­ues drift that doesn’t con­sti­tute any loss?

The nar­rower defi­ni­tion doesn’t con­tain ev­ery prob­lem that is high pri­or­ity, it just con­tains a sin­gle high pri­or­ity prob­lem, which is bet­ter than a re­ally broad bas­ket con­tain­ing a mix of im­por­tant and not-that-im­por­tant stuff.

I don’t understand “better” in what sense. Whatever it is, why wouldn’t it be even better to have two terms, one of which is broadly defined so as to include all the problems that might be urgent but also includes lower priority problems and problems whose priority we’re not sure about, and another one that is defined to be a specific urgent problem? Do you currently have any objections to using “AI alignment” as the broader term (in line with the MIRI/Arbital definition and examples) and “AI motivation” as the narrower term (as suggested by Rohin)?

• Do you cur­rently have any ob­jec­tions to us­ing “AI al­ign­ment” as the broader term (in line with the MIRI/​Ar­bital defi­ni­tion and ex­am­ples) and “AI mo­ti­va­tion” as the nar­rower term (as sug­gested by Ro­hin)?

Yes:

• The vast ma­jor­ity of ex­ist­ing us­ages of “al­ign­ment” should then be re­placed by “mo­ti­va­tion,” which is more spe­cific and usu­ally just as ac­cu­rate. If you are go­ing to split a term into new terms A and B, and you find that the vast ma­jor­ity of ex­ist­ing us­age should be A, then I claim that “A” should be the one that keeps the old word.

• The word “alignment” was chosen (originally by Stuart Russell I think) precisely because it is such a good name for the problem of aligning AI values with human values; it’s a word that correctly evokes what that problem is about. This is also how MIRI originally introduced the term. (I think they introduced it here, where they said “We call a smarter-than-human system that reliably pursues beneficial goals ‘aligned with human interests’ or simply ‘aligned.’”) Everywhere that anyone talks about alignment they use the analogy with “pointing,” and even MIRI folks usually talk about alignment as if it were mostly or entirely about pointing your AI in the right direction.

• In con­trast, “al­ign­ment” doesn’t re­ally make sense as a name for the en­tire field of prob­lems about mak­ing AI good. For the prob­lem of mak­ing AI benefi­cial we already have the even older term “benefi­cial AI,” which re­ally means ex­actly that. In ex­plain­ing why MIRI doesn’t like that term, Rob said

Some of the main things I want from a term are:

A. It clearly and con­sis­tently keeps the fo­cus on sys­tem de­sign and en­g­ineer­ing, and what­ever tech­ni­cal/​con­cep­tual ground­work is needed to suc­ceed at such. I want to make it easy for peo­ple (if they want to) to just hash out those tech­ni­cal is­sues, with­out feel­ing any pres­sure to dive into de­bates about bad ac­tors and in­ter-group dy­nam­ics, or gar­den-va­ri­ety ma­chine ethics and moral philos­o­phy, which carry a lot of de­rail /​ suck-the-en­ergy-out-of-the-room risk.

[…] [“AI safety” or “beneficial AI”] doesn’t work so well for A; it’s commonly used to include things like misuse risk.

• [continuing last point] The proposed usage of “alignment” doesn’t meet this desideratum, though; it has exactly the same problem as “beneficial AI,” except that it’s historically associated with this community. In particular it absolutely includes “garden-variety machine ethics and moral philosophy.” Yes, there is all sorts of stuff that MIRI or I wouldn’t care about that is relevant to “beneficial” AI, but under the proposed definition of alignment it’s also relevant to “aligned” AI. (This statement by Rob also makes me think that you wouldn’t in fact be happy with what he at least means by “alignment,” since I take it you explicitly mean to include moral philosophy?)

• People have introduced a lot of terms and change them frequently. I’ve changed the language on my blog multiple times at other people’s request. This isn’t costless; it really does make things more and more confusing.

• I think “AI mo­ti­va­tion” is not a good term for this area of study: it (a) sug­gests it’s about the study of AI mo­ti­va­tion rather than en­g­ineer­ing AI to be mo­ti­vated to help hu­mans, (b) is go­ing to be per­ceived as ag­gres­sively an­thro­po­mor­phiz­ing (even if “al­ign­ment” is only slightly bet­ter), (c) is gen­er­ally less op­ti­mized (re­lated to the sec­ond point above, “al­ign­ment” is quite a good term for this area).

• Prob­a­bly “al­ign­ment” /​ “value al­ign­ment” would be a bet­ter split of terms than “al­ign­ment” vs. “mo­ti­va­tion”. “Value al­ign­ment” has tra­di­tion­ally been used with the de re read­ing, but I could clar­ify that I’m work­ing on de dicto value al­ign­ment when more pre­ci­sion is needed (ev­ery­thing I work on is also rele­vant on the de re read­ing, so the other in­ter­pre­ta­tion is also ac­cu­rate and just less pre­cise).

I guess I have an analo­gous ques­tion for you: do you cur­rently have any ob­jec­tions to us­ing “benefi­cial AI” as the broader term, and “AI al­ign­ment” as the nar­rower term?

• This is also how MIRI originally introduced the term. (I think they introduced it here, where they said “We call a smarter-than-human system that reliably pursues beneficial goals ‘aligned with human interests’ or simply ‘aligned.’”)

But that definition seems quite different from your “A is trying to do what H wants it to do.” For example, if H has a wrong understanding of his/her true or normative values and as a result wants A to do something that is actually harmful, then under your definition A would still be “aligned” but under MIRI’s definition it wouldn’t be (because it wouldn’t be pursuing beneficial goals).

This state­ment by Rob also makes me think that you wouldn’t in fact be happy with what he at least means by “al­ign­ment,” since I take it you ex­plic­itly mean to in­clude moral philos­o­phy?

I think that’s right. When I say MIRI/Arbital definition of “alignment” I’m referring to what they’ve posted publicly, and I believe it does include moral philosophy. Rob’s statement that you quoted seems to be a private one (I don’t recall seeing it before and can’t find it through Google search) but I can certainly see how it muddies the waters from your perspective.

Prob­a­bly “al­ign­ment” /​ “value al­ign­ment” would be a bet­ter split of terms than “al­ign­ment” vs. “mo­ti­va­tion”. “Value al­ign­ment” has tra­di­tion­ally been used with the de re read­ing, but I could clar­ify that I’m work­ing on de dicto value al­ign­ment when more pre­ci­sion is needed

This seems fine to me, if you could give the benefit of doubt as to when more precision is needed. I’m basically worried about this scenario: You or someone else writes something like “I’m cautiously optimistic about Paul’s work.” The reader recalls seeing you say that you work on “value alignment”. They match that to what they’ve read from MIRI about how aligned AI “reliably pursues beneficial goals”, and end up thinking that is easier than you’d intend, or think there is more disagreement between alignment researchers about the difficulty of the broader problem than there actually is. If you could consistently say that the goal of your work is “de dicto value alignment” then that removes most of my worry.

I guess I have an analo­gous ques­tion for you: do you cur­rently have any ob­jec­tions to us­ing “benefi­cial AI” as the broader term, and “AI al­ign­ment” as the nar­rower term?

This ac­tu­ally seems best to me on the mer­its of the terms alone (i.e., putting his­tor­i­cal us­age aside), and I’d be fine with it if ev­ery­one could co­or­di­nate to switch to these terms/​defi­ni­tions.

• But that definition seems quite different from your “A is trying to do what H wants it to do.” For example, if H has a wrong understanding of his/her true or normative values and as a result wants A to do something that is actually harmful, then under your definition A would still be “aligned” but under MIRI’s definition it wouldn’t be (because it wouldn’t be pursuing beneficial goals).

“Do what H wants me to do” seems to me to be an example of a beneficial goal, so I’d say a system which is trying to do what H wants it to do is pursuing a beneficial goal. It may also be pursuing subgoals which turn out to be harmful, if e.g. it’s wrong about what H wants or has other mistaken empirical beliefs. I don’t think anyone could be advocating the definition “pursues no harmful subgoals,” since that basically requires perfect empirical knowledge (it seems just as hard as never taking a harmful action). Does that seem right to you?

I’ve been as­sum­ing that “re­li­ably pur­sues benefi­cial goals” is weaker than the defi­ni­tion I pro­posed, but prac­ti­cally equiv­a­lent as a re­search goal.

I’m basically worried about this scenario: You or someone else writes something like “I’m cautiously optimistic about Paul’s work.” The reader recalls seeing you say that you work on “value alignment”. They match that to what they’ve read from MIRI about how aligned AI “reliably pursues beneficial goals”, and end up thinking that is easier than you’d intend, or think there is more disagreement between alignment researchers about the difficulty of the broader problem than there actually is. If you could consistently say that the goal of your work is “de dicto value alignment” then that removes most of my worry.

I think it’s reasonable for me to be more careful about clarifying what any particular research agenda does or does not aim to achieve. I think that in most contexts that is going to require more precision than just saying “AI alignment,” regardless of how the term was defined; I normally clarify by saying something like “an AI which is at least trying to help us get what we want.”

This ac­tu­ally seems best to me on the mer­its of the terms alone (i.e., putting his­tor­i­cal us­age aside), and I’d be fine with it if ev­ery­one could co­or­di­nate to switch to these terms/​defi­ni­tions.

My guess is that MIRI folks won’t like the “benefi­cial AI” term be­cause it is too broad a tent. (Which is also my ob­jec­tion to the pro­posed defi­ni­tion of “AI al­ign­ment,” as “over­ar­ch­ing re­search topic of how to de­velop suffi­ciently ad­vanced ma­chine in­tel­li­gences such that run­ning them pro­duces good out­comes in the real world.”) My sense is that if that were their po­si­tion, then you would also be un­happy with their pro­posed us­age of “AI al­ign­ment,” since you seem to want a broad tent that makes min­i­mal as­sump­tions about what prob­lems will turn out to be im­por­tant. Does that seem right?

(They might also dis­like “benefi­cial AI” be­cause of ran­dom con­tin­gent facts about how it’s been used in the past, and so might want a differ­ent term with the same mean­ing.)

My own feel­ing is that us­ing “benefi­cial AI” to mean “AI that pro­duces good out­comes in the world” is ba­si­cally just us­ing “benefi­cial” in ac­cor­dance with its usual mean­ing, and this isn’t a case where a spe­cial tech­ni­cal term is needed (and in­deed it’s weird to have a tech­ni­cal term whose defi­ni­tion is pre­cisely cap­tured by a sin­gle—differ­ent—word).

• “Do what H wants me to do” seems to me to be an example of a beneficial goal, so I’d say a system which is trying to do what H wants it to do is pursuing a beneficial goal. It may also be pursuing subgoals which turn out to be harmful, if e.g. it’s wrong about what H wants or has other mistaken empirical beliefs. I don’t think anyone could be advocating the definition “pursues no harmful subgoals,” since that basically requires perfect empirical knowledge (it seems just as hard as never taking a harmful action). Does that seem right to you?

I guess both “re­li­able” and “benefi­cial” are mat­ters of de­gree so “al­igned” in the sense of “re­li­ably pur­sues benefi­cial goals” is also a mat­ter of de­gree. “Do what H wants A to do” would be a mod­er­ate de­gree of al­ign­ment whereas “Suc­cess­fully figur­ing out and satis­fy­ing H’s true/​nor­ma­tive val­ues” would be a much higher de­gree of al­ign­ment (in that sense of al­ign­ment). Mean­while in your sense of al­ign­ment they are at best equally al­igned and the lat­ter might ac­tu­ally be less al­igned if H has a wrong idea of metaethics or what his true/​nor­ma­tive val­ues are and as a re­sult try­ing to figure out and satisfy those val­ues is not some­thing that H wants A to do.

I think that in most con­texts that is go­ing to re­quire more pre­ci­sion than just say­ing “AI al­ign­ment” re­gard­less of how the term was defined, I nor­mally clar­ify by say­ing some­thing like “an AI which is at least try­ing to help us get what we want.”

That seems good too.

My guess is that MIRI folks won’t like the “beneficial AI” term because it is too broad a tent. (Which is also my objection to the proposed definition of “AI alignment,” as “overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world.”) My sense is that if that were their position, then you would also be unhappy with their proposed usage of “AI alignment,” since you seem to want a broad tent that makes minimal assumptions about what problems will turn out to be important. Does that seem right?

This para­graph greatly con­fuses me. My un­der­stand­ing is that some­one from MIRI (prob­a­bly Eliezer) wrote the Ar­bital ar­ti­cle defin­ing “AI al­ign­ment” as “over­ar­ch­ing re­search topic of how to de­velop suffi­ciently ad­vanced ma­chine in­tel­li­gences such that run­ning them pro­duces good out­comes in the real world”, which satis­fies my de­sire to have a broad tent term that makes min­i­mal as­sump­tions about what prob­lems will turn out to be im­por­tant. I’m fine with call­ing this “benefi­cial AI” in­stead of “AI al­ign­ment” if ev­ery­one can co­or­di­nate on this (but I don’t know how MIRI peo­ple feel about this). I don’t un­der­stand why you think ‘MIRI folks won’t like the “benefi­cial AI” term be­cause it is too broad a tent’ given that some­one from MIRI gave a very broad defi­ni­tion to “AI al­ign­ment”. Do you per­haps think that Ar­bital ar­ti­cle was writ­ten by a non-MIRI per­son?

• “Do what H wants A to do” would be a mod­er­ate de­gree of al­ign­ment whereas “Suc­cess­fully figur­ing out and satis­fy­ing H’s true/​nor­ma­tive val­ues” would be a much higher de­gree of al­ign­ment (in that sense of al­ign­ment).

In what sense is that a more benefi­cial goal?

• “Suc­cess­fully do X” seems to be the same goal as X, isn’t it?

• “Figure out H’s true/​nor­ma­tive val­ues” is man­i­festly a sub­goal of “satisfy H’s true/​nor­ma­tive val­ues.” Why would we care about that ex­cept as a sub­goal?

• So is the differ­ence en­tirely be­tween “satisfy H’s true/​nor­ma­tive val­ues” and “do what H wants”? Do you dis­agree with one of the pre­vi­ous two bul­let points? Is the differ­ence that you think “re­li­ably pur­sues” im­plies some­thing about “ac­tu­ally achieves”?

If the differ­ence is mostly be­tween “what H wants” and “what H truly/​nor­ma­tively val­ues”, then this is just a com­mu­ni­ca­tion difficulty. For me adding “truly” or “nor­ma­tively” to “val­ues” is just em­pha­sis and doesn’t change the mean­ing.

I try to make it clear that I’m us­ing “want” to re­fer to some hard-to-define ideal­iza­tion rather than some nar­row con­cept, but I can see how “want” might not be a good term for this, I’d be fine us­ing “val­ues” or some­thing along those lines if that would be clearer.

(This is why I wrote:

“What H wants” is even more problematic than “trying.” Clarifying what this expression means, and how to operationalize it in a way that could be used to inform an AI’s behavior, is part of the alignment problem. Without additional clarity on this concept, we will not be able to build an AI that tries to do what H wants it to do.

)

• If the differ­ence is mostly be­tween “what H wants” and “what H truly/​nor­ma­tively val­ues”, then this is just a com­mu­ni­ca­tion difficulty. For me adding “truly” or “nor­ma­tively” to “val­ues” is just em­pha­sis and doesn’t change the mean­ing.

Ah, yes that is a big part of what I thought was the differ­ence. (Ac­tu­ally I may have un­der­stood at some point that you meant “want” in an ideal­ized sense but then for­got and didn’t re-read the post to pick up that un­der­stand­ing again.)

ETA: I guess an­other thing that con­tributed to this con­fu­sion is your talk of val­ues evolv­ing over time, and of prefer­ences about how they evolve, which seems to sug­gest that by “val­ues” you mean some­thing like “cur­rent un­der­stand­ing of val­ues” or “in­terim val­ues” rather than “true/​nor­ma­tive val­ues” since it doesn’t seem to make sense to want one’s true/​nor­ma­tive val­ues to change over time.

I try to make it clear that I’m us­ing “want” to re­fer to some hard-to-define ideal­iza­tion rather than some nar­row con­cept, but I can see how “want” might not be a good term for this, I’d be fine us­ing “val­ues” or some­thing along those lines if that would be clearer.

I don’t think “val­ues” is good ei­ther. Both “want” and “val­ues” are com­monly used words that typ­i­cally (in ev­ery­day us­age) mean some­thing like “some­one’s cur­rent un­der­stand­ing of what they want” or what I called “in­terim val­ues”. I don’t see how you can ex­pect peo­ple not to be fre­quently con­fused if you use ei­ther of them to mean “true/​nor­ma­tive val­ues”. Like the situ­a­tion with de re /​ de dicto al­ign­ment, I sug­gest it’s not worth try­ing to econ­o­mize on the ad­jec­tives here.

Another differ­ence be­tween your defi­ni­tion of al­ign­ment and “re­li­ably pur­sues benefi­cial goals” is that the lat­ter has “re­li­ably” in it which sug­gests more of a de re read­ing. To use your ex­am­ple “Sup­pose A thinks that H likes ap­ples, and so goes to the store to buy some ap­ples, but H re­ally prefers or­anges.” I think most peo­ple would call an A that cor­rectly un­der­stands H’s prefer­ences (and gets or­anges) more re­li­ably pur­su­ing benefi­cial goals.

Given this, per­haps the eas­iest way to re­duce con­fu­sions mov­ing for­ward is to just use some ad­jec­tives to dis­t­in­guish your use of the words “want”, “val­ues”, or “al­ign­ment” from other peo­ple’s.

• If the differ­ence is mostly be­tween “what H wants” and “what H truly/​nor­ma­tively val­ues”, then this is just a com­mu­ni­ca­tion difficulty. For me adding “truly” or “nor­ma­tively” to “val­ues” is just em­pha­sis and doesn’t change the mean­ing.

So “wants” means a want more gen­eral than an ob­ject-level de­sire (like want­ing to buy or­anges), and it already takes into ac­count the pos­si­bil­ity of H chang­ing his mind about what he wants if H dis­cov­ers that his wants con­tra­dict his nor­ma­tive val­ues?

If that’s right, how is this gen­er­al­iza­tion defined? (E.g. The CEV was “what H wants in the limit of in­finite in­tel­li­gence, rea­son­ing time and com­plete in­for­ma­tion”.)

• I don’t un­der­stand why you think ‘MIRI folks won’t like the “benefi­cial AI” term be­cause it is too broad a tent’ given that some­one from MIRI gave a very broad defi­ni­tion to “AI al­ign­ment”. Do you per­haps think that Ar­bital ar­ti­cle was writ­ten by a non-MIRI per­son?

I don’t re­ally know what any­one from MIRI thinks about this is­sue. It was a guess based on (a) the fact that Rob didn’t like a num­ber of pos­si­ble al­ter­na­tive terms to “al­ign­ment” be­cause they seemed to be too broad a defi­ni­tion, (b) the fact that vir­tu­ally ev­ery MIRI us­age of “al­ign­ment” refers to a much nar­rower class of prob­lems than “benefi­cial AI” is usu­ally taken to re­fer to, (c) the fact that Eliezer gen­er­ally seems frus­trated with peo­ple talk­ing about other prob­lems un­der the head­ing of “benefi­cial AI.”

(But (c) might be driven by pow­er­ful AI vs. nearer-term con­cerns /​ all the other em­piri­cal er­rors Eliezer thinks peo­ple are mak­ing, (b) isn’t that in­dica­tive, and (a) might be driven by other cul­tural bag­gage as­so­ci­ated with the term /​ Rob was speak­ing off the cuff and not at­tempt­ing to speak for­mally for MIRI.)

I’d con­sider it great if we stan­dard­ized on “benefi­cial AI” to mean “AI that has good con­se­quences” and “AI al­ign­ment” to re­fer to the nar­rower prob­lem of al­ign­ing AI’s mo­ti­va­tion/​prefer­ences/​goals.

• I don’t un­der­stand how this is com­pat­i­ble with only 2% loss from value drift/​cor­rup­tion. Do you per­haps think the ac­tual loss is much big­ger, but al­most cer­tainly we just can’t do any­thing about it, so 2% is how much you ex­pect we can po­ten­tially “save” from value drift/​cor­rup­tion? Or are you tak­ing an anti-re­al­ist po­si­tion and say­ing some­thing like, if some­one doesn’t care about avert­ing drift/​cor­rup­tion, then how­ever their val­ues drift that doesn’t con­sti­tute any loss?

10x worse was origi­nally my es­ti­mate for cost-effec­tive­ness, not for to­tal value at risk.

Peo­ple not car­ing about X prima fa­cie de­creases the re­turns to re­search on X. But may in­crease the re­turns for ad­vo­cacy (or ac­quiring re­sources/​in­fluence, or more cre­ative in­ter­ven­tions). That bul­let point was re­ally about the re­turns to re­search.

• Peo­ple not car­ing about X prima fa­cie de­creases the re­turns to re­search on X. But may in­crease the re­turns for ad­vo­cacy (or ac­quiring re­sources/​in­fluence, or more cre­ative in­ter­ven­tions). That bul­let point was re­ally about the re­turns to re­search.

It’s not ob­vi­ous that ap­plies here. If peo­ple don’t care strongly about how their val­ues evolve over time, that seem­ingly gives AIs /​ AI de­sign­ers an open­ing to have greater in­fluence over how peo­ple’s val­ues evolve over time, and im­plies a larger (or at least not ob­vi­ously smaller) re­turn on re­search into how to do this prop­erly. Or if peo­ple care a bit about pro­tect­ing their val­ues from ma­nipu­la­tion from other AIs but not a lot, it seems re­ally im­por­tant/​valuable to re­duce the cost of such pro­tec­tion as much as pos­si­ble.

As for ad­vo­cacy, it seems a lot eas­ier (at least for some­one in my po­si­tion) to con­vince a rel­a­tively small num­ber of AI de­sign­ers to build AIs that want to help their users evolve their val­ues in a pos­i­tive way (or figur­ing out what their true or nor­ma­tive val­ues are, or pro­tect­ing their val­ues against ma­nipu­la­tion), than to con­vince all the po­ten­tial users to want that them­selves.

• I agree that:

• If peo­ple care less about some as­pect of the fu­ture, then try­ing to get in­fluence over that as­pect of the fu­ture is more at­trac­tive (whether by build­ing tech­nol­ogy that they ac­cept as a de­fault, or by mak­ing an ex­plicit trade, or what­ever).

• A bet­ter un­der­stand­ing of how to pre­vent value drift can still be helpful if peo­ple care a lit­tle bit, and can be par­tic­u­larly use­ful to the peo­ple who care a lot (and there will be fewer peo­ple work­ing to de­velop such un­der­stand­ing if few peo­ple care).

I think that both

• (a) Try­ing to have in­fluence over as­pects of value change that peo­ple don’t much care about, and

• (b) bet­ter un­der­stand­ing the im­por­tant pro­cesses driv­ing changes in values

are rea­son­able things to do to make the fu­ture bet­ter. (Though some parts of (a) es­pe­cially are some­what zero-sum and I think it’s worth be­ing thought­ful about that.)

(I don’t agree with the sign of the effect de­scribed in your com­ment, but don’t think it’s an im­por­tant point /​ may just be a dis­agree­ment about what else we are hold­ing equal so it seems good to drop.)

• Try­ing to have in­fluence over as­pects of value change that peo­ple don’t much care about … [is] rea­son­able … to do to make the fu­ture better

This could refer to value change in AI controllers, like Hugh in HCH, or alternatively to value change in people living in the AI-managed world. I believe the latter could be good, but the former seems very questionable (here “value” refers to true/normative/idealized preference). So it’s hard for the same people to share the two roles. How do you ensure that value change remains good in the original sense without a reference to preference in the original sense, that hasn’t experienced any value change, a reference that remains in control? And for this discussion, it seems like the values of AI controllers (or AI+controllers) are what’s relevant.

It’s agent tiling for AI+controller agents: any value change in the whole seems to be a mistake. It might be OK to change values of subagents, but the whole shouldn’t show any value drift, only instrumentally useful tradeoffs that sacrifice less important aspects of what’s done for more important aspects, but still from the point of view of unchanged original values (to the extent that they are defined at all).

• As­sum­ing you agree that we can’t be cer­tain about which metaeth­i­cal po­si­tion is cor­rect yet, I think by im­plic­itly adopt­ing a sub­jec­tivist/​anti-re­al­ist fram­ing, you make the prob­lem seem eas­ier than we should ex­pect it to be.

I don’t see why the anti-realist version is any easier. My preferences about how my values evolve are complex and can depend on the endpoint of that evolution process and on arbitrarily complex logical facts. I think the analogous non-realistic mathematical framing is fine. If anything the realist versions seem easier to me (and this is related to why mathematics seems so much easier than morality), since you can anchor changing preferences to some underlying ground truth and have more prospect for error-correction, but I don’t think it’s a big difference.

Ad­di­tion­ally, this fram­ing also makes the po­ten­tial con­se­quences of failing to solve the prob­lem sound less se­ri­ous than it could po­ten­tially be. I.e., if there is such a thing as some­one’s true or nor­ma­tive val­ues, then failing to op­ti­mize the uni­verse for those val­ues is re­ally bad, but if they just have prefer­ences about how their val­ues evolve, then even if their val­ues fail to evolve in that way, at least what­ever val­ues the uni­verse ends up be­ing op­ti­mized for are still their val­ues, so not all is lost.

It doesn’t sound that way to me, but I’m happy to avoid fram­ings that might give peo­ple the wrong idea.

I think I would pre­fer to frame the prob­lem as “How can we de­sign/​use AI to pre­vent the cor­rup­tion of hu­man val­ues, es­pe­cially cor­rup­tion caused/​ex­ac­er­bated by the de­vel­op­ment of AI?”

My main com­plaint with this fram­ing (and the rea­son that I don’t use it) is that peo­ple re­spond badly to in­vok­ing the con­cept of “cor­rup­tion” here—it’s a fuzzy cat­e­gory that we don’t un­der­stand, and peo­ple seem to in­ter­pret it as the speaker want­ing val­ues to re­main static.

But in terms of the ac­tual mean­ings rather than their im­pacts on peo­ple, I’d be about as happy with “avoid­ing cor­rup­tion of val­ues” as “hav­ing our val­ues evolve in a pos­i­tive way.” I think both of them have small short­com­ings as fram­ings. My main prob­lem with cor­rup­tion is that it sug­gests an un­re­al­is­ti­cally bright line /​ down­plays our un­cer­tainty about how to think about chang­ing val­ues and what con­sti­tutes cor­rup­tion.

• I don’t see why the anti-re­al­ist ver­sion is any easier

It seems eas­ier in that the AI /​ AI de­signer doesn’t have to worry about the user be­ing wrong about how they want their val­ues to evolve. But you’re right that the re­al­ist ver­sion might be eas­ier in other ways, so per­haps what I should say in­stead is that the prob­lem definitely seems harder if we also in­clude the sub­prob­lem of figur­ing out what the right metaethics is in the first place, and (by im­plic­itly as­sum­ing a sub­set of all plau­si­ble metaeth­i­cal po­si­tions) the state­ment of the prob­lem that you pro­posed also does not con­vey a proper amount of un­cer­tainty in its difficulty.

My main com­plaint with this fram­ing (and the rea­son that I don’t use it) is that peo­ple re­spond badly to in­vok­ing the con­cept of “cor­rup­tion” here—it’s a fuzzy cat­e­gory that we don’t un­der­stand, and peo­ple seem to in­ter­pret it as the speaker want­ing val­ues to re­main static.

That’s a good point that I hadn’t thought of. (I guess talk­ing about “drift” has a similar is­sue though, in that peo­ple might mis­in­ter­pret it as the speaker want­ing val­ues to re­main static.) If you or any­one else have a sug­ges­tion about how to phrase the prob­lem so as to both avoid this is­sue and ad­dress my con­cerns about not as­sum­ing a par­tic­u­lar metaeth­i­cal po­si­tion, I’d highly wel­come that.

• It seems eas­ier in that the AI /​ AI de­signer doesn’t have to worry about the user be­ing wrong about how they want their val­ues to evolve.

That may be a con­no­ta­tion of the “prefer­ences about how their val­ues evolve,” but doesn’t seem like it fol­lows from the anti-re­al­ist po­si­tion.

I have prefer­ences over what ac­tions my robot takes. Yet if you asked me “what ac­tion do you want the robot to take?” I could be mis­taken. I need not have ac­cess to my own prefer­ences (since they can e.g. de­pend on em­piri­cal facts I don’t know). My prefer­ences over value evolu­tion can be similar.

In­deed, if moral re­al­ists are right, “ul­ti­mately con­verge to the truth” is a perfectly rea­son­able prefer­ence to have about how my prefer­ences evolve. (Though again this may not be cap­tured by the fram­ing “help peo­ple’s prefer­ences evolve in the way they want them to evolve.”) Per­haps the dis­tinc­tion is that there is some kind of ideal­iza­tion even of the way that prefer­ences evolve, and maybe at that point it’s eas­ier to just talk about preser­va­tion of ideal­ized prefer­ences (though that also has un­for­tu­nate im­pli­ca­tions and at least some minor tech­ni­cal prob­lems).

I guess talk­ing about “drift” has a similar is­sue though, in that peo­ple might mis­in­ter­pret it as the speaker want­ing val­ues to re­main static.

I agree that drift is also prob­le­matic.

• Would you agree with this way of stat­ing it: There are more ways for some­one to be wrong about their val­ues un­der re­al­ism than un­der anti-re­al­ism. Un­der re­al­ism some­one could be wrong even if they cor­rectly state their prefer­ences about how they want their val­ues to evolve, be­cause those prefer­ences could them­selves be wrong. So as­sum­ing an anti-re­al­ist po­si­tion makes the prob­lem sound eas­ier be­cause it im­plies there are fewer ways for the user to be wrong for the AI /​ AI de­signer to worry about.

• Could you give an ex­am­ple of a state­ment you think could be wrong on the re­al­ist per­spec­tive, for which there couldn’t be a pre­cisely analo­gous er­ror on the non-re­al­is­tic per­spec­tive?

There is some un­in­ter­est­ing se­man­tic sense in which there are “more ways to be wrong” (since there is a whole ex­tra cat­e­gory of state­ments that have truth val­ues...) but not a sense that is rele­vant to the difficulty of build­ing an AI.

I might be using the word “values” in a different way than you. I think I can say something like “I’d like to deliberate in way X” and be wrong. I guess under non-realism I’m “incorrectly stating my preferences” and under realism I could be “correctly stating my preferences but be wrong,” but I don’t see how to translate that difference into any situation where I build an AI that is adequate on one perspective but inadequate on the other.

• Sup­pose the user says “I want to try to figure out my true/​nor­ma­tive val­ues by do­ing X. Please help me do that.” If moral anti-re­al­ism is true, then the AI can only check if the user re­ally wants to do X (e.g., by look­ing into the user’s brain and check­ing if X is en­coded as a prefer­ence some­where). But if moral re­al­ism is true, the AI could also use its own un­der­stand­ing of metaethics and metaphilos­o­phy to pre­dict if do­ing X would re­li­ably lead to the user’s true/​nor­ma­tive val­ues, and warn the user or re­fuse to help or take some other ac­tion if the an­swer is no. Or if one can’t be cer­tain about metaethics yet, and it looks like X might pre­ma­turely lock the user into the wrong val­ues, the AI could warn the user about that.

• I definitely don’t mean such a nar­row sense of “want my val­ues to evolve.” Seems worth us­ing some lan­guage to clar­ify that.

In gen­eral the three op­tions seem to be:

• You care about what is “good” in the re­al­ist sense.

• You care about what the user “ac­tu­ally wants” in some ideal­ized sense.

• You care about what the user “cur­rently wants” in some nar­row sense.

It seems to me that the first two are pretty similar. (And if you are uncertain about whether realism is true, and you’d be in the first case if you accepted realism, it seems like you’d probably be in the second case if you rejected realism. Of course that would depend on the nature of your uncertainty about realism; your views could depend in an arbitrary way on whether realism is true or false depending on what versions of realism/non-realism are competing, but I’m assuming something like the most common realist and non-realist views around here.)

To defend my origi­nal us­age both in this thread and in the OP, which I’m not that at­tached to, I do think it would be typ­i­cal to say that some­one made a mis­take if they were try­ing to help me get what I wanted, but failed to no­tice or com­mu­ni­cate some cru­cial con­sid­er­a­tion that would to­tally change my views about what I wanted—the usual English us­age of these terms in­volves at least mild ideal­iza­tion.

• Yes, that seems easy, but ac­tu­ally helping seems much harder.

Longer form of my opinion:

Metaphilosophy is hard, and we need to solve it eventually. This might happen by default, i.e. if we simply build a well-motivated AI without thinking about metaphilosophy and without running any social interventions designed to get the AI’s operators to think about metaphilosophy, humanity might still realize that metaphilosophy needs to be solved, and then go ahead and solve it. I’m quite unsure right now whether or not it will happen by default.

How­ever, in the world where the AI’s op­er­a­tors don’t agree that we need to solve metaphilos­o­phy, I am very pes­simistic about the AI re­al­iz­ing that it should help us with metaphilos­o­phy and do­ing so. The one way I could imag­ine it hap­pen­ing is by pro­gram­ming in the right util­ity func­tion (not even learn­ing it, since if you learn it then you prob­a­bly learn that metaphilos­o­phy doesn’t need to be solved), which seems hope­lessly doomed. It seems re­ally hard to make an AI sys­tem where you can pre­dict in ad­vance that it will help us solve metaphilos­o­phy re­gard­less of the op­er­a­tor’s wishes.

In the world where the AI’s operators do agree that we need to solve metaphilosophy, I think we’re in a much better position. A background assumption I have is that humans motivated to solve metaphilosophy will be able to do so given enough time. I share Paul’s intuition that humans who no longer have to worry about food, water, shelter, disease, etc. could deliberate for a long time and make progress. In that case, a well-motivated AI would be fine: it would stay deferential, perhaps learn more things in order to be more competent, and do the things we ask it to do, which might include helping us in our deliberation by bringing up arguments we hadn’t considered yet. (And note a well-motivated AI should only bring up arguments it believes are true, or likely to be true.)

I’ve laid out two extreme ways the world could be, and of course there’s a spectrum between them. But thinking about the extremes makes me think of this not as a part of AI alignment, but as a social coordination problem. That is, we need to have humanity (especially the AI’s operators) agree that metaphilosophy is hard and needs to be solved. I’d support interventions that make this more likely, e.g. more public writing that talks about what we do after AGI, or about the possibility of a Great Deliberation before using the cosmic endowment, etc. If we succeed at that and at building a well-motivated AI system, I think that would be sufficient.

How do you de­ter­mine what is “high-im­pact” be­fore you have a util­ity func­tion? Even “re­versible” is rel­a­tive to a util­ity func­tion, right? It doesn’t mean that you liter­ally can re­verse all the con­se­quences of an ac­tion, but rather that you can re­verse the im­pact of that ac­tion on your util­ity?

I mean some­thing more like “don’t do things that a hu­man wouldn’t do, that seem crazy from a hu­man per­spec­tive”. I’m not sug­gest­ing that the AI has a perfect un­der­stand­ing of what “ir­re­versible” and “high-im­pact” mean. But it should be able to pre­dict what things a hu­man would find crazy for which it should prob­a­bly get the hu­man’s ap­proval be­fore do­ing the thing. (As an anal­ogy, most em­ploy­ees have a sense of what it is okay for them to take ini­ti­a­tive on, vs. what they should get their man­ager’s ap­proval for.)

For ex­am­ple if I were a nega­tive util­i­tar­ian I per­haps ought to spend all my re­sources try­ing to stop tech­nolog­i­cal progress lead­ing to space coloniza­tion, so any­thing that I do be­sides that would be “ir­re­versible high-im­pact” un­less I could go back in time and change my re­source al­lo­ca­tion.

Yeah, I mean something more like “continuation of the status quo” rather than “irreversible high-impact”, as TurnTrout talks about below.

Do you think it would be easy to solve or avoid all of these prob­lems?

I am not sure. I think it is rel­a­tively easy to look back at how we have re­sponded to similar events in the past and no­tice that some­thing is amiss—for ex­am­ple, it seems rel­a­tively easy for an AGI to figure out that power cor­rupts and that hu­man­ity has not liked it when that hap­pened, or that many hu­mans don’t like it when you take ad­van­tage of their mo­ti­va­tional sys­tems, and so to at least not be con­fi­dent in the ac­tions you men­tion. On the other hand, there may be similar types of events in the fu­ture that we can’t back out by look­ing at the past. I don’t know how to deal with these sorts of un­known un­knowns.

I think suffi­ciently nar­row AI sys­tems have es­sen­tially no hope of solv­ing or avoid­ing these prob­lems in gen­eral, re­gard­less of safety tech­niques we de­velop, and so in the short term to avoid these prob­lems you want to in­ter­vene on the hu­mans who are de­ploy­ing AI sys­tems.

Why is that a fur­ther-away prob­lem? Even if it is, we still need peo­ple to work on them now, if only to gen­er­ate per­sua­sive ev­i­dence in case they re­ally are very hard prob­lems so we can pur­sue some other strat­egy to avoid them like stop­ping or de­lay­ing the de­vel­op­ment of ad­vanced AI as much as pos­si­ble.

Yeah, looking back I don’t like that reason. I think I had an intuition that it wasn’t an urgent problem and wanted to jot a quick sentence to that effect, and the sentence came out wrong.

One rea­son it might not be ur­gent is be­cause we need to aim for com­pet­i­tive­ness any­way—our AI sys­tems need to be com­pet­i­tive so that eco­nomic in­cen­tives don’t cause us to use un­al­igned var­i­ants.

We can also aim to have the world mostly run by al­igned AI sys­tems rather than un­al­igned ones, which would mean that there isn’t much op­por­tu­nity for us to be ma­nipu­lated. You might have the in­tu­ition that even one un­al­igned AI could suc­cess­fully ma­nipu­late ev­ery­one’s val­ues, and so we would still need the al­igned AI sys­tems to be able to defend against that. I’m not sure where I stand on that—it seems pos­si­ble to me that this is just very hard to do, es­pe­cially when there are al­igned su­per­in­tel­li­gent sys­tems that would by de­fault put a stop to it if they find out about it.

But re­ally I’m just con­fused on this topic and would need to think more about it.

• we need to have hu­man­ity (es­pe­cially the AI’s op­er­a­tors) agree that metaphilos­o­phy is hard and needs to be solved

I’m not sure I un­der­stand your pro­posal here. What are they agree­ing to ex­actly? Stop­ping tech­nolog­i­cal de­vel­op­ment at a cer­tain level un­til metaphilos­o­phy is solved?

But it should be able to pre­dict what things a hu­man would find crazy for which it should prob­a­bly get the hu­man’s ap­proval be­fore do­ing the thing

Think of the hu­man as a re­ally badly de­signed AI with a con­voluted ar­chi­tec­ture that no­body un­der­stands, spaghetti code, full of se­cu­rity holes, has no idea what its ter­mi­nal val­ues are and is re­ally con­fused even about its “in­terim” val­ues, has all kinds of po­ten­tial safety prob­lems like not be­ing ro­bust to dis­tri­bu­tional shifts, and is only “safe” in the sense of hav­ing passed cer­tain tests for a very nar­row dis­tri­bu­tion of in­puts.

Clearly it’s not safe for a much more pow­er­ful outer AI to query the hu­man about ar­bi­trary ac­tions that it’s con­sid­er­ing, right? In­stead, if the hu­man is to con­tribute any­thing at all to safety in this situ­a­tion, the outer AI has to figure out how to gen­er­ate a bunch of smaller queries that the hu­man can safely han­dle, from which it would then in­fer what the hu­man would say if it could safely con­sider the ac­tual choice un­der con­sid­er­a­tion. If the AI is bad at this “com­pe­tence” prob­lem it could send un­safe queries to the hu­man and cor­rupt the hu­man, and/​or in­fer the wrong thing about what the hu­man would ap­prove of.

Is it clearer now why this doesn’t seem like an easy prob­lem to me?

for ex­am­ple, it seems rel­a­tively easy for an AGI to figure out that power cor­rupts and that hu­man­ity has not liked it when that happened

I’m not sure what you think the AGI would figure out, and what it would do in re­sponse to that. Are you sug­gest­ing some­thing like, based on his­tor­i­cal data, it would learn a clas­sifier to pre­dict what kind of new tech­nolo­gies or choices would change hu­man val­ues in a way that we would not like, and re­strict those tech­nolo­gies/​choices from us? It seems far from easy to do this in a ro­bust way. I mean this clas­sifier would be fac­ing lots of un­pre­dictable dis­tri­bu­tional shifts… I guess you made a similar point when you said “On the other hand, there may be similar types of events in the fu­ture that we can’t back out by look­ing at the past.”

ETA: Do you expect that different AIs would do different things in this regard depending on how cautious their operators are? Like some AIs would learn from their operators to be really cautious, and restrict technologies/choices that they aren’t sure won’t corrupt humans, but other operators and their AIs won’t be so cautious so a bunch of humans will be corrupted as a result, but that’s a lower priority problem because you think most AI operators will be really cautious so the percentage of value lost in the universe isn’t very high? (This is my current understanding of Paul’s position, and I wonder if you have a different position or a different way of putting it that would convince me more.) What about the problem that the corrupted humans/AIs could produce a lot of negative utility even if they are small in numbers? What about the problem of the cautious AIs being at a competitive disadvantage against other AIs who are less cautious about what they are willing to do?

I think suffi­ciently nar­row AI sys­tems have es­sen­tially no hope of solv­ing or avoid­ing these prob­lems in gen­eral, re­gard­less of safety tech­niques we de­velop, and so in the short term to avoid these prob­lems you want to in­ter­vene on the hu­mans who are de­ploy­ing AI sys­tems.

This seems right.

We can also aim to have the world mostly run by al­igned AI sys­tems rather than un­al­igned ones, which would mean that there isn’t much op­por­tu­nity for us to be ma­nipu­lated.

Ma­nipu­la­tion doesn’t have to come just from un­al­igned AIs, it could also come from AIs that are al­igned to other peo­ple. For ex­am­ple, if an AI is al­igned to Alice, and Alice sees some­thing to be gained by ma­nipu­lat­ing Bob, the AI be­ing al­igned won’t stop Alice from us­ing it to ma­nipu­late Bob.

ETA: I for­got to men­tion that I don’t un­der­stand this part, can you please ex­plain more:

One rea­son it might not be ur­gent is be­cause we need to aim for com­pet­i­tive­ness any­way—our AI sys­tems need to be com­pet­i­tive so that eco­nomic in­cen­tives don’t cause us to use un­al­igned var­i­ants.

• I’m not sure I un­der­stand your pro­posal here. What are they agree­ing to ex­actly? Stop­ping tech­nolog­i­cal de­vel­op­ment at a cer­tain level un­til metaphilos­o­phy is solved?

I don’t know, I want to out­source that de­ci­sion to hu­mans + AI at the time where it is rele­vant. Per­haps it in­volves stop­ping tech­nolog­i­cal de­vel­op­ment. Per­haps it means con­tin­u­ing tech­nolog­i­cal de­vel­op­ment, but not do­ing any space coloniza­tion. My point is sim­ply that if hu­mans agree that metaphilos­o­phy needs to be solved, and the AI is try­ing to help hu­mans, then metaphilos­o­phy will prob­a­bly be solved, even if I don’t know how ex­actly it will hap­pen.

Is it clearer now why this doesn’t seem like an easy prob­lem to me?

Yes. It seems to me like you’re con­sid­er­ing the case where a hu­man has to be able to give the cor­rect an­swer to any ques­tion of the form “is this ac­tion a good thing to do?” I’m claiming that we could in­stead grow the set of things the AI does grad­u­ally, to give time for hu­mans to figure out what it is they want. So I was imag­in­ing that hu­mans would an­swer the AI’s ques­tions in a frame where they have a lot of risk aver­sion, so any­thing that seemed par­tic­u­larly im­pact­ful would re­quire a lot of de­liber­a­tion be­fore be­ing ap­proved.

I’m not sure what you think the AGI would figure out, and what it would do in re­sponse to that. Are you sug­gest­ing some­thing like, based on his­tor­i­cal data, it would learn a clas­sifier to pre­dict what kind of new tech­nolo­gies or choices would change hu­man val­ues in a way that we would not like, and re­strict those tech­nolo­gies/​choices from us?

I was thinking more of the case where a single human amassed a lot of power. Humans don’t seem to have solved the problem of predicting how new technologies/choices would change human values, so that seems like quite a hard problem to solve (but perhaps AI could do it). I meant more that conditional on the AI knowing how some new technology or choice would affect us, it seems not too hard to figure out whether we would view it as a good thing.

Do you ex­pect that differ­ent AIs would do differ­ent things in this re­gard de­pend­ing on how cau­tious their op­er­a­tors are?

Yes.

that’s a lower pri­or­ity prob­lem be­cause you think most AI op­er­a­tors will be re­ally cau­tious so the per­centage of value lost in the uni­verse isn’t very high?

Kind of? I’d amend that slightly to say that to the ex­tent that I think it is a prob­lem (I’m not sure), I want to solve it in some way that is not tech­ni­cal re­search. (Pos­si­bil­ities: con­vince ev­ery­one to be cau­tious, ob­tain a de­ci­sive strate­gic ad­van­tage and en­force that ev­ery­one is cau­tious.)

What about the prob­lem that the cor­rupted hu­mans/​AIs could pro­duce a lot of nega­tive util­ity even if they are small in num­bers?

Same as above.

Ma­nipu­la­tion doesn’t have to come just from un­al­igned AIs, it could also come from AIs that are al­igned to other peo­ple. For ex­am­ple, if an AI is al­igned to Alice, and Alice sees some­thing to be gained by ma­nipu­lat­ing Bob, the AI be­ing al­igned won’t stop Alice from us­ing it to ma­nipu­late Bob.

Same as above. All of these problems that you’re talking about would also apply to technology that could make a human smarter. It seems like it would be easiest to address on that level, rather than trying to build an AI system that can deal with these problems even though the operator would not want it to correct for the problem.

What about the prob­lem of the cau­tious AIs be­ing at a com­pet­i­tive dis­ad­van­tage against other AIs who are less cau­tious about what they are will­ing to do?

This seems like an em­piri­cal fact that makes the prob­lems listed above harder to solve.

I for­got to men­tion that I don’t un­der­stand this part, can you please ex­plain more:
One rea­son it might not be ur­gent is be­cause we need to aim for com­pet­i­tive­ness any­way—our AI sys­tems need to be com­pet­i­tive so that eco­nomic in­cen­tives don’t cause us to use un­al­igned var­i­ants.

So I broadly agree with Paul’s rea­sons for aiming for com­pet­i­tive­ness. Given com­pet­i­tive­ness, you might hope that we would au­to­mat­i­cally get defense against value ma­nipu­la­tion by other AIs, since our al­igned AI will defend us from value ma­nipu­la­tion by similarly-ca­pa­ble un­al­igned AIs (or al­igned AIs that other peo­ple have). Of course, defense might be a lot harder than offense, and you prob­a­bly do think that, in which case this doesn’t re­ally help us. (As I said, I haven’t re­ally thought about this be­fore.)

Overall view: I don’t think that the problems you’ve mentioned are obviously going to be solved as a part of AI alignment. I think that solving them will require mostly interventions on humans, not on the development of AI. I am weakly optimistic that humans will actually be able to coordinate and solve these problems as a result. If I were substantially more pessimistic, I would put more effort into strategy and governance issues. (Not sure I would change what I’m doing given my comparative advantage at technical research, but it would at least change what I advise other people to do.)

Meta-view on our dis­agree­ment: I sus­pect that you have been talk­ing about the prob­lem of “mak­ing the fu­ture go well” while I’ve been talk­ing about the prob­lem of “get­ting AIs to do what we want” (which do seem like differ­ent prob­lems to me). Most of the prob­lems you’ve been talk­ing about don’t even make it into the bucket of “get­ting AIs to do what we want” the way I think about it, so some of the claims (like “the ur­gent part is in the mo­ti­va­tion sub­prob­lem”) are not meant to quan­tify over the prob­lems you’re iden­ti­fy­ing. I think we do dis­agree on how im­por­tant the prob­lems you iden­tify are, but not as much as you would think, since I’m quite un­cer­tain about this area of prob­lem-space.

• I am weakly op­ti­mistic that hu­mans will ac­tu­ally be able to co­or­di­nate and solve these prob­lems as a re­sult.

Why isn’t that also an ar­gu­ment against the ur­gency of solv­ing AI mo­ti­va­tion? I.e., we don’t need to ur­gently solve AI mo­ti­va­tion be­cause hu­mans will be able to co­or­di­nate to stop or de­lay AI de­vel­op­ment long enough to solve AI mo­ti­va­tion at leisure?

It seems to me that co­or­di­na­tion is re­ally hard. Yes we have to push on that, but we also have to push on po­ten­tial tech­ni­cal solu­tions be­cause most likely co­or­di­na­tion will fail, and there is enough un­cer­tainty about the difficulty of tech­ni­cal solu­tions that I think we ur­gently need more peo­ple to in­ves­ti­gate the prob­lems to see how hard they re­ally are.

Aside from that, I think it’s also really important to better predict/understand just how difficult solving those problems is (both socially and technically) because that understanding is highly relevant to strategic decisions we have to make today. For example if those problems are very difficult to solve so that in expectation we end up losing most of the potential value of the universe even if we solve AI motivation, then that greatly reduces the value of working on motivation relative to something like producing evidence of the difficulty of those problems in order to convince policymakers to try to coordinate on stopping/delaying AI progress, or trying to create a singleton AI. That’s why I was asking you for details of what you think the social solutions would look like.

so some of the claims (like “the ur­gent part is in the mo­ti­va­tion sub­prob­lem”) are not meant to quan­tify over the prob­lems you’re identifying

I see, in that case I would ap­pre­ci­ate dis­claimers or clearer ways of stat­ing that, so that peo­ple who might want to work on these prob­lems are not dis­cour­aged from do­ing so more strongly than you in­tend.

I’m quite un­cer­tain about this area of prob­lem-space

Ok, I ap­pre­ci­ate that.

• Why isn’t that also an ar­gu­ment against the ur­gency of solv­ing AI mo­ti­va­tion? I.e., we don’t need to ur­gently solve AI mo­ti­va­tion be­cause hu­mans will be able to co­or­di­nate to stop or de­lay AI de­vel­op­ment long enough to solve AI mo­ti­va­tion at leisure?

Two rea­sons come to mind:

• Stop­ping or de­lay­ing AI de­vel­op­ment feels more like try­ing to in­terfere with an already-run­ning pro­cess, whereas there are no ex­ist­ing norms on what we use AI for that we would have to fight against, and de­bates on those norms are already be­gin­ning. For new things, I ex­pect the pub­lic to be par­tic­u­larly risk-averse.

• Re­lat­edly, it is a lot eas­ier to make norms/​laws/​reg­u­la­tions now that bind our fu­ture selves. On an in­di­vi­d­ual level, it seems eas­ier to de­lay your chance of go­ing to Mars if you know you’re go­ing to get a hov­er­car soon. On a so­cietal scale, it seems eas­ier to de­lay space coloniza­tion if we’re go­ing to have lives of leisure due to au­toma­tion, or to de­lay full au­toma­tion if we’re soon go­ing to get 4 hour work­days. Look­ing at the things gov­ern­ments and cor­po­ra­tions say, it seems like they would be likely to do things like this. I think it makes a lot of sense to try and di­rect these efforts at the right tar­get.

I want to em­pha­size though that my method here was hav­ing an in­tu­ition and query­ing for rea­sons be­hind the in­tu­ition. I would be a lit­tle sur­prised if some­one could con­vince me my in­tu­ition is wrong in ~half an hour of con­ver­sa­tion. I would not be sur­prised if some­one could con­vince me that my rea­sons are wrong in ~half an hour of con­ver­sa­tion.

It seems to me that co­or­di­na­tion is re­ally hard. Yes we have to push on that, but we also have to push on po­ten­tial tech­ni­cal solu­tions be­cause most likely co­or­di­na­tion will fail, and there is enough un­cer­tainty about the difficulty of tech­ni­cal solu­tions that I think we ur­gently need more peo­ple to in­ves­ti­gate the prob­lems to see how hard they re­ally are.

I think it would help me if you sug­gested some ways that tech­ni­cal solu­tions could help with these prob­lems. For ex­am­ple, with co­or­di­nat­ing to pre­vent/​de­lay cor­rupt­ing tech­nolo­gies, the fun­da­men­tal prob­lem to me seems to be that with any tech­ni­cal solu­tion, the thing that the AI does will be against the op­er­a­tor’s wishes-upon-re­flec­tion. (If your tech­ni­cal solu­tion is in line with the op­er­a­tor’s wishes-upon-re­flec­tion, then I think you could also solve the prob­lem by solv­ing mo­ti­va­tion.) This seems both hard to de­sign (where does the AI get the in­for­ma­tion about what to do, if not from the op­er­a­tor’s wishes-upon-re­flec­tion?) as well as hard to im­ple­ment (why would the op­er­a­tor use a sys­tem that’s go­ing to do some­thing they don’t want?).

You might argue that there are things that the operator would want if they could get it (e.g. global coordination), but they can’t achieve it now, and so we need a technical solution for that. However, it seems like we are in the same position as a well-motivated AI w.r.t. that operator. For example, if we try to cede control to FairBots that rationally cooperate with each other, a well-motivated AI could also do that.

Aside from that, I think it’s also really important to better predict/understand just how difficult solving those problems is (both socially and technically) because that understanding is highly relevant to strategic decisions we have to make today. For example if those problems are very difficult to solve so that in expectation we end up losing most of the potential value of the universe even if we solve AI motivation, then that greatly reduces the value of working on motivation relative to something like producing evidence of the difficulty of those problems in order to convince policymakers to try to coordinate on stopping/delaying AI progress, or trying to create a singleton AI. That’s why I was asking you for details of what you think the social solutions would look like.

Agreed. I view a lot of strat­egy re­search (eg. from FHI and OpenAI) as figur­ing this out from the so­cial side, and some of my op­ti­mism is based on con­ver­sa­tions with those re­searchers. On the tech­ni­cal side, I feel quite stuck (for the rea­sons above), though I haven’t tried hard enough to say that it’s too difficult to do.

I see, in that case I would ap­pre­ci­ate dis­claimers or clearer ways of stat­ing that, so that peo­ple who might want to work on these prob­lems are not dis­cour­aged from do­ing so more strongly than you in­tend.

I’ll keep that in mind. When I wrote the origi­nal com­ment, I wasn’t even think­ing about prob­lems like the ones you men­tion, be­cause I cat­e­go­rize them as “strat­egy” by de­fault, and I was try­ing to talk about the tech­ni­cal prob­lem.

• Stop­ping or de­lay­ing AI de­vel­op­ment feels more like try­ing to in­terfere with an already-run­ning pro­cess, whereas there are no ex­ist­ing norms on what we use AI for that we would have to fight against, and de­bates on those norms are already be­gin­ning. For new things, I ex­pect the pub­lic to be par­tic­u­larly risk-averse.

Do you think that at the time when AI de­vel­op­ment wasn’t an already-run­ning pro­cess, and AI was still a new thing that the pub­lic could be ex­pected to be risk-averse about (when would you say that was?), the ar­gu­ment “work­ing on al­ign­ment isn’t ur­gent be­cause hu­mans can prob­a­bly co­or­di­nate to stop AI de­vel­op­ment” would have been a good one?

Re­lat­edly, it is a lot eas­ier to make norms/​​laws/​​reg­u­la­tions now that bind our fu­ture selves.

Same question here. Back when “don’t develop AI” was still something that would bind our future selves, should we have expected that we would coordinate to stop AI development, and it’s just bad luck that we haven’t succeeded in doing that?

Look­ing at the things gov­ern­ments and cor­po­ra­tions say, it seems like they would be likely to do things like this.

Can you be more spe­cific? What global agree­ment do you think would be reached, that is both re­al­is­tic and would solve the kinds of prob­lems that I’m wor­ried about (e.g., un­in­ten­tional cor­rup­tion of hu­mans by “al­igned” AIs who give hu­mans too much power or op­tions that they can’t han­dle, and de­liber­ate ma­nipu­la­tion of hu­mans by un­al­igned AIs or AIs al­igned to other users)?

I think it would help me if you sug­gested some ways that tech­ni­cal solu­tions could help with these prob­lems.

For ex­am­ple, cre­ate an AI that can help the user with philo­soph­i­cal ques­tions at least as much as tech­ni­cal ques­tions. (This could be done for ex­am­ple by figur­ing out how to bet­ter use Iter­ated Am­plifi­ca­tion to an­swer philo­soph­i­cal ques­tions, or how to do imi­ta­tion learn­ing of hu­man philoso­phers, or how to ap­ply in­verse re­in­force­ment learn­ing to philo­soph­i­cal rea­son­ing.) Then the user could ask ques­tions like “Am I likely to be cor­rupted by ac­cess to this tech­nol­ogy? What can I do to pre­vent that while still tak­ing ad­van­tage of it?” Or “Is this just an ex­tremely per­sua­sive at­tempt at ma­nipu­la­tion or an ac­tu­ally good moral ar­gu­ment?”

As an­other ex­am­ple, solve metaethics and build that into the AI so that the AI can figure out or learn the ac­tual ter­mi­nal val­ues of the user, which would make it eas­ier to pro­tect the user from ma­nipu­la­tion and self-cor­rup­tion. And even if the hu­man user is cor­rupted, the AI still has the cor­rect util­ity func­tion, and when it has made enough tech­nolog­i­cal progress it can un­cor­rupt the hu­man.

I view a lot of strat­egy re­search (eg. from FHI and OpenAI) as figur­ing this out from the so­cial side, and some of my op­ti­mism is based on con­ver­sa­tions with those re­searchers.

Can you point me to any rele­vant re­sults that have been writ­ten down, or ex­plain what you learned from those con­ver­sa­tions?

On the tech­ni­cal side, I feel quite stuck (for the rea­sons above), though I haven’t tried hard enough to say that it’s too difficult to do.

To ad­dress this and the ques­tion (from the par­allel thread) of whether you should per­son­ally work on this, I think we need peo­ple to ei­ther solve the tech­ni­cal prob­lems or at least to col­lec­tively try hard enough to con­vinc­ingly say that it’s too difficult to do. (Other­wise who is go­ing to con­vince poli­cy­mak­ers to adopt the very costly so­cial solu­tions? Who is go­ing to con­vince peo­ple to start/​join a so­cial move­ment to in­fluence poli­cy­mak­ers to con­sider those costly so­cial solu­tions? The fact that those things tend to take a lot of time seems like suffi­cient rea­son for ur­gency on the tech­ni­cal side, even if you ex­pect the so­cial solu­tions to be fea­si­ble.) Who are these peo­ple go­ing to be, es­pe­cially the first ones to join the field and help grow it? Prob­a­bly ex­ist­ing AI al­ign­ment re­searchers, right? (I can prob­a­bly make stronger ar­gu­ments in this di­rec­tion but I don’t want to be too “pushy” so I’ll stop here.)

• I forgot to follow up on this important part of our discussion:

All of these prob­lems that you’re talk­ing about would also ap­ply to tech­nol­ogy that could make a hu­man smarter.

It seems to me that a tech­nol­ogy that could make a hu­man smarter is much more likely (com­pared to AI) to ac­cel­er­ate all forms of in­tel­lec­tual progress (e.g., tech­nolog­i­cal progress and philo­soph­i­cal/​moral progress) about equally, and there­fore would have a less sig­nifi­cant effect on the kinds of prob­lems that I’m talk­ing about (which are largely caused by tech­nolog­i­cal progress out­pac­ing philo­soph­i­cal/​moral progress). I could make some ar­gu­ments about this, but I’m cu­ri­ous if this doesn’t seem ob­vi­ous to you.

As­sum­ing the above, and as­sum­ing that one has moral un­cer­tainty that gives some weight to the con­cept of moral re­spon­si­bil­ity, it seems to me that an ad­di­tional ar­gu­ment for AI re­searchers to work on these prob­lems is that it’s a moral re­spon­si­bil­ity of AI re­searchers/​com­pa­nies to try to solve prob­lems that they cre­ate, for ex­am­ple via tech­nolog­i­cal solu­tions, or by co­or­di­nat­ing amongst them­selves, or by con­vinc­ing poli­cy­mak­ers to co­or­di­nate, or by fund­ing oth­ers to work on these prob­lems, etc., and they are cur­rently ne­glect­ing to do this (es­pe­cially with re­gard to the par­tic­u­lar prob­lems that I’m point­ing out).

• It seems to me that a tech­nol­ogy that could make a hu­man smarter is much more likely (com­pared to AI) to ac­cel­er­ate all forms of in­tel­lec­tual progress (e.g., tech­nolog­i­cal progress and philo­soph­i­cal/​moral progress) about equally, and there­fore would have a less sig­nifi­cant effect on the kinds of prob­lems that I’m talk­ing about (which are largely caused by tech­nolog­i­cal progress out­pac­ing philo­soph­i­cal/​moral progress).

Yes, I agree with this. The rea­son I men­tioned that was to make the point that the prob­lems are a func­tion of progress in gen­eral and aren’t spe­cific to AI—they are just ex­ac­er­bated by AI. I think this is a weak rea­son to ex­pect that solu­tions are likely to come from out­side of AI.

As­sum­ing the above, and as­sum­ing that one has moral un­cer­tainty that gives some weight to the con­cept of moral re­spon­si­bil­ity, it seems to me that an ad­di­tional ar­gu­ment for AI re­searchers to work on these prob­lems is that it’s a moral re­spon­si­bil­ity of AI re­searchers/​com­pa­nies to try to solve prob­lems that they cre­ate, for ex­am­ple via tech­nolog­i­cal solu­tions, or by co­or­di­nat­ing amongst them­selves, or by con­vinc­ing poli­cy­mak­ers to co­or­di­nate, or by fund­ing oth­ers to work on these prob­lems, etc., and they are cur­rently ne­glect­ing to do this.

This seems true. Just to make sure I’m not mi­s­un­der­stand­ing, this was meant to be an ob­ser­va­tion, and not meant to ar­gue that I per­son­ally should pri­ori­tize this, right?

• The rea­son I men­tioned that was to make the point that the prob­lems are a func­tion of progress in gen­eral and aren’t spe­cific to AI—they are just ex­ac­er­bated by AI. I think this is a weak rea­son to ex­pect that solu­tions are likely to come from out­side of AI.

This doesn’t make much sense to me. Why is this any kind of rea­son to ex­pect that solu­tions are likely to come from out­side of AI? Can you give me an anal­ogy where this kind of rea­son­ing more ob­vi­ously makes sense?

Just to make sure I’m not mi­s­un­der­stand­ing, this was meant to be an ob­ser­va­tion, and not meant to ar­gue that I per­son­ally should pri­ori­tize this, right?

Right, this ar­gu­ment wasn’t tar­geted to you, but I think there are other rea­sons for you to per­son­ally pri­ori­tize this. See my com­ment in the par­allel thread.

• It seems to me that “avoid irreversible high-impact actions” would only work if one had a small amount of uncertainty over one’s utility function, in which case you could just avoid actions that are considered “irreversible high-impact” by any of the utility functions that you have significant probability mass on. But if you had a large amount of uncertainty, or just have very little idea what your utility function looks like, that doesn’t work because almost any action could be “irreversible high-impact”.

From the AUP per­spec­tive, this only seems true in a way analo­gous to the state­ment that “any hy­poth­e­sis can have ar­bi­trar­ily long de­scrip­tion length”. It’s pos­si­ble to make prac­ti­cally no as­sump­tions about what the true util­ity func­tion is and still re­cover a sen­si­ble no­tion of “low im­pact”. That is, pe­nal­iz­ing shifts in at­tain­able util­ity for even ran­dom or sim­ple func­tions still yields the de­sired be­hav­ior; I have ex­per­i­men­tal re­sults to this effect which aren’t yet pub­lished. This sug­gests that the no­tion of im­pact cap­tured by AUP isn’t de­pen­dent on re­al­iz­abil­ity of the true util­ity, and hence the broader thing Ro­hin is point­ing at should be doable.
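To make the claim about random auxiliary functions concrete, here is a minimal toy sketch of an attainable-utility-style penalty. It is only an illustration under stated assumptions, not the unpublished experiments mentioned above: the four-state world, the `REACHABLE` table, the task rewards, and the weight `lam` are all hypothetical. The penalty is the average absolute shift in attainable utility, relative to doing nothing, across randomly drawn utility functions that have no connection to any “true” utility.

```python
import random

STATES = ("s0", "s1", "s2", "s3")

# Hypothetical toy world: each action leaves some set of end states reachable.
REACHABLE = {
    "noop": {"s0", "s1", "s2", "s3"},
    "tidy": {"s0", "s1", "s2", "s3"},  # reversible: no future is foreclosed
    "smash_vase": {"s3"},              # irreversible: most futures are lost
}
# Hypothetical task rewards; the irreversible action is the most tempting.
TASK_REWARD = {"noop": 0.0, "tidy": 1.0, "smash_vase": 1.5}


def random_utility(rng):
    """Draw a random auxiliary utility function over end states."""
    return {s: rng.random() for s in STATES}


def attainable(utility, action):
    """Best value of `utility` that is still attainable after `action`."""
    return max(utility[s] for s in REACHABLE[action])


def penalty(action, aux_utilities):
    """Mean absolute shift in attainable utility relative to the no-op."""
    shifts = [
        abs(attainable(u, action) - attainable(u, "noop"))
        for u in aux_utilities
    ]
    return sum(shifts) / len(shifts)


def choose(aux_utilities, lam=10.0):
    """Pick the action with the best penalized task reward."""
    return max(
        REACHABLE,
        key=lambda a: TASK_REWARD[a] - lam * penalty(a, aux_utilities),
    )


rng = random.Random(0)
aux = [random_utility(rng) for _ in range(20)]
print(choose(aux))
```

In this toy setup the reversible action incurs no penalty for any auxiliary function, while the irreversible one shifts attainable utility for most of them, so the penalized agent avoids it without ever being told what the true utility function is.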

While it’s true that some complex value loss is likely to occur when not considering an appropriate distribution over extremely complicated utility functions, it seems by and large negligible. This is because such loss occurs either as a continuation of the status quo or as a consequence of something objectively mild, which seems to correlate strongly with being mild by reasonable human values.

• Another con of the mo­ti­va­tion-com­pe­tence de­com­po­si­tion: un­like defi­ni­tion-op­ti­miza­tion, it doesn’t ac­tu­ally seem to be a clean de­com­po­si­tion of the larger task, such that we can solve each sub­task in­de­pen­dently and then com­bine the solu­tions.

For ex­am­ple one way we could solve the mo­ti­va­tion prob­lem is by build­ing a perfect hu­man imi­ta­tion (of some­one who re­ally wants to help H do what H wants), but then we seem to be stuck on the “com­pe­tence” front, and there’s no clear way to plug this solu­tion of “mo­ti­va­tion” into a bet­ter generic solu­tion to “com­pe­tence” to get a more com­pe­tent in­tent-al­igned agent. In­stead it seems like we have to solve the com­pe­tence prob­lem that is par­tic­u­lar to the spe­cific solu­tion to mo­ti­va­tion, or solve mo­ti­va­tion and com­pe­tence to­gether as one large prob­lem.

In contrast, the problem of specifying an aligned utility function and the problem of building safe EU maximizers seem to be naturally independent problems, such that once we have a specification of an aligned utility function (or a method of specifying aligned utility functions), we can just plug that into more and more powerful and robust EU maximizers.
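As a rough sketch of what “plugging in” means here, the optimizer below takes the utility function as an ordinary parameter, so the “definition” half and the “optimization” half can be swapped independently. Everything in it is an illustrative assumption: the names `eu_maximizer` and `toy_model`, the two actions, and the two example utility functions are hypothetical, and a real EU maximizer would of course not enumerate outcomes like this.

```python
from typing import Callable, Dict, Iterable

Utility = Callable[[str], float]            # outcome -> utility
Model = Callable[[str], Dict[str, float]]   # action -> {outcome: probability}


def expected_utility(action: str, model: Model, utility: Utility) -> float:
    """Expected utility of `action` under the outcome model."""
    return sum(p * utility(outcome) for outcome, p in model(action).items())


def eu_maximizer(actions: Iterable[str], model: Model, utility: Utility) -> str:
    """Generic optimizer: knows nothing about where `utility` came from."""
    return max(actions, key=lambda a: expected_utility(a, model, utility))


def toy_model(action: str) -> Dict[str, float]:
    """Hypothetical world model with one safe and one risky action."""
    return {
        "safe_bet": {"ok": 1.0},
        "gamble": {"great": 0.5, "bad": 0.5},
    }[action]


# Two hypothetical "definitions" plugged into the same optimizer.
def cautious(outcome: str) -> float:
    return {"ok": 1.0, "great": 1.5, "bad": -2.0}[outcome]


def bold(outcome: str) -> float:
    return {"ok": 1.0, "great": 3.0, "bad": 0.0}[outcome]


ACTIONS = ["safe_bet", "gamble"]
print(eu_maximizer(ACTIONS, toy_model, cautious))
print(eu_maximizer(ACTIONS, toy_model, bold))
```

The point of the sketch is only the interface: `eu_maximizer` could be replaced by a stronger optimizer without touching `cautious` or `bold`, which is the kind of independence the motivation-competence split seems to lack.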

Fur­ther­more I think this lack of clean de­com­po­si­tion shows up at the con­cep­tual level too, not just the prag­matic level. For ex­am­ple, sup­pose we tried to in­crease the com­pe­tence of the hu­man imi­ta­tion by com­bin­ing it with a su­per­in­tel­li­gent Or­a­cle, and it turns out the hu­man imi­ta­tion isn’t very care­ful and in most timelines de­stroys the world by ask­ing un­safe ques­tions that cause the Or­a­cle to perform ma­lign op­ti­miza­tions. Is this a failure of mo­ti­va­tion or a failure of com­pe­tence, or both? It seems ar­guable or hard to say. In con­trast, in a sys­tem that is built us­ing the defi­ni­tion-op­ti­miza­tion de­com­po­si­tion, it seems like it would be easy to trace any safety failures to ei­ther the “defi­ni­tion” solu­tion or the “op­ti­miza­tion” solu­tion.

• I over­all agree that this is a con. Cer­tainly there are AI sys­tems that are weak enough that you can’t talk co­her­ently about their “mo­ti­va­tion”. Prob­a­bly all deep-learn­ing-based sys­tems fall into this cat­e­gory.

I also agree that (at least for now, and prob­a­bly in the fu­ture as well) you can’t for­mally spec­ify the “type sig­na­ture” of mo­ti­va­tion such that you could sep­a­rately solve the com­pe­tence prob­lem with­out know­ing the de­tails of the solu­tion to the mo­ti­va­tion prob­lem.

My hope here would be to solve the mo­ti­va­tion prob­lem and leave the com­pe­tence prob­lem for later, since by my view that solves most of the prob­lem (I’m aware that you dis­agree with this).

I don’t agree that it’s not clean at the con­cep­tual level. It’s per­haps less clean than the defi­ni­tion-op­ti­miza­tion de­com­po­si­tion, but not much less.

For ex­am­ple, sup­pose we tried to in­crease the com­pe­tence of the hu­man imi­ta­tion by com­bin­ing it with a su­per­in­tel­li­gent Or­a­cle, and it turns out the hu­man imi­ta­tion isn’t very care­ful and in most timelines de­stroys the world by ask­ing un­safe ques­tions that cause the Or­a­cle to perform ma­lign op­ti­miza­tions. Is this a failure of mo­ti­va­tion or a failure of com­pe­tence, or both?

This seems pretty clearly like a failure of com­pe­tence to me, since the hu­man imi­ta­tion would (pre­sum­ably) say that they don’t want the world to be de­stroyed, and they (pre­sum­ably) did not pre­dict that that was what would hap­pen when they queried the or­a­cle.

• This seems pretty clearly like a failure of com­pe­tence to me, since the hu­man imi­ta­tion would (pre­sum­ably) say that they don’t want the world to be de­stroyed, and they (pre­sum­ably) did not pre­dict that that was what would hap­pen when they queried the or­a­cle.

It also seems like a failure of mo­ti­va­tion though, be­cause as soon as the Or­a­cle started to do ma­lign op­ti­miza­tion, the sys­tem as a whole is no longer try­ing to do what H wants.

Or is the idea that as long as the top-level or ini­tial op­ti­mizer is try­ing (or tried) to do what H wants, then all sub­se­quent failures of mo­ti­va­tion don’t count, so we’re ex­clud­ing prob­lems like in­ner al­ign­ment from mo­ti­va­tion /​ in­tent al­ign­ment?

I’m un­sure what your an­swer would be, and what Paul’s an­swer would be, and whether they would be the same, which at least sug­gests that the con­cepts haven’t been cleanly de­com­posed yet.

ETA: Or to put it another way, suppose AI safety researchers determined ahead of time what kinds of questions won’t cause the Oracle to perform malign optimizations. Would that not count as part of the solution to motivation / intent alignment of this system (i.e., combination of human imitation and Oracle)? It seems really counterintuitive if the answer is “no”.

• Oh, I see, you’re talk­ing about the sys­tem as a whole, whereas I was think­ing of the hu­man imi­ta­tion speci­fi­cally. That seems like a mul­ti­a­gent sys­tem and I wouldn’t ap­ply sin­gle-agent rea­son­ing to it, so I agree mo­ti­va­tion-com­pe­tence is not the right way to think about it (but if you in­sisted on it, I’d say it fails mo­ti­va­tion, mostly be­cause the sys­tem doesn’t re­ally have a sin­gle “mo­ti­va­tion”).

It doesn’t seem like the defi­ni­tion-op­ti­miza­tion de­com­po­si­tion helps ei­ther? I don’t know whether I’d call that a failure of defi­ni­tion or op­ti­miza­tion.

Or to put it another way, suppose AI safety researchers determined ahead of time what kinds of questions won’t cause the Oracle to perform malign optimizations. Would that not count as part of the solution to motivation / intent alignment of this system (i.e., combination of human imitation and Oracle)?

I would say the hu­man imi­ta­tion was in­tent al­igned, and this helped im­prove the com­pe­tence of the hu­man imi­ta­tion. I mostly wouldn’t ap­ply this frame­work to the sys­tem (and I also wouldn’t ap­ply defi­ni­tion-op­ti­miza­tion to the sys­tem).

• That seems like a mul­ti­a­gent sys­tem and I wouldn’t ap­ply sin­gle-agent rea­son­ing to it, so I agree mo­ti­va­tion-com­pe­tence is not the right way to think about it

This was an un­ex­pected an­swer. Isn’t HCH also such a mul­ti­a­gent sys­tem? (It seems very similar to what I de­scribed: a hu­man with ac­cess to a su­per­hu­man Or­a­cle, al­though HCH wasn’t what I ini­tially had in mind.) IDA should con­verge to HCH in the limit of in­finite com­pute and train­ing data, so this would seem to im­ply that the mo­ti­va­tion-com­pe­tence frame­work doesn’t ap­ply to IDA ei­ther. I’m pretty sure Paul would give a differ­ent an­swer, if we ask him about “in­tent al­ign­ment”.

It doesn’t seem like the defi­ni­tion-op­ti­miza­tion de­com­po­si­tion helps ei­ther? I don’t know whether I’d call that a failure of defi­ni­tion or op­ti­miza­tion.

It seems more ob­vi­ous that mul­ti­a­gent sys­tems just fall out­side of the defi­ni­tion-op­ti­miza­tion frame­work, which seems to be a point in its fa­vor as far as con­cep­tual clar­ity is con­cerned.

• I’m pretty sure Paul would give a differ­ent an­swer, if we ask him about “in­tent al­ign­ment”.

Yes, I’d say that to the ex­tent that “try­ing to do X” is a use­ful con­cept, it ap­plies to sys­tems with lots of agents just as well as it ap­plies to one agent.

Even a very the­o­ret­i­cally sim­ple sys­tem like AIXI doesn’t seem to be “try­ing” to do just one thing, in the sense that it can e.g. ex­ert con­sid­er­able op­ti­miza­tion power at things other than re­ward, even in cases where the sys­tem seems to “know” that its ac­tions won’t lead to re­ward.

You could say that AIXI is “op­ti­miz­ing” the right thing and just mess­ing up when it suffers in­ner al­ign­ment failures, but I’m not con­vinced that this di­vi­sion is ac­tu­ally do­ing much use­ful work. I think it’s mean­ingful to say “defin­ing what we want is use­ful,” but be­yond that it doesn’t seem like a work­able way to ac­tu­ally an­a­lyze the hard parts of al­ign­ment or di­vide up the prob­lem.

(For ex­am­ple, I think we can likely get OK defi­ni­tions of what we value, along the lines of A For­mal­iza­tion of Indi­rect Nor­ma­tivity, but I’ve mostly stopped work­ing along these lines be­cause it no longer seems di­rectly use­ful.)

It seems more obvious that multiagent systems just fall outside of the definition-optimization framework, which seems to be a point in its favor as far as conceptual clarity is concerned.

I agree.

Of course, it also seems quite likely that AIs of the kind that will prob­a­bly be built (“by de­fault”) also fall out­side of the defi­ni­tion-op­ti­miza­tion frame­work. So adopt­ing this frame­work as a way to an­a­lyze po­ten­tial al­igned AIs seems to amount to nar­row­ing the space con­sid­er­ably.

• Yes, I’d say that to the ex­tent that “try­ing to do X” is a use­ful con­cept, it ap­plies to sys­tems with lots of agents just as well as it ap­plies to one agent.

So how do you see it ap­ply­ing in my ex­am­ple? Would you say that the sys­tem in my ex­am­ple is both try­ing to do what H wants it to do, and also try­ing to do some­thing that H doesn’t want? Is it in­tent al­igned pe­riod, or in­tent al­igned at some points in time and not at oth­ers, or si­mul­ta­neously in­tent al­igned and not al­igned, or some­thing else? (I feel like we’ve had a similar dis­cus­sion be­fore and ei­ther it didn’t get re­solved or I didn’t un­der­stand your po­si­tion. I didn’t see a di­rect at­tempt to an­swer this in the com­ment I’m re­ply­ing to, and it’s fine if you don’t want to go down this road again but I want to con­vey my con­tinued con­fu­sion.)

You could say that AIXI is “op­ti­miz­ing” the right thing and just mess­ing up when it suffers in­ner al­ign­ment failures, but I’m not con­vinced that this di­vi­sion is ac­tu­ally do­ing much use­ful work. I think it’s mean­ingful to say “defin­ing what we want is use­ful,” but be­yond that it doesn’t seem like a work­able way to ac­tu­ally an­a­lyze the hard parts of al­ign­ment or di­vide up the prob­lem.

I don’t un­der­stand how this is con­nected to what I was say­ing. (In gen­eral I of­ten find it sig­nifi­cantly harder to un­der­stand your com­ments com­pared to say Ro­hin’s. Not nec­es­sar­ily say­ing you should do some­thing differ­ently, as you might already be mak­ing a difficult trade­off be­tween how much time to spend here and el­se­where, but just offer­ing feed­back in case you didn’t re­al­ize.)

Of course, it also seems quite likely that AIs of the kind that will prob­a­bly be built (“by de­fault”) also fall out­side of the defi­ni­tion-op­ti­miza­tion frame­work. So adopt­ing this frame­work as a way to an­a­lyze po­ten­tial al­igned AIs seems to amount to nar­row­ing the space con­sid­er­ably.

This makes sense.

• Would you say that the sys­tem in my ex­am­ple is both try­ing to do what H wants it to do, and also try­ing to do some­thing that H doesn’t want? Is it in­tent al­igned pe­riod, or in­tent al­igned at some points in time and not at oth­ers, or si­mul­ta­neously in­tent al­igned and not al­igned, or some­thing else?

The or­a­cle is not al­igned when asked ques­tions that cause it to do ma­lign op­ti­miza­tion.

The hu­man+or­a­cle sys­tem is not al­igned in situ­a­tions where the hu­man would pose such ques­tions.

For a coherent system (e.g. a multiagent system which has converged to a Pareto efficient compromise), it makes sense to talk about the one thing that it is trying to do.

For an in­co­her­ent sys­tem this ab­strac­tion may not make sense, and a sys­tem may be try­ing to do lots of things. I try to use be­nign when talk­ing about pos­si­bly-in­co­her­ent sys­tems, or things that don’t even re­sem­ble op­ti­miz­ers.

The defi­ni­tion in this post is a bit sloppy here, but I’m usu­ally imag­in­ing that we are build­ing roughly-co­her­ent AI sys­tems (and that if they are in­co­her­ent, some parts are ma­lign). If you wanted to be a bit more care­ful with the defi­ni­tion, and want to ad­mit vague­ness in “what H wants it to do” (such that there can be sev­eral differ­ent prefer­ences that are “what H wants”) we could say some­thing like:

A is al­igned with H if ev­ery­thing it is try­ing to do is “what H wants.”

That’s not great ei­ther though (and I think the origi­nal post is more at an ap­pro­pri­ate level of at­tempted-pre­ci­sion).

• (In the fol­low­ing I will also use “al­igned” to mean “in­tent al­igned”.)

The hu­man+or­a­cle sys­tem is not al­igned in situ­a­tions where the hu­man would pose such ques­tions.

Ok, sounds like “in­tent al­igned at some points in time and not at oth­ers” was the clos­est guess. To con­firm, would you en­dorse “the sys­tem was al­igned when the hu­man imi­ta­tion was still try­ing to figure out what ques­tions to ask the or­a­cle (since the sys­tem was still only try­ing to do what H wants), and then due to its own in­com­pe­tence be­came not al­igned when the or­a­cle started work­ing on the un­safe ques­tion”?

Given that intent alignment in this sense seems to be a property of a system+situation instead of the system itself, how would you define when the “intent alignment problem” has been solved for an AI, or when would you call an AI (such as IDA) itself “intent aligned”? (When we can reasonably expect to keep it out of situations where its alignment fails, for some reasonable amount of time, perhaps?) Or is it the case that whenever you use “intent alignment” you always have some specific situation or set of situations in mind?

• Fwiw hav­ing read this ex­change, I think I ap­prox­i­mately agree with Paul. Go­ing back to the origi­nal re­sponse to my com­ment:

Isn’t HCH also such a mul­ti­a­gent sys­tem?

Yes, I shouldn’t have made a cat­e­gor­i­cal state­ment about mul­ti­a­gent sys­tems. What I should have said was that the par­tic­u­lar mul­ti­a­gent sys­tem you pro­posed did not have a sin­gle thing it is “try­ing to do”, i.e. I wouldn’t say it has a sin­gle “mo­ti­va­tion”. This al­lows you to say “the sys­tem is not in­tent-al­igned”, even though you can’t say “the sys­tem is try­ing to do X”.

Another way of say­ing this is that it is an in­co­her­ent sys­tem and so the mo­ti­va­tion ab­strac­tion /​ mo­ti­va­tion-com­pe­tence de­com­po­si­tion doesn’t make sense, but HCH is one of the few mul­ti­a­gent sys­tems that is co­her­ent. (Idk if I be­lieve that claim, but it seems plau­si­ble.) This seems to map on to the state­ment:

For an in­co­her­ent sys­tem this ab­strac­tion may not make sense, and a sys­tem may be try­ing to do lots of things.

Also, I want to note strong agree­ment with this:

Of course, it also seems quite likely that AIs of the kind that will probably be built (“by default”) also fall outside of the definition-optimization framework. So adopting this framework as a way to analyze potential aligned AIs seems to amount to narrowing the space considerably.

• Another way of saying this is that it is an incoherent system and so the motivation abstraction / motivation-competence decomposition doesn’t make sense, but HCH is one of the few multiagent systems that is coherent.

HCH can be in­co­her­ent. I think one ex­am­ple that came up in an ear­lier dis­cus­sion was the top node in HCH try­ing to help the user by ask­ing (due to in­com­pe­tence /​ in­suffi­cient un­der­stand­ing of cor­rigi­bil­ity) “What is a good ap­prox­i­ma­tion of the user’s util­ity func­tion?” fol­lowed by “What ac­tion would max­i­mize EU ac­cord­ing to this util­ity func­tion?”

ETA: If this isn’t clearly incoherent, imagine that due to further incompetence, lower nodes work on subgoals in ways that conflict with each other.

• I do think that some term needs to re­fer to this prob­lem, to sep­a­rate it from other prob­lems like “un­der­stand­ing what hu­mans want,” “solv­ing philos­o­phy,” etc.

Worth not­ing here that (it looks like) Paul even­tu­ally set­tled upon “in­tent al­ign­ment” as the term for this.

• I think that us­ing a broader defi­ni­tion (or the de re read­ing) would also be defen­si­ble, but I like it less be­cause it in­cludes many sub­prob­lems that I think (a) are much less ur­gent, (b) are likely to in­volve to­tally differ­ent tech­niques than the ur­gent part of al­ign­ment.

I think it would be helpful for un­der­stand­ing your po­si­tion and what you mean by “AI al­ign­ment” to have a list or sum­mary of those other sub­prob­lems and why you think they’re much less ur­gent. Can you link to or give one here?

Also, do you have a preferred term for the broader definition, or the de re reading? What should we call those things if not “AI alignment”?

• I think it would be helpful for un­der­stand­ing your po­si­tion and what you mean by “AI al­ign­ment” to have a list or sum­mary of those other sub­prob­lems and why you think they’re much less ur­gent. Can you link to or give one here?

Other problems related to alignment, which would be included by the broadest definition of “everything related to making the future good”:

• We face a bunch of prob­lems other than AI al­ign­ment (e.g. other de­struc­tive tech­nolo­gies, risk of value drift), and de­pend­ing on the com­pe­ten­cies of our AI sys­tems they may be bet­ter or worse than hu­mans at helping han­dle those prob­lems (rel­a­tive to ac­cel­er­at­ing the kinds of progress that force us to con­front those prob­lems). So we’d like AI to be bet­ter at (helping us with) {diplo­macy, re­flec­tion, in­sti­tu­tion de­sign, philos­o­phy...} rel­a­tive to {phys­i­cal tech­nol­ogy, so­cial ma­nipu­la­tion, lo­gis­tics...}

• Beyond al­ign­ment, AI may provide new ad­van­tages to ac­tors who are able to make their val­ues more ex­plicit, or who have ex­plicit norms for bar­gain­ing/​ag­gre­ga­tion, and so we may want to figure out how to make more things more ex­plicit.

• AI could fa­cil­i­tate so­cial con­trol, ma­nipu­la­tion, or lock-in, which may make it more im­por­tant for us to have more ro­bust or rapid forms of de­liber­a­tion (that are ro­bust to con­trol/​ma­nipu­la­tion, or that can run their course fast enough to pre­vent some­one from mak­ing a mis­take). This also may in­crease the in­cen­tives for or­di­nary con­flict amongst ac­tors with differ­ing long-term val­ues.

• AI will tend to em­power groups with few peo­ple (but lots of re­sources), mak­ing it eas­ier for some­one to de­stroy the world and so re­quiring stronger en­force­ment/​sta­bi­liza­tion.

• AI may be an unusually good opportunity for world stabilization, e.g. because it’s associated with a disruptive transition, in which case someone may want to take that opportunity. (Though I’m concerned about this because, in light of disagreement/conflict about stabilization itself, someone attempting to do this or being expected to attempt to do this could undermine our ability to solve alignment.)

That’s a very par­tial list. This is for the broad­est defi­ni­tion of “ev­ery­thing about AI that is rele­vant to mak­ing the fu­ture good,” which I don’t think is par­tic­u­larly defen­si­ble. I’d say the first three could be in­cluded in defen­si­ble defi­ni­tions of al­ign­ment, and there are plenty of oth­ers.

My ba­sic po­si­tion on most of these prob­lems is: “they are fine prob­lems and you might want to work on them, but if some­one is go­ing to claim they are im­por­tant they need to give a sep­a­rate ar­gu­ment, it’s not at all im­plied by the nor­mal ar­gu­ment for the im­por­tance of al­ign­ment.” I can ex­plain in par­tic­u­lar cases why I think other prob­lems are less im­por­tant, and I feel like we’ve had a lot of back and forth on some of these, but the only gen­eral ar­gu­ment is that I think there are strong rea­sons to care about al­ign­ment in par­tic­u­lar that don’t ex­tend to these other prob­lems (namely, a failure to solve al­ign­ment has pre­dictable re­ally bad con­se­quences in the short term, and cur­rently it looks very tractable in ex­pec­ta­tion).

Also, do you have a preferred term for the broader defi­ni­tion, or the de re read­ing? What should we call those things if not “AI al­ign­ment”?

Which broader defi­ni­tion? There are tons of pos­si­bil­ities. I think the one given in this post is the clos­est to a co­her­ent defi­ni­tion that matches ex­ist­ing us­age.

The other common definition seems to be more along the lines of “everything related to making AI go well,” which I don’t think really deserves a word; just call that “AI trajectory change” if you want to distinguish it from “AI speedup”, or “pro-social AI” if you want to distinguish from “AI as an intellectual curiosity,” or just “AI” if you don’t care about those distinctions.

For the de re read­ing, I don’t see much mo­tive to lump the com­pe­tence and al­ign­ment parts of the prob­lem into a sin­gle head­ing, I would just call them “al­ign­ment” and “value learn­ing” sep­a­rately. But I can see how this might seem like a value judg­ment, since some­one who thought that these two prob­lems were the very most im­por­tant prob­lems might want to put them un­der a sin­gle head­ing even if they didn’t think there would be par­tic­u­lar tech­ni­cal over­lap.

(ETA: I’d also be OK with say­ing “de dicto al­ign­ment” or “de re al­ign­ment,” since they re­ally are just im­por­tantly differ­ent con­cepts both of which are used rel­a­tively fre­quently—there is a big differ­ence be­tween an em­ployee who de dicto wants the same things their boss wants, and an em­ployee who de re wants to help their boss get what they want, those feel like two species of al­ign­ment.)

• Is there a concept of a safe partially aligned AI? Where it recognizes its own limitations of understanding of the human[-ity] and limits its actions to what it knows is within those limits with high probability?

• I’m not tech-savvy and am well aware that maybe it’s a lack of understanding that lets me live without fear of AI, but it seems an important issue round here and I would like to have some understanding. And a little understanding of my perspective: I grew up in the shadow of the Cold War, i.e. mutually assured destruction in 6 minutes or less (it might have been 12 minutes; I can’t quite remember anymore).

This post caught my eye on the re­view list.

I need to clar­ify some­thing be­fore read­ing for­ward.

get­ting your AI to try to do the right thing,

Would ‘getting your AI to try to do the WANTED thing’ be the accurate wording?

The us­age of “right” adds a di­men­sion of moral­ity in my mind that doesn’t come with “want”.

• Yeah, it’s not meant to add that di­men­sion of moral­ity.

Per­haps it should be “get­ting your AI to try to help you”. Try­ing to do the “wanted” thing is also rea­son­able.

• Are there any plans to gen­er­al­ize this kind of al­ign­ment later to in­clude CEV or some other plau­si­ble metaethics, or should this be “the fi­nal stop”?