# AlphaGo Zero and capability amplification

AlphaGo Zero is an im­pres­sive demon­stra­tion of AI ca­pa­bil­ities. It also hap­pens to be a nice proof-of-con­cept of a promis­ing al­ign­ment strat­egy.

## How AlphaGo Zero works

AlphaGo Zero learns two func­tions (which take as in­put the cur­rent board):

• A prior over moves p is trained to pre­dict what AlphaGo will even­tu­ally de­cide to do.

• A value function v is trained to predict which player will win (if AlphaGo plays both sides).

Both are trained with supervised learning. Once we have these two functions, AlphaGo actually picks its moves by running 1600 steps of Monte Carlo tree search (MCTS), using p and v to guide the search. It trains p to bypass this expensive search process and directly pick good moves. As p improves, the expensive search becomes more powerful, and p chases this moving target.
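To make "using p and v to guide the search" a little more concrete, here is a minimal sketch of one selection step inside the search, plus the target that p is trained on. It assumes a PUCT-style rule of the general form described in the AlphaGo Zero paper; the constant `c_puct`, the `+ 1` under the square root, and the names `puct_select` and `search_policy` are illustrative assumptions, not the actual implementation. The `value_sum` entries stand for the v estimates accumulated below each move during earlier simulations.

```python
import math

def puct_select(prior, visits, value_sum, c_puct=1.5):
    """Pick the move maximising Q + U.

    prior:     move -> probability from p (the learned prior)
    visits:    move -> number of simulations that went through the move
    value_sum: move -> sum of v estimates backed up through the move
    c_puct is an assumed exploration constant.
    """
    total = sum(visits.values())

    def score(move):
        n = visits[move]
        q = value_sum[move] / n if n else 0.0  # mean value seen by the search so far
        # Exploration bonus: large for high-prior, rarely visited moves.
        u = c_puct * prior[move] * math.sqrt(total + 1) / (1 + n)
        return q + u

    return max(prior, key=score)

def search_policy(visits):
    """Normalised visit counts: the search result that p is trained to predict."""
    total = sum(visits.values())
    return {move: n / total for move, n in visits.items()}

# Usage: with no visits yet, the prior decides; after "a" looks bad, search moves on.
prior = {"a": 0.7, "b": 0.3}
first = puct_select(prior, {"a": 0, "b": 0}, {"a": 0.0, "b": 0.0})
later = puct_select(prior, {"a": 10, "b": 0}, {"a": -5.0, "b": 0.0})
```

The design point is that p only biases where the search looks, while the final move distribution comes from the visit counts, which is why the search can end up stronger than p alone.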

## Iter­ated ca­pa­bil­ity amplification

In the sim­plest form of iter­ated ca­pa­bil­ity am­plifi­ca­tion, we train one func­tion:

• A “weak” policy A, which is trained to pre­dict what the agent will even­tu­ally de­cide to do in a given situ­a­tion.

Just like AlphaGo doesn’t use the prior p di­rectly to pick moves, we don’t use the weak policy A di­rectly to pick ac­tions. In­stead, we use a ca­pa­bil­ity am­plifi­ca­tion scheme: we call A many times in or­der to pro­duce more in­tel­li­gent judg­ments. We train A to by­pass this ex­pen­sive am­plifi­ca­tion pro­cess and di­rectly make in­tel­li­gent de­ci­sions. As A im­proves, the am­plified policy be­comes more pow­er­ful, and A chases this mov­ing tar­get.

In the case of AlphaGo Zero, A is the prior over moves, and the am­plifi­ca­tion scheme is MCTS. (More pre­cisely: A is the pair (p, v), and the am­plifi­ca­tion scheme is MCTS + us­ing a rol­lout to see who wins.)

Out­side of Go, A might be a ques­tion-an­swer­ing sys­tem, which can be ap­plied sev­eral times in or­der to first break a ques­tion down into pieces and then sep­a­rately an­swer each com­po­nent. Or it might be a policy that up­dates a cog­ni­tive workspace, which can be ap­plied many times in or­der to “think longer” about an is­sue.
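As a toy illustration of the question-answering case, the sketch below uses "sum this tuple of numbers" as the question, a lookup table as the weak policy A, splitting the question in half as the amplification scheme, and memorisation as a stand-in for supervised training. Everything here (the task, the table representation, and the names `weak_answer`, `amplify`, `distill`, `iterate`) is a hypothetical simplification chosen to keep the loop visible, not a proposal for how a real system would work.

```python
from typing import Dict, Optional, Tuple

Question = Tuple[int, ...]  # a tuple of numbers whose sum we want

def weak_answer(A: Dict[Question, int], q: Question) -> Optional[int]:
    """The fast policy A: answers trivial questions, otherwise looks up a memorised answer."""
    if len(q) <= 1:
        return sum(q)
    return A.get(q)  # None when A has not learned this question yet

def amplify(A: Dict[Question, int], q: Question) -> Optional[int]:
    """The slow policy: break the question in two and call A on each piece."""
    direct = weak_answer(A, q)
    if direct is not None:
        return direct
    mid = len(q) // 2
    left, right = weak_answer(A, q[:mid]), weak_answer(A, q[mid:])
    if left is None or right is None:
        return None  # the pieces are still too hard for the current A
    return left + right

def distill(A: Dict[Question, int], questions) -> Dict[Question, int]:
    """'Train' A to imitate the amplified policy (here: memorise its outputs)."""
    new_A = dict(A)
    for q in questions:
        answer = amplify(A, q)
        if answer is not None:
            new_A[q] = answer
    return new_A

def iterate(numbers: Question, rounds: int) -> Dict[Question, int]:
    """Alternate amplification and distillation over all contiguous sub-questions."""
    n = len(numbers)
    questions = [numbers[i:j] for i in range(n) for j in range(i + 1, n + 1)]
    A: Dict[Question, int] = {}
    for _ in range(rounds):
        A = distill(A, questions)
    return A

full = (1, 2, 3, 4, 5, 6, 7, 8)
after_two = weak_answer(iterate(full, 2), full)    # still out of reach
after_three = weak_answer(iterate(full, 3), full)  # now answered directly by A
```

Each round roughly doubles the length of question that A can answer without any decomposition, which is the "A chases a moving target" dynamic in miniature.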

## The significance

Re­in­force­ment learn­ers take a re­ward func­tion and op­ti­mize it; un­for­tu­nately, it’s not clear where to get a re­ward func­tion that faith­fully tracks what we care about. That’s a key source of safety con­cerns.

By con­trast, AlphaGo Zero takes a policy-im­prove­ment-op­er­a­tor (like MCTS) and con­verges to­wards a fixed point of that op­er­a­tor. If we can find a way to im­prove a policy while pre­serv­ing its al­ign­ment, then we can ap­ply the same al­gorithm in or­der to get very pow­er­ful but al­igned strate­gies.
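One way to write this down, using notation that is introduced here rather than taken from the post: let Amplify be the policy-improvement operator (MCTS guided by the current policy, in AlphaGo Zero's case) and Distill be the supervised training step. The training loop and the fixed point it aims for are then, roughly:

```latex
A_{n+1} \approx \mathrm{Distill}\bigl(\mathrm{Amplify}(A_n)\bigr),
\qquad
A^{*} \approx \mathrm{Amplify}(A^{*})
```

The approximate equalities are deliberate: distillation is imperfect, and nothing in this notation guarantees that the iteration converges.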

Us­ing MCTS to achieve a sim­ple goal in the real world wouldn’t pre­serve al­ign­ment, so it doesn’t fit the bill. But “think longer” might. As long as we start with a policy that is close enough to be­ing al­igned — a policy that “wants” to be al­igned, in some sense — al­low­ing it to think longer may make it both smarter and more al­igned.

I think de­sign­ing al­ign­ment-pre­serv­ing policy am­plifi­ca­tion is a tractable prob­lem to­day, which can be stud­ied ei­ther in the con­text of ex­ist­ing ML or hu­man co­or­di­na­tion. So I think it’s an ex­cit­ing di­rec­tion in AI al­ign­ment. A can­di­date solu­tion could be in­cor­po­rated di­rectly into the AlphaGo Zero ar­chi­tec­ture, so we can already get em­piri­cal feed­back on what works. If by good for­tune pow­er­ful AI sys­tems look like AlphaGo Zero, then that might get us much of the way to an al­igned AI.

This is to­tally wild spec­u­la­tion, but the thought oc­curred to me whether the hu­man brain might be do­ing some­thing like this with iden­tities and so­cial roles:

A lot of (but not all) peo­ple get a strong hit of this when they go back to visit their fam­ily. If you move away and then make new friends and sort of be­come a new per­son (!), you might at first think this is just who you are now. But then you visit your par­ents… and sud­denly you feel and act a lot like you did be­fore you moved away. You might even try to hold onto this “new you” with them… and they might re­spond to what they see as strange be­hav­ior by try­ing to nudge you into act­ing “nor­mal”: ig­nor­ing sur­pris­ing things you say, chang­ing the topic to some­thing fa­mil­iar, start­ing an old fight, etc. [...]
For in­stance, the stereo­typ­i­cal story of the wor­ried nag­ging wife con­fronting the emo­tion­ally dis­tant hus­band as he comes home re­ally late from work… is ac­tu­ally a pretty good car­i­ca­ture of a script that lots of cou­ples play out, as long as you know to ig­nore the gen­der and class as­sump­tions em­bed­ded in it.
But it’s hard to sort this out with­out just en­act­ing our scripts. The ver­sion of you that would be think­ing about it is your char­ac­ter, which (in this frame­work) can ac­cu­rately un­der­stand its own role only if it has enough slack to be­come genre-savvy within the web; oth­er­wise it just keeps play­ing out its role. In the hus­band/​wife script men­tioned above, there’s a ten­dency for the “wife” to get ex­cited when “she” learns about the re­la­tion­ship script, be­cause it looks to “her” like it sug­gests how to save the re­la­tion­ship — which is “her” en­act­ing “her” role. This of­ten ag­gra­vates the fears of the “hus­band”, caus­ing “him” to pull away and act dis­mis­sive of the script’s rele­vance (which is “his” role), driv­ing “her” to in­sist that they just need to talk about this… which is the same pat­tern they were in be­fore. They try to be­come genre-savvy, but there (usu­ally) just isn’t enough slack be­tween them, so the effort merely changes the topic while they play out their usual scene.

If you squint, you could kind of in­ter­pret this kind of a dy­namic to be a re­sult of the hu­man brain try­ing to pre­dict what it ex­pects it­self to do next, us­ing that pre­dic­tion to guide the search of next ac­tions, and then end­ing up with next ac­tions that have a strong struc­tural re­sem­blance to its pre­vi­ous ones. (Though I can also think of maybe bet­ter-fit­ting mod­els of this too; still, seemed worth throw­ing out.)

• Us­ing MCTS to achieve a sim­ple goal in the real world wouldn’t pre­serve al­ign­ment, so it doesn’t fit the bill.

Also, an arbitrary supervised learning step that updates p and v is not safe. Generally, making that Distill step safe seems to me like the hardest challenge of the iterated capability amplification approach. Are there already research directions for tackling that challenge? (If I understand correctly, your recent paper did not focus on it.)

• Thank you.

I see how the di­rec­tions pro­posed there (ad­ver­sar­ial train­ing, ver­ifi­ca­tion, trans­parency) can be use­ful for cre­at­ing al­igned sys­tems. But if we use a Distill step that can be trusted to be safe via one or more of those ap­proaches, I find it im­plau­si­ble that Am­plifi­ca­tion would yield sys­tems that are com­pet­i­tive rel­a­tive to the most pow­er­ful ones cre­ated by other ac­tors around the same time (i.e. ac­tors that cre­ate AI sys­tems with­out any safety-mo­ti­vated re­stric­tions on the model space and search al­gorithm).

• Paul’s po­si­tion in that post was:

All of these ap­proaches feel very difficult, but I don’t think we’ve run into con­vinc­ing deal-break­ers.

I think this is meant to in­clude the difficulty of mak­ing them com­pet­i­tive with un­al­igned ML, since that has been his stated goal. If you can ar­gue that we should be even more pes­simistic than this, I’m sure a lot of peo­ple would find that in­ter­est­ing.

• In this 2017 post about Am­plifi­ca­tion (linked from OP) Paul wrote: “I think there is a very good chance, per­haps as high as 50%, that this ba­sic strat­egy can even­tu­ally be used to train be­nign state-of-the-art model-free RL agents.”

The post you linked to is more re­cent, so ei­ther the quote in your com­ment re­flects an up­date or Paul has other in­sights/​es­ti­mates about safe Distill steps.

BTW, I think Am­plifi­ca­tion might cur­rently be the most promis­ing ap­proach for cre­at­ing al­igned and pow­er­ful sys­tems; what I ar­gue is that in or­der to save the world it will prob­a­bly need to be com­ple­mented with gov­er­nance solu­tions.

• BTW, I think Am­plifi­ca­tion might cur­rently be the most promis­ing ap­proach for cre­at­ing al­igned and pow­er­ful sys­tems; what I ar­gue is that in or­der to save the world it will prob­a­bly need to be com­ple­mented with gov­er­nance solu­tions.

How un­com­pet­i­tive do you think al­igned IDA agents will be rel­a­tive to un­al­igned agents, and what kinds of gov­er­nance solu­tions do you think that would call for? Also, I should have made this clearer last time, but I’d be in­ter­ested to hear more about why you think Distill prob­a­bly can’t be made both safe and com­pet­i­tive, re­gard­less of whether you’re more or less op­ti­mistic than Paul.

• How un­com­pet­i­tive do you think al­igned IDA agents will be rel­a­tive to un­al­igned agents

For the sake of this es­ti­mate I’m us­ing a defi­ni­tion of IDA that is prob­a­bly nar­rower than what Paul has in mind: in the defi­ni­tion I use here, the Distill steps are car­ried out by noth­ing other than su­per­vised learn­ing + what it takes to make that su­per­vised learn­ing safe (but the im­ple­men­ta­tion of the Distill steps may be im­proved dur­ing the Am­plify steps).

This nar­row defi­ni­tion might not in­clude the most promis­ing fu­ture di­rec­tions of IDA (e.g. maybe the Distill steps should be car­ried out by some other pro­cess that in­volves hu­mans). Without this sim­plify­ing as­sump­tion, one might define IDA as broadly as: “iter­a­tively cre­ate stronger and stronger safe AI sys­tems by us­ing all the re­sources and tools that you cur­rently have”. Car­ry­ing out that Broad IDA ap­proach might in­clude efforts like ask­ing AI al­ign­ment re­searchers to get into a room with a white­board and come up with ideas for new ap­proaches.

Therefore, this estimate uses my narrow definition of IDA. If you like, I can also answer the general question: “How uncompetitive do you think aligned agents will be relative to unaligned agents?”

My es­ti­mate:

Suppose it is the case that if OpenAI decided to create an AGI agent as soon as they could, it would have taken them X years (assuming an annual budget of $10M and that the world around them stays the same, and OpenAI doesn’t do neuroscience, and no unintentional disasters happen).

Now suppose that OpenAI decided to create an aligned IDA agent with AGI capabilities as soon as they could (same conditions). How much time would it take them?

My estimate follows; each entry is in the format:

[years]: [my credence that it would take them at most that many years]

(consider writing down your own credences before looking at mine)

1.0X: 0.1%
1.1X: 3%
1.2X: 3%
1.5X: 4%
2X: 5%
5X: 10%
10X: 30%
100X: 60%

• Generally, I don’t see why we should expect that the most capable systems that can be created with supervised learning (e.g. by using RL to search over an arbitrary space of NN architectures) would perform similarly to the most capable systems that can be created, at around the same time, using some restricted supervised learning that humans must trust to be safe. My prior is that the former is very likely to outperform by a lot, and I’m not aware of strong evidence pointing one way or another.

So for example, I expect that an aligned IDA agent will be outperformed by an agent that was created by that same IDA framework when replacing the most capable safe supervised learning in the Distill steps with the most capable unrestricted supervised learning available at around the same time.

How uncompetitive do you think aligned IDA agents will be relative to unaligned agents

I think they will probably be uncompetitive enough to make some complementary governance solutions necessary (this line replaced an attempt for a quantitative answer which turned out long; let me know if you want it).

what kinds of governance solutions do you think that would call for?

I’m very uncertain. It might be the case that our world must stop being a place in which anyone with $10M can purchase millions of GPU hours. I’m aware that most people in the AI safety community are extremely skeptical about governments carrying out “stabilization” efforts etcetera. I suspect this common view fails to account for likely pivotal events (e.g. some advances in narrow AI that might suddenly allow anyone with sufficient computation power to carry out large scale terror attacks). I think Allan Dafoe’s research agenda for AI Governance is an extremely important and neglected landscape that we (the AI safety community) should be looking at to improve our predictions and strategies.
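Since the credences in the estimate above are cumulative ("at most that many years"), it can help to convert them into the probability assigned to each interval. The short sketch below does only that arithmetic; the numbers are copied from the list in the comment, and the function name `interval_masses` is just an illustrative choice.

```python
# (multiple of X, credence that the aligned IDA agent takes at most that long)
cumulative = [(1.0, 0.001), (1.1, 0.03), (1.2, 0.03), (1.5, 0.04),
              (2.0, 0.05), (5.0, 0.10), (10.0, 0.30), (100.0, 0.60)]

def interval_masses(cdf):
    """Turn cumulative credences into the credence assigned to each interval."""
    masses, prev_mult, prev_p = [], None, 0.0
    for mult, p in cdf:
        label = f"<= {mult}X" if prev_mult is None else f"({prev_mult}X, {mult}X]"
        masses.append((label, round(p - prev_p, 6)))
        prev_mult, prev_p = mult, p
    # Whatever is left over is the credence that it takes longer than the last entry.
    masses.append((f"> {prev_mult}X", round(1.0 - prev_p, 6)))
    return masses

masses = interval_masses(cumulative)
for label, mass in masses:
    print(f"{label}: {mass:.1%}")
```

Read this way, the list implies a 40% credence that the aligned agent would take more than 100X as long, and only a 5% credence that it would take at most 2X.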

• Gen­er­ally, I don’t see why we should ex­pect that the most ca­pa­ble sys­tems that can be cre­ated with su­per­vised learn­ing (e.g. by us­ing RL to search over an ar­bi­trary space of NN ar­chi­tec­tures) would perform similarly to the most ca­pa­ble sys­tems that can be cre­ated, at around the same time, us­ing some re­stricted su­per­vised learn­ing that hu­mans must trust to be safe. My prior is that the former is very likely to out­perform by a lot, and I’m not aware of strong ev­i­dence point­ing one way or an­other.

This seems similar to my view, which is that if you try to op­ti­mize for just one thing (effi­ciency) you’re prob­a­bly go­ing to end up with more of that thing than if you try to op­ti­mize for two things at the same time (effi­ciency and safety) or if you try to op­ti­mize for that thing un­der a heavy con­straint (i.e., safety).

But there are peo­ple (like Paul) who seem to be more op­ti­mistic than this based on more de­tailed in­side-view in­tu­itions, which makes me won­der if I should defer to them. If the an­swer is no, there’s also the ques­tion of how do we make policy mak­ers take this prob­lem se­ri­ously (i.e., that safe AI prob­a­bly won’t be as effi­cient as un­safe AI) given the ex­is­tence of more op­ti­mistic AI safety re­searchers, so that they’d be will­ing to un­der­take costly prepa­ra­tions for gov­er­nance solu­tions ahead of time. By the time we get con­clu­sive ev­i­dence one way or an­other, it may be too late to make such prepa­ra­tions.

• If the an­swer is no, there’s also the ques­tion of how do we make policy mak­ers take this prob­lem se­ri­ously (i.e., that safe AI prob­a­bly won’t be as effi­cient as un­safe AI) given the ex­is­tence of more op­ti­mistic AI safety re­searchers (so that they’d be will­ing to un­der­take costly prepa­ra­tions for gov­er­nance solu­tions ahead of time).

I’m not aware of any AI safety re­searchers that are ex­tremely op­ti­mistic about solv­ing al­ign­ment com­pet­i­tively. I think most of them are just skep­ti­cal about the fea­si­bil­ity of gov­er­nance solu­tions, or think gov­er­nance re­lated in­ter­ven­tions might be nec­es­sary but shouldn’t be car­ried out yet.

In this 80,000 Hours pod­cast epi­sode, Paul said the fol­low­ing:

In terms of the ac­tual value of work­ing on AI safety, I think the biggest con­cern is this, “Is this an easy prob­lem that will get solved any­way?” Maybe the sec­ond biggest con­cern is, “Is this a prob­lem that’s so difficult that one shouldn’t bother work­ing on it or one should be as­sum­ing that we need some other ap­proach?” You could imag­ine, the tech­ni­cal prob­lem is hard enough that al­most all the bang is go­ing to come from policy solu­tions rather than from tech­ni­cal solu­tions.
And you could imag­ine, those two con­cerns maybe sound con­tra­dic­tory, but aren’t nec­es­sar­ily con­tra­dic­tory, be­cause you could say, “We have some un­cer­tainty about this pa­ram­e­ter of how hard this prob­lem is.” Either it’s go­ing to be easy enough that it’s solved any­way, or it’s go­ing to be hard enough that work­ing on it now isn’t go­ing to help that much and so what mostly mat­ters is get­ting our policy re­sponse in or­der. I think I don’t find that com­pel­ling, in part be­cause one, I think the sig­nifi­cant prob­a­bil­ity on the range … like the place in be­tween those, and two, I just think work­ing on this prob­lem ear­lier will tell us what’s go­ing on. If we’re in the world where you need a re­ally dras­tic policy re­sponse to cope with this prob­lem, then you want to know that as soon as pos­si­ble.
It’s not a good move to be like, “We’re not going to work on this problem because if it’s serious, we’re going to have a dramatic policy response.” Because you want to work on it earlier, discover that it seems really hard and then have significantly more motivation for trying the kind of coordination you’d need to get around it.

• I’m not aware of any AI safety researchers that are extremely optimistic about solving alignment competitively.

I’m not sure what you’d con­sider “ex­tremely” op­ti­mistic, but I gath­ered some quan­ti­ta­tive es­ti­mates of AI risk here, and they all seem overly op­ti­mistic to me. Did you see that?

Paul: I just think work­ing on this prob­lem ear­lier will tell us what’s go­ing on. If we’re in the world where you need a re­ally dras­tic policy re­sponse to cope with this prob­lem, then you want to know that as soon as pos­si­ble.

I agree with this mo­ti­va­tion to do early work, but in a world where we do need dras­tic policy re­sponses, I think it’s pretty likely that the early work won’t ac­tu­ally pro­duce con­clu­sive enough re­sults to show that. For ex­am­ple, if a safety ap­proach fails to make much progress, there’s not re­ally a good way to tell if it’s be­cause safe and com­pet­i­tive AI re­ally is just too hard (and there­fore we need a dras­tic policy re­sponse), or be­cause the ap­proach is wrong, or the peo­ple work­ing on it aren’t smart enough, or they’re try­ing to do the work too early. Peo­ple who are in­clined to be op­ti­mistic will prob­a­bly re­main so un­til it’s too late.

• but I gath­ered some quan­ti­ta­tive es­ti­mates of AI risk here, and they all seem overly op­ti­mistic to me. Did you see that?

I only now read that thread. I think it is ex­tremely worth­while to gather such es­ti­mates.

I think all the three es­ti­mates men­tioned there cor­re­spond to marginal prob­a­bil­ities (rather than prob­a­bil­ities con­di­tioned on “no gov­er­nance in­ter­ven­tions”). So those es­ti­mates already ac­count for sce­nar­ios in which gov­er­nance in­ter­ven­tions save the world. There­fore, it seems we should not strongly up­date against the ne­ces­sity of gov­er­nance in­ter­ven­tions due to those es­ti­mates be­ing op­ti­mistic.

Maybe we should gather re­searchers’ cre­dences for pre­dic­tions like:
“If there are no governance interventions, competitive aligned AIs will exist 10 years from now.”

I suspect that gathering such estimates from publicly available information might expose us to a selection bias, because very pessimistic estimates might be outside the Overton window (even for the EA/AIS crowd). For example, if Robert Wiblin had concluded that an AI existential catastrophe is 50% likely, I’m not sure that the 80,000 Hours website (which targets a large and motivationally diverse audience) would have published that estimate.

I agree with this mo­ti­va­tion to do early work, but in a world where we do need dras­tic policy re­sponses, I think it’s pretty likely that the early work won’t ac­tu­ally pro­duce con­clu­sive enough re­sults to show that. For ex­am­ple, if a safety ap­proach fails to make much progress, there’s not re­ally a good way to tell if it’s be­cause safe and com­pet­i­tive AI re­ally is just too hard (and there­fore we need a dras­tic policy re­sponse), or be­cause the ap­proach is wrong, or the peo­ple work­ing on it aren’t smart enough, or they’re try­ing to do the work too early.

I strongly agree with all of this.

• I think all the three es­ti­mates men­tioned there cor­re­spond to marginal prob­a­bil­ities (rather than prob­a­bil­ities con­di­tioned on “no gov­er­nance in­ter­ven­tions”). So those es­ti­mates already ac­count for sce­nar­ios in which gov­er­nance in­ter­ven­tions save the world. There­fore, it seems we should not strongly up­date against the ne­ces­sity of gov­er­nance in­ter­ven­tions due to those es­ti­mates be­ing optimistic

I nor­mally give ~50% as my prob­a­bil­ity we’d be fine with­out any kind of co­or­di­na­tion.

• Upvoted for giv­ing this num­ber, but what does it mean ex­actly? You ex­pect “50% fine” through all kinds of x-risk, as­sum­ing no co­or­di­na­tion from now un­til the end of the uni­verse? Or just as­sum­ing no co­or­di­na­tion un­til AGI? Is it just AI risk in­stead of all x-risk, or just risk from nar­row AI al­ign­ment? If “AI risk”, are you in­clud­ing risks from AI ex­ac­er­bat­ing hu­man safety prob­lems, or AI differ­en­tially ac­cel­er­at­ing dan­ger­ous tech­nolo­gies? Is it 50% prob­a­bil­ity that hu­man­ity sur­vives (which might be “fine” to some peo­ple) or 50% that we end up with a nearly op­ti­mal uni­verse? Do you have a doc­u­ment that gives all of your quan­ti­ta­tive risk es­ti­mates with clear ex­pla­na­tions of what they mean?

(Sorry to put you on the spot here when I haven’t pro­duced any­thing like that my­self, but I just want to con­vey how con­fus­ing all this is.)

• MCTS works as am­plifi­ca­tion be­cause you can eval­u­ate fu­ture board po­si­tions to get a con­ver­gent es­ti­mate of how well you’re do­ing—and then even­tu­ally some­one ac­tu­ally wins the game, which keeps p from de­part­ing re­al­ity en­tirely. Im­por­tantly, the sin­gle thing you’re learn­ing can play the role of the en­vi­ron­ment, too, by pick­ing the op­po­nents’ moves.

In try­ing to train A to pre­dict hu­man ac­tions given ac­cess to A, you’re al­most do­ing some­thing similar. You have a pre­dic­tion that’s also sup­posed to be a pre­dic­tion of the en­vi­ron­ment (the hu­man), so you can use it for both sides of a tree search. But A isn’t ac­tu­ally search­ing through an in­ter­est­ing tree—it’s search­ing for cy­cles of length 1 in its own model of the en­vi­ron­ment, with no par­tic­u­lar guaran­tee that any cy­cles of length 1 ex­ist or are a good idea. “Tree search” in this con­text (I think) means spray­ing out a bunch of out­puts and hop­ing at least one falls into a fixed point upon iter­a­tion.

EDIT: Big oops, I didn’t ac­tu­ally un­der­stand what was be­ing talked about here.

• I agree there is a real sense in which AGZ is “bet­ter-grounded” (and more likely to be sta­ble) than iter­ated am­plifi­ca­tion in gen­eral. (This was some of the mo­ti­va­tion for the ex­per­i­ments here.)

• Oh, I’ve just re­al­ized that the “tree” was always in­tended to be some­thing like task de­com­po­si­tion. Sorry about that—that makes the anal­ogy a lot tighter.

• Isn’t A also grounded in re­al­ity by even­tu­ally giv­ing no A to con­sult with?

• This is true when get­ting train­ing data, but I think it’s a differ­ence be­tween A (or HCH) and AlphaGo Zero when do­ing simu­la­tion /​ am­plifi­ca­tion. Some­one wins a simu­lated game of Go even if both play­ers are mak­ing bad moves (or even ran­dom moves), which gives you a sig­nal that A doesn’t have ac­cess to.

• I don’t sup­pose you could ex­plain how it uses P and V? Does it use P to de­cide which path to go down and V to avoid fully play­ing it out?

• How do you know MCTS doesn’t pre­serve al­ign­ment?

• As I un­der­stand it—MCTS is used to max­i­mize a given com­putable util­ity func­tion, and so it is non al­ign­ment-pre­serv­ing in the gen­eral sense that a suffi­ciently strong op­ti­miza­tion of a non-perfect util­ity func­tion is non al­ign­ment-pre­serv­ing.