Problem relaxation as a tactic

It’s easier to make your way to the supermarket than it is to compute the fastest route, which is yet easier than computing the fastest route for someone running backwards and doing two and a half jumping jacks every five seconds and who only follows the route some percentage of the time. Sometimes, constraints are necessary. Constraints come with costs. Sometimes, the costs are worth it.

Aspiring re­searchers try­ing to think about AI al­ign­ment might[1] have a failure mode which goes some­thing like… this:

Oh man, so we need to solve both outer and in­ner al­ign­ment to build a su­per­in­tel­li­gent agent which is com­pet­i­tive with un­al­igned ap­proaches and also doesn’t take much longer to train, and also we have to know this ahead of time. Maybe we could use some kind of pre­dic­tion of what peo­ple want… but wait, there’s also prob­lems with us­ing hu­man mod­els! How can it help peo­ple if it can’t model peo­ple? Ugh, and what about self-mod­ifi­ca­tion?! How is this agent even rea­son­ing about the uni­verse from in­side the uni­verse?

The as­piring re­searcher slumps in frus­tra­tion, mut­ters a curse un­der their breath, and hangs up their hat – “guess this whole al­ign­ment thing isn’t for me...”. And isn’t that so? All their brain could do was pat­tern-match onto already-pro­posed solu­tions and cached think­ing.

There’s more than one thing go­ing wrong here, but I’m just go­ing to fo­cus on one. Given that per­son’s un­der­stand­ing of AI al­ign­ment, this prob­lem is wildly over­con­strained. Whether or not al­ign­ment re­search is right for them, there’s just no way that any­one’s brain is go­ing to fulfill this in­sane solu­tion re­quest!

Some­times, con­straints are nec­es­sary. I think that the al­ign­ment com­mu­nity is pretty good at find­ing plau­si­bly nec­es­sary con­straints. Maybe some of the above aren’t nec­es­sary – maybe there’s One Clever Trick you come up with which ob­vi­ates one of these con­cerns.

Con­straints come with costs. Some­times, the costs are worth it. In this con­text, I think the costs are very much worth it. Un­der this im­plicit fram­ing of the prob­lem, you’re pretty hosed if you don’t get even outer al­ign­ment right.

How­ever, even if the real prob­lem has crazy con­straints, that doesn’t mean you should im­me­di­ately tackle the fully con­strained prob­lem. I think you should of­ten re­lax the prob­lem first: elimi­nate or weaken con­straints un­til you reach a prob­lem which is still a lit­tle con­fus­ing, but which you can get some trac­tion on.

Even if you know an un­bounded solu­tion to chess, you might still be 47 years away from a bounded solu­tion. But if you can’t state a pro­gram that solves the prob­lem in prin­ci­ple, you are in some sense con­fused about the na­ture of the cog­ni­tive work needed to solve the prob­lem. If you can’t even solve a prob­lem given in­finite com­put­ing power, you definitely can’t solve it us­ing bounded com­put­ing power. (Imag­ine Poe try­ing to write a chess-play­ing pro­gram be­fore he’d had the in­sight about search trees.)

~ The method­ol­ogy of un­bounded analysis
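To make "a program that solves the problem in principle" concrete, here is a minimal sketch of exhaustive game-tree search. It is not from the quoted article: the toy subtraction game (take 1, 2, or 3 stones; whoever takes the last stone wins) and the function names are assumptions chosen for illustration. The same recursion would, in principle, solve chess given unbounded compute.

```python
def legal_moves(stones):
    """Moves in a hypothetical toy game: remove 1, 2, or 3 stones."""
    return [k for k in (1, 2, 3) if k <= stones]


def minimax(stones):
    """Return +1 if the player to move wins under perfect play, else -1.

    This is an exhaustive search with no depth limit, no pruning, and no
    heuristics: an "unbounded" solution that ignores compute cost.
    """
    if stones == 0:
        # The previous player took the last stone, so the mover has lost.
        return -1
    # The mover's outcome is the negation of the opponent's outcome
    # in the position reached after each move.
    return max(-minimax(stones - k) for k in legal_moves(stones))


def best_move(stones):
    """Pick the move that leaves the opponent in the worst position."""
    return max(legal_moves(stones), key=lambda k: -minimax(stones - k))
```

The search is hopeless for large games, but writing it down shows that the nature of the cognitive work is understood. A bounded solution then becomes a matter of approximating this search rather than guessing at what to compute.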

His­tor­i­cally, I tend to be too slow to re­lax re­search prob­lems. On the flip­side, all of my fa­vorite re­search ideas were di­rectly en­abled by prob­lem re­lax­ation. In­stead of just tel­ling you what to do and then hav­ing you for­get this ad­vice in five min­utes, I’m go­ing to paint it into your mind us­ing two sto­ries.

At­tain­able Utility Preservation

It’s spring of 2018, and I’ve writ­ten my­self into a cor­ner. My work with CHAI for that sum­mer was sup­posed to be on im­pact mea­sure­ment, but I in­con­ve­niently posted a con­vinc­ing-to-me ar­gu­ment that im­pact mea­sure­ment can­not ad­mit a clean solu­tion:

I want to pe­nal­ize the AI for hav­ing side effects on the world.[2] Sup­pose I have a func­tion which looks at the con­se­quences of the agent’s ac­tions and mag­i­cally re­turns all of the side effects. Even if you have this func­tion, you still have to as­sign blame for each effect – ei­ther the vase break­ing was the AI’s fault, or it wasn’t.

If the AI pe­nal­izes it­self for ev­ery­thing, it’ll try to stop peo­ple from break­ing vases – it’ll be clingy. But if you mag­i­cally have a model of how peo­ple are act­ing in the world, and the AI mag­i­cally only pe­nal­izes it­self for things which are its fault, then the AI is in­cen­tivized to black­mail peo­ple to break vases in ways which don’t tech­ni­cally count as its fault. Oops.

Sum­mer dawned, and I oc­cu­pied my­self with read­ing – lots and lots of read­ing. Even­tu­ally, enough was enough – I wanted to figure this out. I strode through my school’s library, mark­ers in my hand and de­ter­mi­na­tion in my heart. I was de­ter­mined not to leave be­fore un­der­stand­ing a) ex­actly why im­pact mea­sure­ment is im­pos­si­ble to solve cleanly, or b) how to solve it.

I reached the white­board, and then – with adrenal­ine pump­ing through my veins – I re­al­ized that I had no idea what this “im­pact” thing even is. Oops.

I’m star­ing at the white­board.

A minute passes.

59 more min­utes pass.

I’d been think­ing about how, in hind­sight, it was so im­por­tant that Shan­non had first writ­ten a perfect chess-play­ing al­gorithm which re­quired in­finite com­pute, that Hut­ter had writ­ten an AGI al­gorithm which re­quired in­finite com­pute. I didn’t know how to solve im­pact un­der all the con­straints, but what if I as­sumed some­thing here?

What if I had infinite computing power? No… Still confused, don’t see how to do it. Oh yeah, and what if the AI had a perfect world model? Hm… What if we could write down a fully specified utility function which represented human preferences? Could I measure impact if I knew that?

The answer was almost trivially obvious. My first thought was that negative impact would be a decrease in true utility, but that wasn’t quite right. I realized that an impact measure also needs to capture decrease in ability to achieve utility. That’s an optimal value function… So the negative impact would be the decrease in attainable utility for human values![3]

Okay, but we don’t and won’t know the “true” util­ity func­tion. What if… we just pe­nal­ized shift in all at­tain­able util­ities?

I then wrote down The At­tain­able Utility Preser­va­tion Equa­tion, more or less. Although it took me a few weeks to be­lieve and re­al­ize, that equa­tion solved all of the im­pact mea­sure­ment prob­lems which had seemed so in­sur­mountable to me just min­utes be­fore.[4]
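The post does not reproduce the equation, so the following is only a sketch of the idea "penalize shift in all attainable utilities", under assumptions that are mine: a tiny deterministic MDP, a finite set of auxiliary reward functions standing in for "all" utilities, a no-op action as the baseline, and an average over the auxiliary rewards as the scaling. The names `q_values`, `aup_penalty`, `TRANSITIONS`, and `AUX_REWARDS` are hypothetical.

```python
def q_values(transitions, reward, gamma=0.9, sweeps=200):
    """Approximate optimal Q-values for a small deterministic MDP.

    transitions: dict mapping state -> {action: next_state}
    reward: dict mapping state -> reward for occupying that state
            (state-based reward is an assumption of this sketch)
    """
    values = {s: 0.0 for s in transitions}
    for _ in range(sweeps):  # synchronous value iteration
        values = {
            s: reward[s] + gamma * max(values[n] for n in acts.values())
            for s, acts in transitions.items()
        }
    return {
        s: {a: reward[s] + gamma * values[n] for a, n in acts.items()}
        for s, acts in transitions.items()
    }


def aup_penalty(transitions, aux_rewards, state, action,
                noop="noop", gamma=0.9):
    """Average absolute shift in attainable utility relative to doing nothing."""
    total = 0.0
    for reward in aux_rewards:
        q = q_values(transitions, reward, gamma)
        total += abs(q[state][action] - q[state][noop])
    return total / len(aux_rewards)


# Hypothetical vase world: smashing the vase is irreversible.
TRANSITIONS = {
    "intact": {"noop": "intact", "smash": "broken"},
    "broken": {"noop": "broken"},
}
# Two stand-in utilities: one likes intact vases, one likes broken ones.
AUX_REWARDS = [
    {"intact": 1.0, "broken": 0.0},
    {"intact": 0.0, "broken": 1.0},
]

smash_penalty = aup_penalty(TRANSITIONS, AUX_REWARDS, "intact", "smash")
noop_penalty = aup_penalty(TRANSITIONS, AUX_REWARDS, "intact", "noop")
```

In this toy world the no-op incurs zero penalty by construction, while smashing is penalized mostly because it destroys the ability to pursue the "intact" utility. Nothing here requires knowing which utility is the "true" one.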

For­mal­iz­ing In­stru­men­tal Convergence

It’s spring of 2019, and I’ve writ­ten my­self into a cor­ner. My first post on AUP was con­fus­ing – I’d failed to truly com­mu­ni­cate what I was try­ing to say. In­spired by Embed­ded Agency, I was plan­ning an illus­trated se­quence of my own.

I was work­ing through a bit of rea­son­ing on how your abil­ity to achieve one goal in­ter­acts with your abil­ity to achieve seem­ingly un­re­lated goals. Spend­ing a lot of money on red dice helps you for the col­lect­ing-dice goal, but makes it harder to be­come the best jug­gler in the world. That’s a weird fact, but it’s an im­por­tant fact which un­der­lies much of AUP’s em­piri­cal suc­cess. I didn’t un­der­stand why this fact was true.

At an impromptu presentation in 2018, I’d remarked that “AUP wields instrumental convergence as a weapon against the alignment problem itself”. I tried thinking about it using the formalisms of reinforcement learning. Suddenly, I asked myself:

Why is in­stru­men­tal con­ver­gence even a thing?

I paused. I went out­side for a walk, and I paced. The walk length­ened, and I still didn’t un­der­stand why. Maybe it was just a “brute fact”, an “emer­gent” phe­nomenon – nope, not buy­ing that. There’s an ex­pla­na­tion some­where.

I went back to the draw­ing board – to the white­board, in fact. I stopped try­ing to un­der­stand the gen­eral case and I fo­cused on spe­cific toy en­vi­ron­ments. I’m look­ing at an en­vi­ron­ment like this

and I’m think­ing, most agents go from 1 to 3. “Why does my brain think this?”, I asked my­self. Un­helpfully, my brain de­cided not to re­spond.

I’m star­ing at the white­board.

A minute passes.

29 more min­utes pass.

I’m re­minded of a pa­per my ad­vi­sor had me read for my qual­ify­ing exam. The pa­per talked about a dual for­mu­la­tion for re­in­force­ment learn­ing en­vi­ron­ments, where you con­sider the available tra­jec­to­ries through the fu­ture in­stead of the available poli­cies. I take a pic­ture of the white­board and head back to my office.

I run into a friend. We start talk­ing about work. I say, “I’m about 80% sure I have the in­sight I need – this is how I felt in the past in situ­a­tions like this, and I turned out to be right”.

I turned out to be right. I started building up an entire theory of this dual formalism. Instead of asking myself about the general case of instrumental convergence in arbitrary computable environments, I considered small deterministic Markov decision processes. I started proving everything I could, building up my understanding piece by piece. This turned out to make all the difference.

Half a year later, I’d built up enough the­ory that I was able to ex­plain a great deal (but not ev­ery­thing) about in­stru­men­tal con­ver­gence.
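As a rough illustration of the trajectory view in a small deterministic MDP, here is a minimal sketch. It is not the theory itself and not the environment from the whiteboard: the graph, the uniform reward distribution, and all function names are assumptions made for this example. It enumerates the available trajectories from a start state, samples random reward functions, and counts which first move the optimal agent makes.

```python
import random

# Hypothetical toy graph: state -> successor states.
# States 2, 4, and 5 are absorbing (their only successor is themselves).
GRAPH = {1: [2, 3], 2: [2], 3: [4, 5], 4: [4], 5: [5]}


def trajectories(graph, start):
    """Yield each path from `start` up to the first absorbing state.

    Assumes the graph has no cycles other than absorbing self-loops.
    """
    stack = [[start]]
    while stack:
        path = stack.pop()
        last = path[-1]
        if graph[last] == [last]:
            yield path
        else:
            stack.extend(path + [nxt] for nxt in graph[last])


def discounted_return(path, reward, gamma):
    """Return from following `path`, then staying in its last state forever."""
    head = sum(gamma ** t * reward[s] for t, s in enumerate(path[:-1]))
    tail = gamma ** (len(path) - 1) * reward[path[-1]] / (1 - gamma)
    return head + tail


def first_move_frequencies(graph, start, gamma=0.9, samples=2000, seed=0):
    """Fraction of sampled reward functions whose optimal path takes each first move."""
    rng = random.Random(seed)
    paths = list(trajectories(graph, start))
    counts = {nxt: 0 for nxt in graph[start]}
    for _ in range(samples):
        # Rewards drawn independently and uniformly from [0, 1): an assumption.
        reward = {s: rng.random() for s in graph}
        best = max(paths, key=lambda p: discounted_return(p, reward, gamma))
        counts[best[1]] += 1
    return {nxt: c / samples for nxt, c in counts.items()}


freqs = first_move_frequencies(GRAPH, start=1)
```

In this toy graph, moving to state 3 keeps two futures available while moving to state 2 keeps one, so most sampled reward functions prefer state 3. That is the kind of regularity a theory of instrumental convergence should explain rather than treat as a brute fact.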


Prob­lem re­lax­ation isn’t always the right tac­tic. For ex­am­ple, if the prob­lem isn’t well-posed, it won’t work well – imag­ine try­ing to “re­lax” the “prob­lem” of free will! How­ever, I think it’s of­ten the right move.

The move itself is simple: consider the simplest instance of the problem which is still confusing. Then, make a ton of simplifying assumptions while still keeping part of the difficulty present – don’t assume away all of the difficulty. Finally, tackle the relaxed problem.

In gen­eral, this seems like a skill that suc­cess­ful re­searchers and math­e­mat­i­ci­ans learn to use. MIRI does a lot of this, for ex­am­ple. If you’re new to the re­search game, this might be one of the cru­cial things to pick up on. Even though I de­tailed how this has worked for me, I think I could benefit from re­lax­ing more.

The world is go­ing to hell. You might be work­ing on a hard (or even an im­pos­si­ble) prob­lem. We plau­si­bly stand on the precipice of ex­tinc­tion and ut­ter an­nihila­tion.

Just re­lax.

This is meant as a reference post. I’m not the first to talk about using problem relaxation in this way. For example, see The methodology of unbounded analysis.

  1. This failure mode is just my best guess – I haven’t ac­tu­ally sur­veyed as­piring re­searchers. ↩︎

  2. The “con­vinc­ing-to-me ar­gu­ment” con­tains a lot of con­fused rea­son­ing about im­pact mea­sure­ment, of course. For one, think­ing about side effects is not a good way of con­cep­tu­al­iz­ing the im­pact mea­sure­ment prob­lem. ↩︎

  3. The ini­tial thought wasn’t as clear as “pe­nal­ize de­crease in at­tain­able util­ity for hu­man val­ues” – I was ini­tially quite con­fused by the AUP equa­tion. “What the heck is this equa­tion, and how do I break it?”.

    It took me a few weeks to get a handle on why it seemed to work so well. It wasn’t for a month or two that I began to understand what was actually going on, eventually leading to the Reframing Impact sequence. However, for the reader’s convenience, I whitewashed my reasoning here a bit. ↩︎

  4. At first, I wasn’t very excited about AUP – I was new to alignment, and it took a lot of evidence to overcome the prior improbability of my having actually found something to be excited about. It took several weeks before I stopped thinking that my idea was probably secretly and horribly bad.

    How­ever, I kept star­ing at the strange equa­tion – I kept try­ing to break it, to find some ob­vi­ous loop­hole which would send me back to the draw­ing board. I never found it. Look­ing back over a year later, AUP does presently have loop­holes, but they’re not ob­vi­ous, nor should they have sent me back to the draw­ing board.

    I started to get ex­cited about the idea. Two weeks later, my work­day was wrap­ping up and I left the library.

    Okay, I think there’s a good chance that this ends up solving impact. If I’m right, I’ll want to have a photo to commemorate it.

    I turned heel, de­scend­ing back into the library’s base­ment. I took the pho­to­graph. I’m glad that I did.

    Dis­cov­er­ing AUP was one of the hap­piest mo­ments of my life. It gave me con­fi­dence that I could think, and it gave me some con­fi­dence that we can win – that we can solve al­ign­ment. ↩︎