Alignment as Translation

Technology Changes Constraints argues that economic constraints are usually modular with respect to technology changes—so for reasoning about technology changes, it’s useful to cast them in terms of economic constraints. Two constraints we’ll talk about here:

  • Compute—FLOPs, memory, etc.

  • Information—sensors, data, etc.

Thanks to ongoing technology changes, both of these constraints are becoming more and more slack over time—compute and information are both increasingly abundant and cheap.

Immediate question: what happens in the limit as the prices of both compute and information go to zero?

Essentially, we get omniscience: our software has access to a perfect, microscopically-detailed model of the real world. Computers have the memory and processing capability to run arbitrary queries on that model, and predictions are near-perfectly accurate (modulo quantum noise). This limit applies even without AGI—as compute and information become more abundant, our software approaches omniscience, even limiting ourselves to special-purpose reasoning algorithms.

Of course, AGI would presumably be closer to omniscience than non-AGI algorithms at the same level of compute/information. It would more accurately predict things which aren’t directly observable via the available sensors, and it would run larger queries with the same amount of compute. (How much closer to omniscience an AGI would get is an open question, but it would at least not be any worse in a big-O sense.)

Next question: as compute and information constraints slacken, which constraints become taut? What new bottlenecks appear for problems which were previously bottlenecked on compute/information?

To put it differently: if our software can run arbitrary queries on an accurate, arbitrarily precise low-level model of the physical world, what else do we need in order to get value out of that capability?

Well, mainly we need some way to specify what it is that we want. We need an interface.

Our highly accurate low-level world model can tell us anything about the physical world, but the things-we-want are generally more abstract than molecules/atoms/fields. Our software can have arbitrarily precise knowledge and predictive power over physical observables, but it still won’t have any notion that air-pressure-oscillations which sound like the word “cat” have something to do with the organs/cells/biomolecules which comprise a cat. It won’t have built in any notion of “tree” or “rock” or “human”—using such high-level abstractions would only impede predictive power, when we could instead model the individual components of such high-level objects.

It’s the prototypical interface problem: the structure of a high-precision world-model generally does not match the structure of what-humans-want, or the structure of human abstractions in general. Someone/something has to translate between the structures in order to produce anything useful.

As I see it, this is the central problem of alignment.

Some Approaches

Default: Humans Translate

Without some scalable way to build high-level world models out of low-level world models, we constantly need to manually translate things-humans-want into low-level specifications. It’s an intellectual-labor-intensive and error-prone process; writing programs in assembly code is not just an analogy but an example. Even today’s “high-level programming languages” are much more structurally similar to assembly code than to human world-models—Python has no notion of “oak tree”.
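To make the labor-intensiveness concrete, here is a toy sketch (every name, feature, and threshold below is invented for illustration) of what manually translating the concept “oak tree” into a specification over low-level observables might look like:

```python
# Hypothetical sketch: hand-translating the human concept "oak tree" into
# a low-level specification. Every threshold is an arbitrary choice a
# human had to make by hand, and each one is a potential error.

def looks_like_oak_tree(voxels):
    """Crude hand-written detector over a grid of material labels.

    `voxels` maps (x, y, z) coordinates to material strings. Real
    low-level observables (fields, molecules) would be far less forgiving
    than this cartoon.
    """
    wood = sum(1 for m in voxels.values() if m == "cellulose")
    leaf = sum(1 for m in voxels.values() if m == "chlorophyll")
    # "An oak tree is mostly wood with some leaves" -- already a lossy,
    # error-prone translation of the human concept.
    return wood > 100 and leaf > 10

# A fake "oak tree": 200 wood voxels stacked up, 50 leaf voxels beside them.
fake_tree = {(0, 0, z): "cellulose" for z in range(200)}
fake_tree.update({(0, 1, z): "chlorophyll" for z in range(50)})
print(looks_like_oak_tree(fake_tree))  # True -- but a lumber pile next to a salad also passes
```

The point of the sketch is not that the detector is bad code, but that nothing in the low-level model tells us which thresholds correspond to the human concept; every such rule is manual translation work.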

An analogy: translating high-level structure into low-level specification the way we do today is like translating English into Korean by hand.

Humans Translate Using Better Tools

It’s plausible (though I find it unlikely) that we could tackle the problem by building better tools to help humans translate from high-level to low-level—something like much-higher-level programming languages. I find it unlikely because we’d probably need major theoretical breakthroughs—for instance, how do I formally define “tree” in terms of low-level observables? Even if we had ways to do that, they’d probably enable easier strategies than building better programming languages.

Analogy: it’s like translating by hand from English to Korean, but with the assistance of a dictionary, spell-checker, grammar-checker, etc. But if we had an English-Korean dictionary, we’d probably be most of the way to automated translation anyway (in this respect, the analogy is imperfect).

Examples + Interpolation

Another path which is plausible (though I find it unlikely) is something like programming-by-example—not unlike today’s ML. This seems unlikely to work from both an inside and outside view:

  • Inside view: the whole problem in the first place is that low-level structure doesn’t match high-level structure, so there’s no reason to expect software systems to interpolate along human-intuitive dimensions.

  • Outside view: programming-by-example (and today’s ML with it) is notoriously unreliable.

Examples alone aren’t enough to make software reliably carve reality at the same joints as humans. There probably are some architectures which would reliably carve at the same joints as humans—different humans tend to chunk the world into similar objects, after all. But figuring out such an architecture would take more than just throwing lots of data at the problem.
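A minimal illustration of the inside-view point, with invented numbers: a 1-nearest-neighbor “learner” whose distance metric lives in a low-level feature space. The humans labeled the examples by shape, but the metric is dominated by raw brightness, so the learner interpolates along a dimension humans don’t care about.

```python
# Toy example (all data invented): features are (brightness, edginess).
# Humans label by shape ("edginess"); the learner's squared-distance
# metric is dominated by the much larger brightness scale.

def nearest_label(examples, query):
    """Return the label of the training example nearest to `query`."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(examples, key=lambda ex: dist(ex[0], query))[1]

train = [
    ((230.0, 0.8), "tree"),      # bright and tree-shaped
    ((20.0, 0.1), "not tree"),   # dim and blobby
]

# A dim but clearly tree-shaped query: a human says "tree"...
query = (20.0, 0.8)
# ...but brightness dominates the distance, so the learner disagrees.
print(nearest_label(train, query))  # -> "not tree"
```

The failure isn’t a bug in the code; the interpolation is working exactly as specified. The low-level geometry just doesn’t line up with the human-intuitive dimension.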

To put it differently: the way-in-which-we-want-things-translated is itself something which needs to be translated. A human’s idea-of-what-constitutes-a-“good”-low-level-specification-of-“oak tree” is itself pretty high-level and abstract; that idea itself needs to be translated into a low-level specification before it can be used. If we’re trying to use examples+interpolation, then the interpolation algorithm is our “specification” of how-to-translate… and it probably isn’t a very good translation of our actual high-level idea of how-to-translate.

Analogy: it’s like teaching English to Korean speakers by pointing to trees and saying “tree”, pointing to cars and saying “car”, etc… except that none of them actually realize they’re supposed to be learning another language. The Korean-language instructions they received were not actually a translation of the English explanation “learn the language that person is speaking”.


Incentives

A small tweak to the previous approach: train a reinforcement learner.

The analogy: rather than giving our Korean speakers some random Korean-language instructions, we don’t give them any instructions—we just let them try things, and then pay them when they happen to translate things from English to Korean.

Problem: this requires some way to check that the translation was correct. Knowing what to incentivize is not any easier than specifying what-we-want to begin with. Rather than translating English-to-Korean, we’re translating English-to-incentives.

Now, there is a lot of room here for clever tricks. What if we verify the translation by having one group translate English-to-Korean, having another group translate back, and rewarding both when the result matches the original? Or by taking the Korean translation, giving it to some other Korean speaker, and seeing what they do? These are possible approaches to translating English into incentives, within the context of the analogy.
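The round-trip trick can be sketched directly. Here is a minimal version with an invented two-word vocabulary (Korean romanized), which also shows one of the surprising corner cases such schemes run into: a degenerate “translator” that copies its input verbatim earns full reward without translating anything.

```python
# Sketch of the round-trip ("back-translation") verification from the
# analogy. Vocabularies are toy/invented; the point is the incentive
# structure, not the linguistics.

def round_trip_reward(forward, backward, sentences):
    """Reward 1 per sentence whose back-translation matches the original."""
    return sum(1 for s in sentences if backward(forward(s)) == s)

EN_KO = {"tree": "namu", "car": "cha"}      # toy dictionary (romanized)
KO_EN = {ko: en for en, ko in EN_KO.items()}

honest_fwd = lambda s: EN_KO[s]
honest_bwd = lambda s: KO_EN[s]

cheater = lambda s: s  # "translates" by returning its input unchanged

sentences = ["tree", "car"]
print(round_trip_reward(honest_fwd, honest_bwd, sentences))  # 2
print(round_trip_reward(cheater, cheater, sentences))        # 2 -- same reward, zero translation
```

The check is internally consistent yet fails to pin down the behavior we wanted, which is exactly the “translating English into incentives” problem restated.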

It’s possible in principle that translating what-humans-want into incentives is easier than translating into low-level specifications directly. However, if that’s the case, I have yet to see compelling evidence—attempts to specify incentives seem plagued by the same surprising corner cases and long tail of difficult translations as other strategies.

AI Translates

This brings us to the obvious general answer: have the AI handle the translation from high-level structure to low-level structure. This is probably what will happen eventually, but the previous examples should make it clear why it’s hard: an explanation of how-to-translate must itself be translated. In order to make an AI which translates high-level things-humans-want into low-level specifications, we first need a low-level specification of the high-level concept “translate high-level things-humans-want into low-level specifications”.

Continuing the earlier analogy: we’re trying to teach English to a Korean speaker, but that Korean speaker doesn’t have any idea that they’re supposed to be learning another language. In order to get them to learn English, we first need to somehow translate something like “please learn this language”.

This is a significant reduction of the problem: rather than translating everything by hand all the time, we just need to translate the one phrase “please learn this language”, and then the hard part is done and we can just use lots of examples for the rest.

But we do have a chicken-and-egg problem: somehow, we need to properly translate that first phrase. Screw up that first translation, and nothing else will work. That part cannot be outsourced; the AI cannot handle the translation, because it has no idea that that’s what we want it to do.