# Charlie Steiner

Karma: 1,436
• To elaborate, A->B is an operation with a truth table:

A    B    A->B
T    T    T
T    F    F
F    T    T
F    F    T


The only thing that falsifies A->B is if A is true but B is false. This is different from how we usually think about implication, because it’s not like there’s any requirement that you can deduce B from A. It’s just a truth table.

But it is relevant to probability, because if A->B, then you’re not allowed to assign high probability to A but low probability to B.
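The constraint can be checked mechanically. Here is a minimal Python sketch (my own illustration, not from the original comment): the material conditional is false only on (T, F), and any distribution that puts zero mass on “A true, B false” automatically satisfies P(B) >= P(A).

```python
def implies(a: bool, b: bool) -> bool:
    """Material conditional: false only when A is true and B is false."""
    return (not a) or b

# Reproduce the truth table.
for a in (True, False):
    for b in (True, False):
        print(a, b, implies(a, b))

# Toy distribution over the four truth assignments, chosen (hypothetically)
# so that P(A and not B) = 0, i.e. A -> B holds with certainty.
p = {(True, True): 0.6, (True, False): 0.0,
     (False, True): 0.1, (False, False): 0.3}
p_a = sum(v for (a, _), v in p.items() if a)
p_b = sum(v for (_, b), v in p.items() if b)
# Since the only way to have A without B carries zero mass, P(A) <= P(B).
assert p_a <= p_b
```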

EDIT: Anyhow I think that paragraph is a really quick and dirty way of phrasing the incompatibility of logical uncertainty with normal probability. The issue is that in normal probability, logical steps are things that are allowed to happen inside the parentheses of the P() function. No matter how complicated the proof of φ, as long as the proof follows logically from premises, you can’t doubt φ more than you doubt the premises, because the P() function thinks that P(premises) and P(logical equivalent of premises according to Boolean algebra) are “the same thing.”

• Ah, MIRI summer fellows! Maybe that’s why there’s so many posts today.

I think that if there’s a dichotomy, it’s “abstract/ideal agents” vs. “physical ‘agents’”.

Physical agents, like humans, don’t have to be anything like agent clusters—there doesn’t have to be any ideal agent hiding inside them. Instead, thinking about them as agents is a descriptive step taken by us, the people modeling them. The key philosophical technology is the intentional stance.

On to the meat of the post—agents are already very general, especially if you allow preferences over world-histories, at which point they become really general. Maybe it makes more sense to think of these things as languages in which some things are simple and others are complicated? At which point I think you have a straightforward distance function between languages (how surprising one language is on average to another), but no sense of equivalency aside from identical rankings.

• Consider the Sphex wasp, doing the same thing in response to the same stimulus. Would you say that this is not an agent, or would you say that it is part of an agent, and that extended agent did search in a “world model” instantiated in the parts of the world inhabited by ancestral wasps?

At this point, if you allow “world model” to be literally anything with mutual information including other macroscopic situations in the world, and “search” to be any process that gives you information about outcomes, then yes, I think you can guarantee that, probabilistically, getting a specific outcome requires information about that outcome (no free lunch), which implies “search” on a “world model.” As for goals, we can just ignore the apparent goals of the Sphex wasp and define a “real” agent (evolution) to have a goal defined by whatever informative process was at work (survival).

• Well, maybe I didn’t do a good job understanding your question :)

Decision procedures that don’t return an answer, or that fail to halt, for some of the “possible” histories, seem like a pretty broad category. Ditto for decision procedures that always have an answer.

But I guess a lot of those decision procedures are boring or dumb. So maybe you were thinking about a question like “for sufficiently ‘good’ decision theories, do they all end up specifying responses for all counter-logical histories, or do they leave free parameters?”

Am I on the right track?

• Sure. On the one hand, xkcd. On the other hand, if it works for you, that’s great and absolutely useful progress.

I’m a little worried about direct applicability to RL because the model is still not fully naturalized—actions that affect goals are neatly labeled and separated rather than being a messy subset of actions that affect the world. I guess this is another one of those cases where I think the “right” answer is “sophisticated common sense,” but an ad-hoc mostly-answer would still be useful conceptual progress.

• The search thing is a little subtle. It’s not that search or optimization is automatically dangerous—it’s that search can turn up adversarial examples / surprising solutions.

I mentioned how I think the particular kind of idiot-proofness that natural language processing might have is “won’t tell an idiot a plan to blow up the world if they ask for something else.” Well, I think that as soon as the AI is doing a deep search through outcomes to figure out how to make Alzheimer’s go away, you lose a lot of that protection and I think the AI is back in the category of Oracles that might tell an idiot a plan to blow up the world.

Going beyond human knowledge

You make some good points about even a text-only AI having optimization pressure to surpass humans. But for the example “GPT-3” system, even if it in some sense “understood” the cure for Alzheimer’s, it still wouldn’t tell you the cure for Alzheimer’s in response to a prompt, because it’s trying to find the continuation of the prompt with highest probability in the training distribution.

The point isn’t about text vs. video. The point is about the limitations of trying to learn the training distribution.

To the extent that understanding the world will help the AI learn the training distribution, in the limit of super-duper-intelligent AI it will understand more and more about the world. But it will filter that all through the intent to learn the training distribution. For example, if human text isn’t trustworthy on a certain topic, it will learn to not be trustworthy on that topic either.

• Sure. In the case of Lincoln, I would say the problem is solved by models even as clean as Pearl-ian causal networks. But in math, there’s no principled causal network model of theorems to support counterfactual reasoning via causal calculus.

Of course, I more or less just think that we have an unprincipled causality-like view of math that we take when we think about mathematical counterfactuals, but it’s not clear that this is any help to MIRI understanding proof-based AI.

• I feel like this is practically a frequentist/bayesian disagreement :D It seems “obvious” to me that “If Lincoln were not assassinated, he would not have been impeached” can be about the real Lincoln as much as me saying “Lincoln had a beard” is, because both are statements made using my model of the world about this thing I label Lincoln. No reference class necessary.

• Honestly? I feel like this same set of problems gets re-solved a lot. I’m worried that it’s a sign of ill health for the field.

I think we understand certain technical aspects of corrigibility (indifference and CIRL), but have hit a brick wall in certain other aspects (things that require sophisticated “common sense” about AIs or humans to implement, philosophical problems about how to get an AI to solve philosophical problems). I think this is part of what leads to re-treading old ground when new people (or a person wanting to apply a new tool) try to work on AI safety.

On the other hand, I’m not sure if we’ve exhausted Concrete Problems yet. Yes, the answer is often “just have sophisticated common sense,” but I think the value is in exploring the problems and generating elegant solutions so that we can deepen our understanding of value functions and agent behavior (like TurnTrout’s work on low-impact agents). In fact, Tom’s a co-author on a very good toy problems paper, many of which require similar sorts of one-off solutions that still might advance our technical understanding of agents.

• I think the most “native” representation of utility functions is actually as a function from ordered triples of outcomes to real numbers. Rather than having an arbitrary (affine symmetry breaking) scale for strength of preference, set the scale of a preference by comparing to a third possible outcome.

The function is the “how much better?” function. Given possible outcomes A, B, and X, how many times better is A (relative to X) than B (relative to X)?

If A is chocolate cake, and B is ice cream, and X is going hungry, maybe the chocolate cake preference is 1.25 times stronger, so the function Betterness(chocolate cake, ice cream, going hungry) = 1.25.

This is the sort of preference that you would elicit from a gamble (at least from a rational agent, not necessarily from a human). If I am indifferent between a probability 1 of ice cream, and a gamble with probability 0.8 of chocolate cake and 0.2 of going hungry, that tells you the betterness value above.
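A quick numerical sketch of how this works (my own illustration, with arbitrary utility numbers; only the ratio is meaningful): the `betterness` function below recovers the 1.25 from the indifference condition, and the ratio survives any positive affine rescaling of the utilities.

```python
def betterness(u_a: float, u_b: float, u_x: float) -> float:
    """How many times better is A (relative to X) than B (relative to X)?"""
    return (u_a - u_x) / (u_b - u_x)

# Arbitrary affine choice of scale (hypothetical numbers):
u_hungry, u_cake = 0.0, 1.0
# Indifference to the gamble: u(ice cream) = 0.8*u(cake) + 0.2*u(hungry)
u_ice = 0.8 * u_cake + 0.2 * u_hungry

print(betterness(u_cake, u_ice, u_hungry))  # 1.25

# The ratio is unchanged by any rescaling u -> m*u + c with m > 0:
m, c = 7.0, -3.0
print(betterness(m * u_cake + c, m * u_ice + c, m * u_hungry + c))  # still ~1.25
```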

Anyhow, interesting post, I’m just idly commenting.

• This is definitely an interesting topic, and I’ll eventually write a related post, but here are my thoughts at the moment.

1 - I agree that using natural language prompts with systems trained on natural language makes for a much easier time getting common-sense answers. That’s a particular sort of idiot-proofing that prevents the hypothetical idiot from having the AI tell them how to blow up the world. You use the example of “How would we be likely to cure Alzheimer’s?”—but for a well-trained natural language Oracle, you could even ask “How should we cure Alzheimer’s?”

If it was an outcome pump with no particular knowledge of humans, it would give you a plan that would set off our nuclear arsenals. A superintelligent search process with an impact penalty would tell you how to engineer a very unobtrusive virus. A perfect world model with no special knowledge of humans would tell you a series of configurations of quantum fields. These are all bad answers.

What you want the Oracle to tell you is the sort of plan that might practically be carried out, or some other useful information, that leads to an Alzheimer’s cure in the normal way that people mean when talking about diseases and research and curing things. Any model that does a good job predicting human natural language will take this sort of thing for granted in more or less the way you want it to.

2 - But here’s the problem with curing Alzheimer’s: it’s hard. If you train GPT-3 on a bunch of medical textbooks and prompt it to tell you a cure for Alzheimer’s, it won’t tell you a cure, it will tell you what humans have said about curing Alzheimer’s.

If you train a simultaneous model (like a neural net or a big transformer or something) of human words, plus sensor data of the surrounding environment (like how an image captioning AI can be thought of as having a simultaneous model of words and pictures), and figure out how to control the amount of detail of verbal output, you might be able to prompt an AI with text about an Alzheimer’s cure, have it model a physical environment that it expects those words to take place in, and then translate that back into text describing the predicted environment in detail. But it still wouldn’t tell you a cure. It would just tell you a plausible story about a situation related to the prompt about curing Alzheimer’s, based on its training data. Rather than a logical Oracle, this image-captioning-esque scheme would be an intuitive Oracle, telling you things that make sense based on associations already present within the training set.

What am I driving at here, by pointing out that curing Alzheimer’s is hard? It’s that the designs above are missing something, and what they’re missing is search.

I’m not saying that getting a neural net to directly output your cure for Alzheimer’s is impossible. But it seems like it requires there to already be a “cure for Alzheimer’s” dimension in your learned model. The more realistic way to find the cure for Alzheimer’s, if you don’t already know it, is going to involve lots of logical steps one after another, slowly moving through a logical space, narrowing down the possibilities more and more, and eventually finding something that fits the bill. In other words, solving a search problem.

So if your AI can tell you how to cure Alzheimer’s, I think either it’s explicitly doing a search for how to cure Alzheimer’s (or worlds that match your verbal prompt the best, or whatever), or it has some internal state that implicitly performs a search.

And once you realize you’re imagining an AI that’s doing search, maybe you should feel a little less confident in the idiot-proofness I talked about in section 1. Maybe you should be concerned that this search process might turn up the equivalent of adversarial examples in your representation.

3 - Whenever I see a proposal for an Oracle, I tend to try to jump to the end—can you use this Oracle to immediately construct a friendly AI? If not, why not?

A perfect Oracle would, of course, immediately give you FAI. You’d just ask it “what’s the code for a friendly AI?”, and it would tell you, and you would run it.

Can you do the same thing with this self-supervised Oracle you’re talking about? Well, there might be some problems.

One problem is the search issue I just talked about—outputting functioning code with a specific purpose is a very search-y sort of thing to do, and not a very big-ol’-neural-net thing to do, even more so than outputting a cure for Alzheimer’s. So maybe you don’t fully trust the output of this search, or maybe there’s no search and your AI is just incapable of doing the task.

But I think this is a bit of a distraction, because the basic question is whether you trust this Oracle with simple questions about morality. If you think the AI is just regurgitating an average answer to trolley problems or whatever, should you trust it when you ask for the FAI’s code?

There’s an interesting case to be made for “yes, actually,” here, but I think most people will be a little wary. And this points to a more general problem with definitions—any time you care about a definition having some particularly nice properties beyond what’s most predictive of the training data, maybe you can’t trust this AI.

• That proof of the instability of RNNs is very nice.

The version of the vanishing gradient problem I learned is simply that if you’re updating weights proportional to the gradient, then if your average weight somehow ends up as 0.98, as you increase the number of layers your gradient, and therefore your update size, will shrink roughly like (0.98)^n, which is not the behavior you want.
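To see that shrinkage concretely (toy numbers, my own illustration, not from the comment): if backprop multiplies the gradient by roughly 0.98 per layer, the update reaching the first layer of an n-layer net scales like 0.98^n.

```python
# Per-layer attenuation factor (hypothetical average weight from the comment).
factor = 0.98

# Effective gradient scale after passing back through n layers.
for n in (10, 100, 500):
    print(n, factor ** n)
# The scale decays exponentially: by n = 500 the update is ~4e-5 of its
# original size, so early layers barely train.
```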

• One sufficient condition for always defining actions is when a decision theory can give decisions as a function of the state of the world. For example, CDT evaluates outcomes in a way purely dependent on the world’s state. A more complicated way of doing this is if your decision theory takes in a model of the world and outputs a policy, which tells you what to do in each state of the world.
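One hedged sketch of the “outputs a policy” option, with a made-up two-action world (the states, the action set, and `evaluate` are all hypothetical illustrations, not anything from the comment):

```python
def make_policy(world_states, evaluate):
    """Given evaluate(state, action) -> value, return the argmax policy:
    a complete mapping from every world state to an action, so no state
    is left without a specified response."""
    actions = ("left", "right")  # hypothetical action set
    return {s: max(actions, key=lambda a: evaluate(s, a)) for s in world_states}

# Toy value function: "left" is worth s, "right" is always worth 1.
policy = make_policy(range(3), lambda s, a: s if a == "left" else 1)
print(policy)
```

Because the policy is a total function on states, the decision theory defines an action everywhere, which is the sufficient condition described above.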

• And of course you can go further and have different  that all have similarly valid claims to be , because they’re all similarly good generalizations of our behavior into a consistent function on a much larger domain.

• Yeah I agree that this might secretly be the same as a question about uploads.

If you’re only trying to copy human behavior in a coarse-grained way, you immediately run into a huge generalization problem, because your human-imitation is going to have to make plans where it can copy itself, think faster as it adds more computing power, can’t get a hug, etc., and this is all outside of the domain it was trained on.

So if people aren’t being very specific about human imitations, I kind of assume they’re really talking and thinking about basically-uploads (i.e. imitations that generalize to this novel context by having a model of human cognition that attempts to be realistic, not merely predictive).

• Could you expand on why you think that information / entropy doesn’t match what you mean by “amount of optimization done”?

E.g. suppose you’re training a neural network via gradient descent. If you start with weights drawn from some broad distribution, after training they will end up in some narrower distribution. This seems like a good metric of “amount of optimization done to the neural net.”

I think there are two categories of reasons why you might not be satisfied—false positives and false negatives. False positives would be “I don’t think much optimization has been done, but the distribution got a lot narrower,” and false negatives would be “I think more optimization is happening, but the distribution isn’t getting any narrower.” Did you have a specific instance of one of these cases in mind?
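One way to make “the distribution got narrower” quantitative (my sketch, assuming Gaussian weight distributions, which is an assumption not in the comment): use differential entropy. Narrowing the standard deviation by a factor k removes log2(k) bits per weight.

```python
import math

def gaussian_entropy_bits(sigma: float) -> float:
    """Differential entropy of a Gaussian, in bits: 0.5 * log2(2*pi*e*sigma^2)."""
    return 0.5 * math.log2(2 * math.pi * math.e * sigma ** 2)

# Hypothetical spreads: broad at initialization, narrow after training.
sigma_init, sigma_trained = 1.0, 0.05
h0 = gaussian_entropy_bits(sigma_init)
h1 = gaussian_entropy_bits(sigma_trained)
print(h0 - h1)  # bits of entropy removed per weight by training
```

The difference h0 - h1 equals log2(sigma_init / sigma_trained), so the metric depends only on how much the spread narrowed, not on the arbitrary units of the weights.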

• Here’s a more general way of thinking about what you’re saying that I find useful: it’s not that self-awareness is the issue per se, it’s that you’re avoiding building an agent—by a specific technical definition of “agent.”

Agents, in the sense I think is most useful when thinking about AI, are things that choose actions based on the predicted consequences of those actions.

On some suitably abstract level of description, agents must have available actions, they must have some model of the world that includes a free parameter for different actions, and they must have a criterion for choosing actions that’s a function of what the model predicts will happen when they take those actions. Agents are what is dangerous, because they steer the future based on their criterion.

What you describe in this post is an AI that has actions (outputting text to a text channel), and has a model of the world. But maybe, you say, we can make it not an agent, and therefore a lot less dangerous, by making it so that there is no free parameter in the model for the agent to try out different actions, and instead of choosing its action based on consequences, it will just try to describe what its model predicts.

Thinking about it in terms of agents like this explains why “knowing that it’s running on a specific computer” has the causal powers that it does—it’s a functional sort of “knowing” that involves having your model of the world impacted by your available actions in a specific way. Simply putting “I am running on this specific computer” into its memory wouldn’t make it an agent—and if it chooses what text to output based on predicted consequences, it’s an agent whether or not it has “I am running on this specific computer” in its memory.

So, could this work? Yes. It would require a lot of hard, hard work on the input/output side, especially if you want reliable natural language interaction with a model of the entire world, and you still have to worry about the inner optimizer problem, particularly e.g. if you’re training it in a way that creates an incentive for self-fulfilling prophecy or some other implicit goal.

The basic reason I’m pessimistic about the general approach of figuring out how to build safe non-agents is that agents are really useful. If your AI design requires a big powerful model of the entire world, that means that someone is going to build an agent using that big powerful model very soon after. Maybe this tool gives you some breathing room by helping suppress competitors, or maybe it makes it easier to figure out how to build safe agents. But it seems more likely to me that we’ll get a good outcome by just directly figuring out how to build safe agents.

“I don’t trust humans to be a trusted source when it comes to what an AI should do with the future lightcone.”
First, let’s acknowledge that this is a new objection you are raising which we haven’t discussed yet, eh? I’m tempted to say “moving the goalposts”, but I want to hear your best objections wherever they come from; I just want you to acknowledge that this is in fact a new objection :)

Sure :) I’ve said similar things elsewhere, but I suppose one must sometimes talk to people who haven’t read one’s every word :P

We’re being pretty vague in describing the human-AI interaction here, but I agree that one reason why the AI shouldn’t just do what it would predict humans would tell it to do (or, if below some threshold of certainty, ask a human) is that humans are not immune to distributional shift.

There are also systematic factors, like preserving your self-image, that sometimes make humans say really dumb things about far-off situations because of more immediate concerns.

Lastly, figuring out what the AI should do with its resources is really hard, and figuring out which of two complicated choices to call “better” can be hard too, and humans will sometimes do badly at it. Worst case, the humans appear to answer hard questions with certainty, or conversely, the questions the AI is most uncertain about slowly devolve into giving humans hard questions and treating their answers as strong information.

I think the AI should actively take this stuff into account rather than trying to stay in some context where it can unshakeably trust humans. And by “take this into account,” I’m pretty sure that means model the human and treat preferences as objects in the model.

Skipping over the intervening stuff I agree with, here’s that Eliezer quote:

Eliezer Yudkowsky wrote: “If the subject is Paul Christiano, or Carl Shulman, I for one am willing to say these humans are reasonably aligned; and I’m pretty much okay with somebody giving them the keys to the universe in expectation that the keys will later be handed back.”
Do you agree or disagree with Eliezer? (In other words, do you think a high-fidelity upload of a benevolent person will result in a good outcome?)
If you disagree, it seems that we have no hope at success whatsoever. If no human can be trusted to act, and AGI is going to arise through our actions, then we can’t be trusted to build it right. So we might as well just give up now.

I think Upload Paul Christiano would just go on to work on the alignment problem, which might be useful but is definitely passing the buck.

Though I’m not sure. Maybe Upload Paul Christiano would be capable of taking over the world and handling existential threats before swiftly solving the alignment problem. Then it doesn’t really matter if it’s passing the buck or not.

But my original thought wasn’t about uploads (though that’s definitely a reasonable way to interpret my sentence), it was about copying human decision-making behavior in the same sense that an image classifier copies human image-classifying behavior.

Though maybe you went in the right direction anyhow, and if all you had was supervised learning, the right thing to do is to try to copy the decision-making of a single person (not an upload, a sideload). What was that Greg Egan book—Zendegi?

so far, it hasn’t really proven useful to develop methods to generalize specifically in the case where we are learning human preferences. We haven’t really needed to develop special methods to solve this specific type of problem. (Correct me if I’m wrong.)

There are some cases where the AI specifically has a model of the human, and I’d call those “special methods.” Not just IRL; the entire problem of imitation learning often uses specific methods to model humans, like “value iteration networks.” This is the sort of development I’m thinking of that helps AI do a better job at generalizing human values—I’m not sure if you meant things at a lower level, like using a different gradient descent optimization algorithm.

• Ah, but I don’t trust humans to be a trusted source when it comes to what an AI should do with the future lightcone. I expect you’d run into something like Scott talks about in The Tails Coming Apart As Metaphor For Life, where humans are making unprincipled and contradictory statements, with not at all enough time spent thinking about the problem.

As Ian Goodfellow puts it, machine learning people have already been working on alignment for decades. If alignment is “learning and respecting human preferences”, object recognition is “human preferences about how to categorize images”, and sentiment analysis is “human preferences about how to categorize sentences”.

I somewhat agree, but you could equally well call them “learning human behavior at categorizing images,” “learning human behavior at categorizing sentences,” etc. I don’t think that’s enough. If we build an AI that does exactly what a human would do in that situation (or what action they would choose as correct when assembling a training set), I would consider that a failure.

So this is two separate problems: one, I think humans can’t reliably tell an AI what they value through a text channel, even with prompting, and two, I think that mimicking human behavior, even human behavior on moral questions, is insufficient to deal with the possibilities of the future.

I’ve never heard anyone in machine learning divide the field into cases where we’re trying to generalize about human values and cases where we aren’t. It seems like the same set of algorithms, tricks, etc. work either way.

It also sounds silly to say that one can divide the field into cases where you’re doing model-based reinforcement learning, and cases where you aren’t. The point isn’t the division, it’s that model-based reinforcement learning is solving a specific type of problem.

Let me take another go at the distinction: suppose you have a big training set of human answers to moral questions. There are several different things you could mean by “generalize well” in this case, which correspond to solving different problems.

The first kind of “generalize well” is where the task is to predict moral answers drawn from the same distribution as the training set. This is what most of the field is doing right now for Ian Goodfellow’s examples of categorizing images or categorizing sentences. The better we get at generalizing in this sense, the more reproducing the training set corresponds to reproducing the test set.

Another sort of “generalize well” might be inferring a larger “real world” distribution even when the training set is limited. For example, if you’re given labeled data mapping handwritten digits 0-20 to binary outputs, can you give the correct binary output for 21? How about 33? In our moral questions example, this would be like predicting answers to moral questions spawned by novel situations not seen in training. The better we get at generalizing in this sense, the more reproducing the training set corresponds to reproducing examples later drawn from the real world.

Let’s stop here for a moment and point out that if we want generalization in the second sense, algorithmic advances in the first sense might be useful, but they aren’t sufficient. For the classifier to output the binary for 33, it probably has to be deliberately designed to learn flexible representations, and probably get fed some additional information (e.g. by transfer learning). When the training distribution and the “real world” distribution are different, you’re solving a different problem than when they’re the same.

A third sort of “generalize well” is to learn superhumanly skilled answers even if the training data is flawed or limited. Think of an agent that learns to play Atari games at a superhuman level, from human demonstrations. This generalization task often involves filling in a complex model of the human “expert,” along with learning about the environment—for current examples, the model of the human is usually hand-written. The better we get at generalizing in this way, the more the AI’s answers will be like “what we meant” (either by some metric we kept hidden from the AI, or in some vague intuitive sense) even if they diverge from what humans would answer.

(I’m sure there are more tasks that fall under the umbrella of “generalization,” but you’ll have to suggest them yourself :) )

So while I’d say that value learning involves generalization, I think that generalization can mean a lot of different tasks—a rising tide of type 1 generalization (which is the mathematically simple kind) won’t lift all boats.

• Yes, I agree that generalization is important. But I think it’s a bit too reductive to think of generalization ability as purely a function of the algorithm.

For example, an image-recognition algorithm trained with dropout generalizes better, because dropout acts like an extra goal telling the training process to search for category boundaries that are smooth in a certain sense. And the reason we expect that to work is because we know that the category boundaries we’re looking for are in fact usually smooth in that sense.

So it’s not like dropout is a magic algorithm that violates a no-free-lunch theorem and extracts generalization power from nowhere. The power that it has comes from our knowledge about the world that we have encoded into it.

(And there is a no-free-lunch theorem here. How to generalize beyond the training data is not uniquely encoded in the training data; every bit of information in the generalization process has to come from your model and training procedure.)
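For concreteness, here is what dropout itself does at train time (a minimal plain-Python sketch of inverted dropout, my own illustration, not from the comment); the smoothness point above is about the effect of this randomization on the boundaries the training process ends up preferring.

```python
import random

def dropout(activations, p=0.5, seed=0):
    """Inverted dropout: zero each unit with probability p, and rescale
    survivors by 1/(1-p) so the expected activation is unchanged."""
    rng = random.Random(seed)
    return [0.0 if rng.random() < p else a / (1 - p) for a in activations]

kept = dropout([1.0] * 10, p=0.5)
print(kept)  # a mix of 0.0s and 2.0s; the expected value per unit stays 1.0
```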

For value learning, we want the AI to have a very specific sort of generalization skill when it comes to humans. It has to not only predict human actions, it has to make a very particular sort of generalization (“human values”), and single out part of that generalization to make plans with. The information to pick out one particular generalization rather than another has to come from humans doing hard, complicated work, even if it gets encoded into the algorithm.