• Your ar­gu­ment about Solomonoff in­duc­tion is clever but I feel like it’s miss­ing the point.

I agree it’s missing the point. I do get the point, and I disagree with it. I wanted to say “all three cases will build self-models”; I couldn’t, because that may not be true for Solomonoff induction for an unrelated reason which, as you note, misses the point. I did claim that the other two cases would be self-aware as you define it.

(I agree that Solomonoff in­duc­tion might build an ap­prox­i­mate model of it­self, idk.)

Maybe if we do it right, the best model would not be self-re­flec­tive, not know­ing what it was do­ing as it did its pre­dic­tive thing, and thus un­able to rea­son about its in­ter­nal pro­cesses or rec­og­nize causal con­nec­tions be­tween that and the world it sees (even if such con­nec­tions are blatant).

My claim is that we have no idea how to do this, and I think the ex­am­ples in your post would not do this.

One in­tu­ition is: An or­a­cle is sup­posed to just an­swer ques­tions. It’s not sup­posed to think through how its out­puts will ul­ti­mately af­fect the world. So, one way of en­sur­ing that it does what it’s sup­posed to do, is to de­sign the or­a­cle to not know that it is a thing that can af­fect the world.

I’m not dis­agree­ing that if we could build a self-un­aware or­a­cle then we would be safe. That seems rea­son­ably likely to fix agency is­sues (though I’d want to think about it more). My dis­agree­ment is on the premise of the ar­gu­ment, i.e. can we build self-un­aware or­a­cles at all.

• Ex­am­ple of a self-aware sys­tem: A tra­di­tional RL agent. (Why? Be­cause it has a spe­cial con­cept of “its own ac­tions” rep­re­sented in its mod­els.)
Ex­am­ple of a self-un­aware sys­tem: Any sys­tem that takes in­puts, does a de­ter­minis­tic com­pu­ta­tion, and spits out an out­put. (Why? Be­cause when you cor­rectly com­pute a com­putable func­tion, you get the same an­swer re­gard­less of where and whether the com­pu­ta­tion is phys­i­cally in­stan­ti­ated in the uni­verse.)

A tra­di­tional RL agent is ab­solutely a de­ter­minis­tic com­pu­ta­tion (mod­ulo bugs in the code). It is a pro­gram that gets com­piled into or run by ma­chine in­struc­tions which fol­low a par­tic­u­lar de­ter­minis­tic speci­fi­ca­tion that In­tel has writ­ten (if you’re run­ning on In­tel chips).

Also, this ar­gu­ment would prove that hu­mans are not de­ter­minis­tic sys­tems, which seems like it’s prov­ing too much.
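
To make the determinism point concrete, here is a minimal sketch, assuming a toy epsilon-greedy agent on a hypothetical 3-armed bandit (the function name, arm probabilities, and hyperparameters are all made up for illustration). The agent explicitly represents "its own actions", yet the whole run is a deterministic function of its inputs once the random seed is fixed.

```python
import random


def run_bandit_agent(seed, steps=200, epsilon=0.1):
    """Epsilon-greedy agent on a hypothetical 3-armed Bernoulli bandit.

    All randomness (exploration and rewards) comes from one seeded PRNG,
    so the whole run is a deterministic function of its arguments.
    """
    rng = random.Random(seed)
    arm_probs = [0.2, 0.5, 0.8]      # hypothetical reward probabilities
    counts = [0] * len(arm_probs)    # how often each arm was pulled
    values = [0.0] * len(arm_probs)  # running mean reward per arm
    actions = []
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            action = rng.randrange(len(arm_probs))
        else:
            # Exploit: pick the arm with the highest estimated value.
            action = max(range(len(arm_probs)), key=lambda a: values[a])
        reward = 1.0 if rng.random() < arm_probs[action] else 0.0
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]
        actions.append(action)
    return actions


first = run_bandit_agent(seed=0)
second = run_bandit_agent(seed=0)
print(first == second)  # same seed and inputs give the same action sequence
```

This only illustrates that "has a concept of its own actions" and "is a deterministic computation" are compatible; it says nothing about whether such a system is self-aware in the sense under discussion.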

If you want to pre­dict what’s go­ing to hap­pen in the world, it of­ten helps if you know that you are a thing that af­fects the world.

For your three ex­am­ples: The Solomonoff in­duc­tion ex­am­ple is a weird case be­cause it is an un­com­putable al­gorithm that only has com­putable hy­pothe­ses, so it can’t be aware of it­self, but your sec­ond and third ex­am­ples seem like they to­tally could lead to self-aware sys­tems. In fact, the third ex­am­ple sounds like a de­scrip­tion of hu­mans, and hu­mans are self-aware.

Over­all I don’t see how we could tell in ad­vance whether a sys­tem would be self-un­aware or not.

• So far it is more of a po­si­tion pa­per say­ing “we should do this” rather than “we have done this”, so it’s not su­per clear what tasks the helper-bot does. The hu­man in­ter­acts with the bot by mes­sag­ing it in nor­mal English. Here’s an ex­am­ple of the kind of thing they want to be able to do, taken di­rectly from the pa­per:

PLAYER: build a tower 15 blocks tall and then put a gi­ant smiley on top
ASSISTANT: ok [as­sis­tant starts build­ing the tower]
PLAYER: wait, stop, make ev­ery other block red [as­sis­tant rec­og­nizes the in­struc­tion refers to a change in its cur­rent build task, but doesn’t un­der­stand the change (and speci­fi­cally rec­og­nizes “ev­ery other” as un­known)]
ASSISTANT: What is “ev­ery other”?
PLAYER: Let me show you [player makes a stack of blocks al­ter­nat­ing in color; as­sis­tant is able to gen­er­al­ize “ev­ery other” to new situ­a­tions]

My pro­ject is not at all re­lated in the sense that this pa­per was a sur­prise to me when it was re­leased a few days ago, but it is very re­lated in the sense that I too am try­ing to build a bench­mark of Minecraft tasks where there aren’t ob­vi­ous re­ward func­tions. I’m less fo­cused on nat­u­ral lan­guage though—the hope is that peo­ple could try to solve it us­ing other tech­niques as well, such as IRL or imi­ta­tion learn­ing.

• Oh, I see, so the ar­gu­ment is that con­di­tional on the ideal­ized syn­the­sis al­gorithm be­ing a good defi­ni­tion of hu­man prefer­ences, the AI can ap­prox­i­mate the syn­the­sis al­gorithm, and what­ever util­ity func­tion it comes up with and op­ti­mizes should not have any hu­man-iden­ti­fi­able prob­lems. That makes sense. Fol­lowup ques­tions:

• How do you tell the AI sys­tem to op­ti­mize for “what the ideal­ized syn­the­sis al­gorithm would do”?

• How can we be con­fi­dent that the ideal­ized syn­the­sis al­gorithm ac­tu­ally cap­tures what we care about?

• For example, imagine that the AI extinguished all meaningful human interactions because these can sometimes be painful and the AI knows that we prefer to avoid pain. But it’s clear to us that most people’s partial preferences will not endorse total loneliness as a good outcome; if it’s clear to us, then it’s a fortiori clear to a very intelligent AI; hence the AI will avoid that failure scenario.

I don’t un­der­stand this. My un­der­stand­ing is that you are propos­ing that we build a cus­tom prefer­ence in­fer­ence and syn­the­sis al­gorithm, that’s sep­a­rate from the AI. This pro­duces a util­ity func­tion that is then fed into the AI. But if this is the case, then you can’t use the AI’s in­tel­li­gence to ar­gue that the syn­the­sis al­gorithm will work well, since they are sep­a­rate.

Per­haps you do in­tend for the syn­the­sis al­gorithm to be part of “the AI”? If so, can you say more about how that works? What as­sump­tions about the AI do you need to be true?

• Fwiw I also think it is not nec­es­sary to know lots of ar­eas of math for AI safety re­search. Note that I do in fact know a lot of ar­eas of math rel­a­tively shal­lowly.

I do think it is im­por­tant to be able to do math­e­mat­i­cal rea­son­ing, which I can roughly op­er­a­tional­ize as get­ting to the postri­gor­ous stage in at least one area of math.

• … Plau­si­bly? Idk, it’s very hard for me to talk about the val­idity of in­tu­itions in an in­for­mal, in­tu­itive model that I don’t share. I don’t see any­thing ob­vi­ously wrong with it.

There’s the usual is­sue that Bayesian rea­son­ing doesn’t prop­erly ac­count for em­bed­ded­ness, but I don’t think that would make much of a differ­ence here.

Note that even if AI re­searchers do this similarly to other groups of peo­ple, that doesn’t change the con­clu­sion that there are dis­tor­tions that push to­wards shorter timelines.

• Sorry in ad­vance for how un­helpful this is go­ing to be. I think de­com­pos­ing an agent into “goals”, “world-model”, and “plan­ning” is the wrong way to be de­com­pos­ing agents. I hope to write a post about this soon.

• I think I’m un­der­stand­ing you to be con­cep­tu­al­iz­ing a di­chotomy be­tween “un­cer­tainty over a util­ity func­tion” vs. “look­ing for the one true util­ity func­tion”.

Well, I don’t per­son­ally en­dorse this. I was spec­u­lat­ing on what might be rele­vant to Stu­art’s un­der­stand­ing of the prob­lem.

I was try­ing to point to­wards the di­chotomy be­tween “act­ing while hav­ing un­cer­tainty over a util­ity func­tion” vs. “act­ing with a known, cer­tain util­ity func­tion” (see e.g. The Off-Switch Game). I do know about the prob­lem of fully up­dated defer­ence and I don’t know what Stu­art thinks about it.
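
A minimal numerical sketch of that dichotomy, assuming a simplified version of the off-switch setup: the robot can act, switch itself off (worth 0), or defer to a human who is assumed to know the true utility U and to permit the action exactly when U > 0. The belief distribution and all names here are hypothetical.

```python
import random


def option_values(utility_samples):
    """Expected value of each robot option under its belief about U.

    Simplifying assumption: the human knows U and permits the action
    exactly when U > 0; switching off is worth 0.
    """
    n = len(utility_samples)
    act = sum(utility_samples) / n                          # act without asking
    off = 0.0                                               # switch itself off
    defer = sum(max(u, 0.0) for u in utility_samples) / n   # let the human decide
    return {"act": act, "off": off, "defer": defer}


rng = random.Random(0)
uncertain = [rng.gauss(0.5, 2.0) for _ in range(10_000)]  # hypothetical belief over U
certain = [0.5] * 10_000                                  # known, certain utility

uncertain_values = option_values(uncertain)
certain_values = option_values(certain)
print(uncertain_values)
print(certain_values)
```

Under these assumptions, deferring is strictly better than acting when the robot is uncertain about U, while with a known utility deferring gains nothing over acting directly.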

Also, for what it’s worth, in the case where there is an uniden­ti­fi­a­bil­ity prob­lem, as there is here, even in the limit, a Bayesian agent won’t con­verge to cer­tainty about a util­ity func­tion.

Agreed, but I’m not sure why that’s rele­vant. Why do you need cer­tainty about the util­ity func­tion, if you have cer­tainty about the policy?
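
A toy illustration of the gap between certainty about the utility function and certainty about the policy, as a sketch under stated assumptions: two hypothetical utility functions related by a positive affine transformation, and a hypothetical human who picks the best action with probability 1 - noise and otherwise picks uniformly at random. Both hypotheses predict the same behavior, so the posterior never moves, yet both recommend the same action.

```python
def posterior_after(observations, hypotheses, prior, noise=0.1):
    """Bayesian update over utility hypotheses from observed human choices.

    Assumes a hypothetical human who picks the best action with
    probability 1 - noise and otherwise picks uniformly at random.
    """
    post = list(prior)
    for obs in observations:
        for i, utility in enumerate(hypotheses):
            best = max(utility, key=utility.get)
            n = len(utility)
            likelihood = (1 - noise) * (obs == best) + noise / n
            post[i] *= likelihood
        # Renormalize after each observation to avoid underflow.
        total = sum(post)
        post = [p / total for p in post]
    return post


u1 = {"a": 1.0, "b": 0.0, "c": -1.0}
u2 = {k: 2 * v + 3 for k, v in u1.items()}  # positive affine transform of u1
hypotheses = [u1, u2]

post = posterior_after(["a"] * 1000, hypotheses, prior=[0.5, 0.5])
policies = {max(u, key=u.get) for u in hypotheses}
print(post)      # stays at the prior
print(policies)  # a single action: the policy is certain
```

No amount of data separates `u1` from `u2` here, but an agent choosing among these actions does not need to separate them.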

• Does this not sound like a plan of run­ning (C)IRL to get the one true util­ity func­tion?

I do not think that is ac­tu­ally his plan, but I agree it sounds like it. One caveat is that I think the un­cer­tainty over prefer­ences/​re­wards is key to this story, which is a bit differ­ent from get­ting a sin­gle true util­ity func­tion.

But re­ally my an­swer is, the in­fer­en­tial dis­tance be­tween Stu­art and the typ­i­cal reader of this fo­rum is very large. (The in­fer­en­tial dis­tance be­tween Stu­art and me is very large.) I sus­pect he has very differ­ent em­piri­cal be­liefs, such that you could rea­son­ably say that he’s work­ing on a “differ­ent prob­lem”, in the same way that MIRI and I work on rad­i­cally differ­ent stuff mostly due to differ­ent em­piri­cal be­liefs.

• Planned sum­mary:

This post ar­gues that AI re­searchers and AI or­ga­ni­za­tions have an in­cen­tive to pre­dict that AGI will come soon, since that leads to more fund­ing, and so we should ex­pect timeline es­ti­mates to be sys­tem­at­i­cally too short. Be­sides the con­cep­tual ar­gu­ment, we can also see this in the field’s re­sponse to crit­ics: both his­tor­i­cally and now, crit­i­cism is of­ten met with coun­ter­ar­gu­ments based on “style” rather than en­gag­ing with the tech­ni­cal meat of the crit­i­cism.

Planned opinion:

I agree with the con­cep­tual ar­gu­ment, and I think it does hold in prac­tice, quite strongly. I don’t re­ally agree that the field’s re­sponse to crit­ics im­plies that they are bi­ased to­wards short timelines—see these com­ments. Nonethe­less, I’m go­ing to do ex­actly what this post cri­tiques, and say that I put sig­nifi­cant prob­a­bil­ity on short timelines, but not ex­plain my rea­sons (be­cause they’re com­pli­cated and I don’t think I can con­vey them, and cer­tainly can’t con­vey them in a small num­ber of words).

• My main point is that IRL, as it is typ­i­cally de­scribed, feels nearly com­plete: just throw in a more ad­vanced RL al­gorithm as a sub­rou­tine and some nar­row-AI-type add-on for iden­ti­fy­ing hu­man ac­tions from a video feed, and voila, we have a su­per­hu­man hu­man helper.
But maybe we could be spend­ing more effort try­ing to fol­low through to fully speci­fied pro­pos­als which we can prop­erly put through the gaunt­let.

Re­gard­less of whether it is in­tended or not, this sounds like a dig at CHAI’s work. I do not think that IRL is “nearly com­plete”. I ex­pect that re­searchers who have been at CHAI for at least a year do not think that IRL is “nearly com­plete”. I wrote a se­quence partly for the pur­pose of tel­ling ev­ery­one “No, re­ally, we don’t think that we just need to run IRL to get the one true util­ity func­tion; we aren’t even in­ves­ti­gat­ing that plan”.

(Sorry, this shouldn’t be di­rected just at you in par­tic­u­lar. I’m an­noyed at how of­ten I have to ar­gue against this per­cep­tion, and this pa­per hap­pened to prompt me to ac­tu­ally write some­thing.)

Also, I don’t agree that “see if an AIXI-like agent would be al­igned” is the cor­rect “gaunt­let” to be think­ing about; that kind of al­ign­ment seems doomed to me, but in any case the AI sys­tems we ac­tu­ally build are not go­ing to look any­thing like that.

• Strongly agree. Another benefit is that it ex­poses you to a broader swath of the world, which makes your mod­els of the world bet­ter /​ more gen­er­al­iz­able. I of­ten feel like the ra­tio­nal­ist com­mu­nity has “be­liefs about peo­ple” that I think only ap­ply to a small sub­set of peo­ple, e.g.

• Peo­ple need to find mean­ing in their jobs to be happy

• Every­one thinks that the thing that they are do­ing is “good for the world” or “morally right” (as op­posed to think­ing that the thing they are do­ing is jus­tifi­able /​ rea­son­able to do)

• I see, so the ar­gu­ment is mostly that jobs are performed more sta­bly and so you can learn bet­ter how to deal with the prin­ci­pal-agent prob­lems that arise. This seems plau­si­ble.

• I don’t think that’s it. The in­fer­ence I most dis­agree with is “ra­tio­nal­ity must have a sim­ple core”, or “Oc­cam’s ra­zor works on ra­tio­nal­ity”. I’m sure there’s some mean­ing of “fun­da­men­tal” or “episte­molog­i­cally ba­sic” such that I’d agree that ra­tio­nal­ity has that prop­erty, but that doesn’t en­tail “ra­tio­nal­ity has a sim­ple core”.

• The core of my in­tu­ition is that with differ­ent op­ti­mized AIs, it will be straight­for­ward to de­ter­mine ex­actly what the prin­ci­pal-agent prob­lem con­sists of, and this can be com­pen­sated for.

I feel like it is not too hard to de­ter­mine prin­ci­pal-agent prob­lems with hu­mans ei­ther? It’s just hard to ad­e­quately com­pen­sate for them.

• Would you as­so­ci­ate “am­bi­tious value learn­ing vs. ad­e­quate value learn­ing” with “works in the­ory vs. doesn’t work in the­ory but works in prac­tice”?

Po­ten­tially. I think the main ques­tion is whether ad­e­quate value learn­ing will work in prac­tice.

• More­over, there is a core differ­ence be­tween the growth of the cost of brain size be­tween hu­mans and AI (sub­lin­ear vs lin­ear).

Ac­tu­ally, I was imag­in­ing that for hu­mans the cost of brain size grows su­per­lin­early. The pa­per you linked uses a quadratic func­tion, and also tried an ex­po­nen­tial and found similar re­sults.

But in the world where AI dev faces hard­ware con­straints, so­cial learn­ing will be much more use­ful.

Agreed if the AI uses so­cial learn­ing to learn from hu­mans, but that only gets you to hu­man-level AI. If you want to ar­gue for some­thing like fast take­off to su­per­in­tel­li­gence, you need to talk about how the AI learns in­de­pen­dently of hu­mans, and in that set­ting so­cial learn­ing won’t be use­ful given lin­ear costs.

E.g. Suppose that each unit of adaptive knowledge requires one unit of asocial learning. Every unit of learning costs $K, regardless of brain size, so that everything is linear. No matter how much social learning you have, the discovery of N units of knowledge is going to cost $KN, so the best thing you can do is put N units of asocial learning in a single brain/model so that you don’t have to pay any cost for social learning.

In contrast, if N units of asocial learning in a single brain costs $KN^2, then having N units of asocial learning in a single brain/model is very expensive. You can instead have N separate brains each with 1 unit of asocial learning, for a total cost of $KN, and that is enough to discover the N units of knowledge. You can then invest a unit or two of social learning for each brain/model so that they can all accumulate the N units of knowledge, giving a total cost that is still linear in N.

I’m claiming that AI is more like the former while this paper’s model is more like the latter. Higher hardware constraints only change the value of K, which doesn’t affect this analysis.
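
The comparison can be checked with a small worked example. This is a sketch under assumptions that are not in the paper: N is the number of units of knowledge to discover, K is the cost per unit of learning, a brain holding m units of learning costs K·m^exponent, and each small brain pays for 2 extra units of social learning. The function names are hypothetical.

```python
def single_brain_cost(n, k, exponent):
    """Cost of putting n units of asocial learning into one brain/model."""
    return k * n ** exponent


def many_brains_cost(n, k, exponent, social_units=2):
    """Cost of n brains, each with 1 unit of asocial learning plus a few
    units of social learning so every brain can accumulate all n units."""
    per_brain = k * (1 + social_units) ** exponent
    return n * per_brain


K, N = 1.0, 100

# Linear costs (exponent 1): the single brain is cheaper.
print(single_brain_cost(N, K, 1), many_brains_cost(N, K, 1))

# Quadratic costs (exponent 2): many small brains with social learning win.
print(single_brain_cost(N, K, 2), many_brains_cost(N, K, 2))
```

Changing K rescales every number by the same factor, so it does not change which option is cheaper in either regime.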