Amplification Discussion Notes

Paul Christiano, Wei Dai, Andreas Stuhlmüller, and I recently had an online chat discussion; the transcript is available here. (Disclaimer: it's a nonstandard format and we weren't optimizing for ease of understanding the transcript.) This discussion was primarily focused on amplification of humans (not later amplification steps in IDA). Below are some highlights from the discussion, and I'll include some questions that were raised that might merit further discussion in the comments.

Highlights

Strategies for sampling from a human distribution of solutions:

Paul: For example you can use "Use random human example," or "find an analogy to another example you know and use it to generate an example," or whatever.
There is some subtlety there, where you want to train the model to sample from the real human distribution rather than from the empirical distribution of the 10 proposals you happen to have collected so far. If samples are cheap that's fine. Otherwise you may need to go further to "Given that [X1, X2, …] are successful designs, what is a procedure that can produce additional successful designs?" or something like that. Not sure.
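To make the distinction between the real human distribution and the small empirical pool concrete, here is a toy sketch in Python (all function names and stub behavior are hypothetical, not from the discussion):

```python
import random

# Toy stand-ins: in the real setting these would be H (the human) and a
# learned imitation model; the names and stub behavior are illustrative only.
HUMAN_STRATEGIES = [
    "use a random human example",
    "find an analogy to another example you know and adapt it",
]

def ask_human_for_proposal(task: str) -> str:
    """Simulate querying a human for one fresh proposal for `task`."""
    strategy = random.choice(HUMAN_STRATEGIES)
    return f"proposal for {task!r} via: {strategy}"

def ask_model(prompt: str) -> str:
    """Simulate querying a model trained to imitate human answers."""
    return f"model answer to: {prompt}"

def sample_proposal(task: str, pool: list[str], samples_are_cheap: bool) -> str:
    """Sample a proposal while trying to track the real human distribution,
    not just the empirical distribution of the few proposals collected so far."""
    if samples_are_cheap:
        # Cheap samples: just draw a fresh proposal each time.
        return ask_human_for_proposal(task)
    # Expensive samples: condition on the successful designs collected so far
    # and ask for a procedure that produces additional successful designs.
    prompt = (f"Given that {pool} are successful designs, "
              "what is a procedure that can produce additional successful designs?")
    return ask_model(prompt)

if __name__ == "__main__":
    pool = [ask_human_for_proposal("design a bridge") for _ in range(3)]
    print(sample_proposal("design a bridge", pool, samples_are_cheap=False))
```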

Dealing with unknown concepts

Andreas: Suppose you get a top-level command that contains words that H doesn't understand (or just doesn't look at), say something like "Gyre a farbled bleg." You have access to some data source that is in principle enough to learn the meanings of those words. What might the first few levels of questions + answers look like?
Paul: Possible questions: "What's the meaning of the command?", which goes to "What's the meaning of word X?" for the words X in the sentence, "What idiomatic constructions are involved in this sentence?", "What grammatical constructions are involved in the sentence?"
Answers to those questions are big trees representing meanings, e.g. a list of properties of "gyre" (what properties the subject and object typically have, under what conditions it is said to have occurred, why someone might want you to do it, tons of stuff most of which will be irrelevant for the query)
These come from looking up definitions, proposing definitions and seeing how well they match with usage in the cases you can look at, etc.
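As a rough illustration of what the first level or two of that decomposition might look like, here is a toy sketch (the question templates follow the discussion above; the function name and data structure are made up):

```python
import pprint

def decompose(command: str) -> dict:
    """Toy first-level question decomposition for a command with unknown words."""
    words = command.rstrip(".").split()
    return {
        "question": f"What's the meaning of the command {command!r}?",
        "subquestions": [
            {
                "question": f"What's the meaning of word {w!r}?",
                "subquestions": [
                    f"What definitions of {w!r} can be looked up in the data source?",
                    f"How well does each proposed definition match observed usage of {w!r}?",
                ],
            }
            for w in words
        ] + [
            f"What idiomatic constructions are involved in {command!r}?",
            f"What grammatical constructions are involved in {command!r}?",
        ],
    }

if __name__ == "__main__":
    pprint.pprint(decompose("Gyre a farbled bleg."))
```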

Limits on what amplification can accomplish

Paul: In general, if ML can't learn to do a task, then that's fine with me. And if ML can learn to do a task but only using data source X, then we are going to have to integrate data source X into the amplification process in order for amplification to be able to solve it; there is no way to remove the dependence on arbitrary data sources. And there will exist data sources which pose alignment issues, independent of any alignment issues posed by the ML.

Alignment search for creative solutions

The task of generating a solution to a problem that requires creativity can be decomposed into:

Generate solutions

Evaluate those solutions

For solution generation, one idea is to shape the distribution of proposals so you are less likely to get malign answers (i.e., sample from the distribution of answers a human would give, which would hopefully be more likely to be safe/easily evaluated compared to some arbitrary distribution).
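A minimal sketch of this generate-then-evaluate decomposition (the function and parameter names are hypothetical; `evaluate` is a stand-in for the secure X-evaluation discussed below):

```python
from typing import Callable, Iterable, Optional

def solve_creative_task(
    task: str,
    humanlike_proposals: Callable[[str, int], Iterable[str]],  # shaped, human-like proposal distribution
    evaluate: Callable[[str, str], float],                      # stand-in for secure X-evaluation
    n_proposals: int = 10,
    threshold: float = 0.5,
) -> Optional[str]:
    """Generate proposals from a (hopefully less malign) human-like distribution,
    then evaluate them and return the best one the evaluator endorses."""
    best, best_score = None, float("-inf")
    for proposal in humanlike_proposals(task, n_proposals):
        score = evaluate(task, proposal)
        if score > best_score:
            best, best_score = proposal, score
    # Only accept a solution the evaluator scores above some threshold.
    return best if best_score >= threshold else None
```

The key choice in this framing is which distribution `humanlike_proposals` draws from; how much weight the evaluation step can bear is what the rest of the discussion turns on.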

I asked Paul if he thought that safe creative solution generation would require sampling from a less malign distribution, or whether he thought we could solve evaluation ("secure X-evaluation", i.e. testing whether the solution fulfills property X) well enough to use an arbitrary distribution/brute-force search.

Paul: I don't see a good way to avoid solving secure X-evaluation anyway. It seems to me like we can generate solutions in ways that put much lower probability on malign answers, but it neither seems like we can totally eliminate that (I don't think human creativity totally eliminates that either), nor that we will always have access to some more-aligned human generator.
The best I'd probably say is that we can have a generation process that is not itself malign; not clear if that is helpful at all though.

We then dived into how well we could solve secure X-evaluation. I was particularly interested in questions like how we could evaluate whether a design had potentially harmful side effects.

Paul: I think what we want is something like: if the designing process knows that X is bad, then the evaluator will also know it. If the designing process doesn't know that X is bad, then that's not malign.
[to be clear, for this discussion we only need security in the infinite limit; in practice the capability of both sides will be limited by the capability of the ML, so we'll also need something to make sure the evaluating-ML does better than the generator-ML, but that seems like a separate issue.]
William: If you imagine slowly increasing the intelligence of the generator, then for any heuristic, it might first start picking solutions that fulfill that heuristic more often before actually understanding the heuristic, and it might take longer after that before the generator understands that the heuristic works because of a causal pathway that involves negative side effects. Is it the case that you'd say that this is an acceptable outcome/something that we can't really get past?
Paul: If neither the evaluator nor generator knows about the negative side effect, it's hard for the negative side effect to lead to higher evaluations. I agree this can happen sometimes (I wrote the implicit extortion post to give an example, there are certainly others), but they seem OK to accept as "honest mistakes" so far, i.e. none of them pose an existential risk.
In terms of "what amplification is supposed to accomplish," if there is a problem that could just as well afflict a human who is trying their best to help me get what I want (and has the AI's profile of abilities), then I'm basically considering that out of scope.
Whatever we could tell a human to help them avoid this kind of problem, we could also tell an aligned AI, so the problem is factored into (a) help a human avoid the problem, (b) build aligned AI.