[Question] What are some exercises for building/generating intuitions about key disagreements in AI alignment?

I am interested in forming my own opinion about more of the key disagreements within the AI alignment field, such as whether there is a basin of attraction for corrigibility, whether there is a theory of rationality that is sufficiently precise to build hierarchies of abstraction, and to what extent there will be a competence gap.

In “Is That Your True Rejection?”, Eliezer Yudkowsky wrote:

I suspect that, in general, if two rationalists set out to resolve a disagreement that persisted past the first exchange, they should expect to find that the true sources of the disagreement are either hard to communicate, or hard to expose. E.g.:

  • Uncommon, but well-supported, scientific knowledge or math;

  • Long inferential distances;

  • Hard-to-verbalize intuitions, perhaps stemming from specific visualizations;

  • Zeitgeists inherited from a profession (that may have good reason for it);

  • Patterns perceptually recognized from experience;

  • Sheer habits of thought;

  • Emotional commitments to believing in a particular outcome;

  • Fear that a past mistake could be disproved;

  • Deep self-deception for the sake of pride or other personal benefits.

I am assuming that something like this is happening in the key disagreements in AI alignment. The last three bullet points are somewhat uncharitable to proponents of a particular view, and also seem less likely to me. Summarizing the first six bullet points, I want to say something like: some combination of “innate intuitions” and “life experiences” led e.g. Eliezer and Paul Christiano to arrive at different opinions. I want to go through a useful subset of the “life experiences” part, so that I can share some of the same intuitions.

To that end, my question is something like: What fields should I learn? What textbooks/textbook chapters/papers/articles should I read? What historical examples (from the history of AI/ML or from the world at large) should I spend time thinking about? (The more specific the resource, the better.) What intuitions should I expect to build by going through each resource? In the question title I am using the word “exercise” pretty broadly.

If you believe one just needs to be born with one set of intuitions rather than another, and that there are no resources I can consume to refine my intuitions, then my question is instead more like: How can I better introspect so as to find out which side I am on? :)

Some ideas I am aware of:

  • Reading discussions between Eliezer/Paul/other people: I’ve already done a lot of this; it just feels like I am no longer making much progress.

  • Learn more theoretical computer science to pick up the Search For Solutions And Fundamental Obstructions intuition, as mentioned in this post.
