[Question] What are some exercises for building/generating intuitions about key disagreements in AI alignment?

I am interested in forming my own opinions on more of the key disagreements within the AI alignment field, such as whether there is a basin of attraction for corrigibility, whether there is a theory of rationality precise enough to build hierarchies of abstraction, and to what extent there will be a competence gap.

In “Is That Your True Rejection?”, Eliezer Yudkowsky wrote:

I suspect that, in general, if two rationalists set out to resolve a disagreement that persisted past the first exchange, they should expect to find that the true sources of the disagreement are either hard to communicate, or hard to expose. E.g.:

  • Uncommon, but well-supported, scientific knowledge or math;

  • Long inferential distances;

  • Hard-to-verbalize intuitions, perhaps stemming from specific visualizations;

  • Zeitgeists inherited from a profession (that may have good reason for it);

  • Patterns perceptually recognized from experience;

  • Sheer habits of thought;

  • Emotional commitments to believing in a particular outcome;

  • Fear that a past mistake could be disproved;

  • Deep self-deception for the sake of pride or other personal benefits.

I am assuming that something like this is happening in the key disagreements in AI alignment. The last three bullet points are somewhat uncharitable to proponents of a particular view, and also seem less likely to me. Summarizing the first six bullet points, I want to say something like: some combination of “innate intuitions” and “life experiences” led, e.g., Eliezer and Paul Christiano to arrive at different opinions. I want to work through a useful subset of those “life experiences”, so that I can come to share some of the same intuitions.

To that end, my question is something like: What fields should I learn? What textbooks/textbook chapters/papers/articles should I read? What historical examples (from the history of AI/ML or from the world at large) should I spend time thinking about? (The more specific the resource, the better.) What intuitions should I expect to build by going through each resource? In the question title I am using the word “exercise” pretty broadly.

If you believe one just needs to be born with one set of intuitions rather than another, and that there are no resources I can consume to refine my intuitions, then my question is instead more like: How can I better introspect so as to figure out which side I am on? :)

Some ideas I am aware of:

  • Reading discussions between Eliezer/Paul/other people: I’ve already done a lot of this; it just feels like I am no longer making much progress.

  • Learning more theoretical computer science to pick up the Search For Solutions And Fundamental Obstructions intuition mentioned in this post.
