[Question] What alignment-related concepts should be better known in the broader ML community?

We want to work towards a world in which the alignment problem is a mainstream concern among ML researchers. An important part of this is popularizing alignment-related concepts within the ML community. Here are a few recent examples:

  • Reward hacking / misspecification (blog post)

  • Convergent instrumental goals (paper)

  • Objective robustness (paper)

  • Assistance (paper)

(I’m sure this list is missing many examples; let me know if there are any in particular I should include).

Meanwhile, there are many other things that alignment researchers have been thinking about that are not well known within the ML community. Which concepts would you most want to be more widely known / understood?