Vika comments on Looking back on my alignment PhD

Vika 1 Jul 2022 16:46 UTC
26 points
15
Thanks Alex for writing this. I think the social failure modes you described in the Mistakes section are all too common, and I’ve often found myself held back by these.
I agree that impact measures are not super useful for alignment (apart from deconfusion) and I’ve also moved on from working on this topic. Improving our understanding of power-seeking seems pretty useful though, so I’m curious why you wish you had stopped working on it sooner.
- TurnTrout 3 Jul 2022 1:13 UTC
  11 points
  −7
  Research on power-seeking tendencies is more useful than nothing, but consider the plausibility of the following retrospective: “AI alignment might not have been solved except for TurnTrout’s deconfusion of power-seeking tendencies.” Doesn’t sound like something which would actually happen in reality, does it?
  EDIT: Note this kind of visualization is not always valid—it’s easy to diminish a research approach by reframing it—but in this case I think it’s fine and makes my point.
  - Vika 4 Jul 2022 18:37 UTC
    11 points
    4
    I think it’s plausible that the alignment community could figure out how to build systems without power-seeking incentives, or with power-seeking tendencies limited to some safe set of options, by building on your formalization, so the retrospective seems plausible to me.
    In addition, this work is useful for convincing ML people that alignment is hard, which helps to lay the groundwork for coordinating the AI community to not build AGI. I’ve often pointed researchers at DM (especially RL people) to your power-seeking paper when trying to explain convergent instrumental goals (a formal NeurIPS paper makes a much better reference for that audience than Basic AI Drives).