Thanks Alex for writing this. I think the social failure modes you described in the Mistakes section are all too common, and I’ve often found myself held back by these.
I agree that impact measures are not super useful for alignment (apart from deconfusion) and I’ve also moved on from working on this topic. Improving our understanding of power-seeking seems pretty useful though, so I’m curious why you wish you had stopped working on it sooner.
Research on power-seeking tendencies is more useful than nothing, but consider the plausibility of the following retrospective: “AI alignment might not have been solved except for TurnTrout’s deconfusion of power-seeking tendencies.” Doesn’t sound like something which would actually happen in reality, does it?
EDIT: Note this kind of visualization is not always valid—it’s easy to diminish a research approach by reframing it—but in this case I think it’s fine and makes my point.
I think it’s plausible that the alignment community could figure out how to build systems without power-seeking incentives, or with power-seeking tendencies limited to some safe set of options, by building on your formalization, so the retrospective seems plausible to me.
In addition, this work is useful for convincing ML people that alignment is hard, which helps to lay the groundwork for coordinating the AI community to not build AGI. I’ve often pointed researchers at DM (especially RL people) to your power-seeking paper when trying to explain convergent instrumental goals (a formal NeurIPS paper makes a much better reference for that audience than Basic AI Drives).