I’m more active on Twitter than LW/AF these days: https://twitter.com/DavidSKrueger
Bio from https://www.davidscottkrueger.com/:
I am an Assistant Professor at the University of Cambridge and a member of Cambridge’s Computational and Biological Learning lab (CBL). My research group focuses on Deep Learning, AI Alignment, and AI Safety. I’m broadly interested in work (including in areas outside of Machine Learning, e.g. AI governance) that could reduce the risk of human extinction (“x-risk”) resulting from out-of-control AI systems. Particular interests include:
Reward modeling and reward gaming
Aligning foundation models
Understanding learning and generalization in deep learning and foundation models, especially via “empirical theory” approaches
Preventing the development and deployment of socially harmful AI systems
Elaborating and evaluating speculative concerns about more advanced future AI systems
First, RE the role of “solving alignment” in this discussion, I just want to note that:
1) I disagree that alignment solves gradual disempowerment problems.
2) Even if it did, that does not imply that gradual disempowerment problems aren’t important (since we can’t assume alignment will be solved).
3) I’m not sure what you mean by “alignment is solved”; I’m taking it to mean “AI systems can be trivially intent aligned”. Such a system may still say things like “Well, I can build you a successor that I think has only a 90% chance of being aligned, but will make you win (e.g. survive) if it is aligned. Is that what you want?” and people can respond with “yes”—this is the sort of thing that probably still happens IMO.
4) Alternatively, you might say we’re in the “alignment basin”—I’m not sure what that means, precisely, but I would operationalize it as something like “the AI system is playing a roughly optimal CIRL game” (see the toy sketch after this list). It’s unclear how well that can perform in practice (e.g. it can’t actually be optimal due to compute limitations), but I suspect it still leaves significant room for fuck-ups.
5) I’m more interested in the case where alignment is not “perfectly” “solved”, and so there are simply clear and obvious opportunities to trade off safety and performance; I think this is much more realistic to consider.
6) I expect such trade-off opportunities to persist when it comes to assurance (even if alignment is solved), since I expect high-quality assurance to be extremely costly. And it is irresponsible (because it’s subjectively risky) to trust a perfectly aligned AI system absent strong assurances. But of course, people who are willing to YOLO it and just say “seems aligned, let’s ship” will win. This is also part of the problem...
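Re point 4, here is a minimal sketch of the bounded-optimality worry, assuming a deliberately trivialized assistance game; the two-action setup, the answer-noise level, and the query budget are all made-up parameters of my illustration, not a real CIRL implementation:

```python
import math
import random

def assistance_game(query_budget, answer_noise=0.35, trials=20000, seed=0):
    """Trivialized stand-in for a "roughly optimal CIRL game" (illustrative only).

    The AI is unsure which of two actions the human prefers (uniform prior).
    It can ask `query_budget` yes/no questions, each answered incorrectly with
    probability `answer_noise` (standing in for communication limits / bounded
    rationality), does an exact Bayesian update, then takes the most probable
    action. Returns the fraction of runs where it takes the dispreferred action.
    """
    rng = random.Random(seed)
    step = math.log((1 - answer_noise) / answer_noise)
    mistakes = 0
    for _ in range(trials):
        true_pref = rng.choice([0, 1])
        log_odds = 0.0  # posterior log-odds that action 1 is preferred
        for _ in range(query_budget):
            answer_is_one = (true_pref == 1) != (rng.random() < answer_noise)
            log_odds += step if answer_is_one else -step
        action = 1 if log_odds > 0 else 0
        mistakes += (action != true_pref)
    return mistakes / trials

if __name__ == "__main__":
    for budget in (1, 3, 9):
        print(f"queries={budget}: error rate ~ {assistance_game(budget):.3f}")
```

Even with exact Bayesian updates, a finite interaction budget leaves a floor on the error rate; that is the sense in which “roughly optimal” play under realistic limits still leaves room for consequential mistakes.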
My main response, at a high level:
Consider a simple model:
We have 2 human/AI teams in competition with each other, A and B.
A and B both start out with the humans in charge, and then decide whether the humans should stay in charge for the next week.
Whichever group has more power at the end of the week survives.
The humans in A ask their AI to make A as powerful as possible at the end of the week.
The humans in B ask their AI to make B as powerful as possible at the end of the week, subject to the constraint that the humans in B are sure to stay in charge.
I predict that group A survives, but the humans are no longer in power. I think this illustrates the basic dynamic. ETA: Do you understand what I’m getting at? Can you explain what you think is wrong with thinking of it this way?
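To make the dynamic concrete, here is a minimal toy simulation (my own illustration, iterating the one-week decision over many weeks; the growth rates and the constant “oversight tax” that team B pays for keeping humans in charge are made-up parameters, not anything from the discussion above):

```python
import random

def simulate(n_weeks=52, seed=0):
    """Toy, iterated version of the two-team model above (illustrative only).

    Assumed dynamics: each week both teams' power compounds; team B pays a
    constant "oversight tax" for keeping humans in charge, so its weekly
    growth factor is strictly smaller than team A's.
    """
    rng = random.Random(seed)
    power_a, power_b = 1.0, 1.0      # both teams start with equal power
    growth_a = 1.10                  # A: AI maximizes power, no constraint
    oversight_tax = 0.03             # B: cost of keeping humans in charge
    growth_b = growth_a - oversight_tax
    for _ in range(n_weeks):
        # small random shocks, so the gap isn't literally deterministic
        power_a *= growth_a * rng.uniform(0.98, 1.02)
        power_b *= growth_b * rng.uniform(0.98, 1.02)
    winner = "A (humans no longer in charge)" if power_a > power_b else "B (humans in charge)"
    return power_a, power_b, winner

if __name__ == "__main__":
    pa, pb, winner = simulate()
    print(f"power_A={pa:.2f}  power_B={pb:.2f}  ->  survivor: {winner}")
```

Any persistent tax compounds, so A reliably ends up ahead; that selection pressure is the basic dynamic the model is meant to illustrate.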
Responding to some particular points below:
Yes, they do; they result in bureaucracies and automated decision-making systems obtaining power. People were already having to implement and interact with stupid automated decision-making systems before AI came along.
My main claim was not that these are mechanisms of human disempowerment (although I think they are), but rather that they are indicators of the overall low level of functionality of the world.