Among other things, this post promotes the thesis that (single/single) AI alignment is insufficient for AI existential safety and that the AI risk community’s current focus on AI alignment is excessive. I’ll try to recap the idea the way I think of it.
We can roughly identify 3 dimensions of AI progress: AI capability, atomic AI alignment and social AI alignment. Here, atomic AI alignment is the ability to align a single AI system with a single user, whereas social AI alignment is the ability to align the sum total of AI systems with society as a whole. Depending on the relative rates at which those 3 dimensions develop, there are roughly 3 possible outcomes (of course, in reality it’s probably more of a spectrum):
Outcome A: The classic “paperclip” scenario. Progress in atomic AI alignment doesn’t keep up with progress in AI capability. Transformative AI is unaligned with any user; as a result, the future contains virtually nothing of value to us.
Outcome B: Progress in atomic AI alignment keeps up with progress in AI capability, but progress in social AI alignment doesn’t keep up. Transformative AI is aligned with a small fraction of the population, resulting in this minority gaining absolute power and abusing it to create an extremely inegalitarian future. Wars between different factions are also a concern.
Outcome C: Both atomic and social alignment keep up with AI capability. Transformative AI is aligned with society/humanity as a whole, resulting in a benevolent future for everyone.
Ideally, Outcome C is the outcome we want (with the exception of people who decided to gamble on being part of the elite in outcome B). Arguably, C > B > A (although it’s possible to imagine scenarios in which B < A). How does this translate into research priorities? This depends on several parameters:
The “default” pace of progress in each dimension: e.g. if we assume atomic AI alignment will be solved in time anyway, then we should focus on social AI alignment.
The inherent difficulty of each dimension: e.g. if we assume atomic AI alignment is relatively hard (and will therefore take a long time to solve) whereas social AI alignment becomes relatively easy once atomic AI alignment is solved, then we should focus on atomic AI alignment.
The extent to which each dimension depends on others: e.g. if we assume it’s impossible to make progress in social AI alignment without reaching some milestone in atomic AI alignment, then we should focus on atomic AI alignment for now. Similarly, some have argued we shouldn’t work on alignment at all before making more progress in capability.
More precisely, the last two can be modeled jointly as the cost of marginal progress in a given dimension as a function of total progress in all dimensions.
The extent to which outcome B is bad for people not in the elite: If it’s not too bad then it’s more important to prevent outcome A by focusing on atomic AI alignment, and vice versa.
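One way to write down the joint model of difficulty and dependence mentioned above is sketched here. The symbols are my own notation, not the OP’s, and this is only meant to make the phrase “cost of marginal progress as a function of total progress” concrete:

```latex
% Assumed notation: x is the vector of total progress in the three dimensions.
x = (x_{\mathrm{cap}},\, x_{\mathrm{atom}},\, x_{\mathrm{soc}}), \qquad
c_i(x) = \text{cost of a marginal unit of progress in dimension } i \text{ at state } x

% Inherent difficulty corresponds to the overall size of c_i.
% Dependence on other dimensions corresponds to how c_i varies with x_j for j \neq i, e.g.
\frac{\partial c_{\mathrm{soc}}}{\partial x_{\mathrm{atom}}} < 0
\quad \text{(social alignment gets cheaper as atomic alignment advances)}
```

Under this reading, the “hard now, easy later” assumption about social alignment is a claim about the sign and size of that cross-derivative, and the “default pace” parameter is a claim about how fast x moves without our intervention.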
The OP’s conclusion seems to be that social AI alignment should be the main focus. Personally, I’m less convinced. It would be interesting to see more detailed arguments about the above parameters that support or refute this thesis.
Outcome B: Progress in atomic AI alignment keeps up with progress in AI capability, but progress in social AI alignment doesn’t keep up. Transformative AI is aligned with a small fraction of the population, resulting in this minority gaining absolute power and abusing it to create an extremely inegalitarian future. Wars between different factions are also a concern.
It’s unclear to me how this particular outcome relates to social alignment (or at least to the kinds of research areas in this post). Some possibilities:
Does failure to solve social alignment mean that firms and governments cannot use AI to represent their shareholders and constituents? Why might that be? (E.g. what’s a plausible approach to atomic alignment that couldn’t be used by a firm or government?)
Does AI progress occur unevenly such that some group gets much more power/profit, and then uses that power? If so, how would technical progress on alignment help address that outcome? (Why would the group with power be inclined to use whatever techniques we’re imagining?) Also, why does this happen?
Does AI progress somehow complicate the problem of governance or corporate governance such that those organizations can no longer represent their constituents/shareholders? What is the mechanism (or any mechanism) by which this happens? Does social alignment help by making new forms of organization possible, and if so should I just be thinking of it as a way of improving those institutions, or is it somehow distinctive?
Do we already believe that the situation is gravely unequal (e.g. because governments can’t effectively represent their constituents and most people don’t have a meaningful amount of capital) and AI progress will exacerbate that situation? How does social alignment prevent that?
(This might make more sense as a question for the OP; it just seemed easier to engage with this comment since it describes a particular, more concrete possibility. My sense is that the OP may be more concerned about failures in which no one gets what they want rather than outcome B per se.)
Outcome C is most naturally achieved using “direct democracy” TAI, i.e. one that collects inputs from everyone and aggregates them in a reasonable way. We can try emulating democratic AI via single user AI, but that’s hard because:
If the number of AIs is small, the AI interface becomes a single point of failure: an actor that can hijack the interface will have enormous power.
If the number of AIs is small, it might be unclear what inputs should be fed into the AI in order to fairly represent the collective. It requires “manually” solving the preference aggregation problem, and faults of the solution might be amplified by the powerful optimization to which it is subjected.
If the number of AIs is more than one then we should make sure the AIs are good at cooperating, which requires research about multi-AI scenarios.
If the number of AIs is large (e.g. one per person), we need the interface to be sufficiently robust that people can use it correctly without special training. Also, this might be prohibitively expensive.
Designing democratic AI requires good theoretical solutions for preference aggregation and the associated mechanism design problem, and good practical solutions for making it easy to use and hard to hack. Moreover, we need to get the politicians to implement those solutions. Regarding the latter, the OP argues that certain types of research can help lay the foundation by providing actionable regulation proposals.
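To illustrate why “manually” solving preference aggregation is delicate, here is a minimal sketch with made-up ballots. The ballots, option names and function names are all hypothetical; plurality and Borda count are just two standard voting rules chosen for the example, not anything the OP proposes. The point is only that two reasonable-looking rules can pick different winners from the same inputs, so whatever rule gets fed into a powerful optimizer matters.

```python
from collections import Counter

# Hypothetical ballots: each ballot ranks the options from most to least preferred.
ballots = (
    [["A", "B", "C"]] * 4
    + [["B", "C", "A"]] * 3
    + [["C", "B", "A"]] * 2
)

def plurality_winner(ballots):
    """Return the option with the most first-place votes."""
    counts = Counter(ranking[0] for ranking in ballots)
    # Iterating over sorted options makes ties resolve alphabetically.
    return max(sorted(counts), key=lambda option: counts[option])

def borda_winner(ballots):
    """Return the option with the highest Borda score.

    With n options, a ballot gives n-1 points to its top choice,
    n-2 to the next, and so on down to 0 for the last.
    """
    scores = Counter()
    for ranking in ballots:
        n = len(ranking)
        for position, option in enumerate(ranking):
            scores[option] += n - 1 - position
    return max(sorted(scores), key=lambda option: scores[option])

print("plurality:", plurality_winner(ballots))
print("borda:", borda_winner(ballots))
```

On these ballots plurality selects A (4 first-place votes), while Borda selects B (12 points against 8 for A and 7 for C).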
My sense is that the OP may be more concerned about failures in which no one gets what they want rather than outcome B per se
Well, the OP did say:
(2) is essentially aiming to take over the world in the name of making it safer, which is not generally considered the kind of thing we should be encouraging lots of people to do.
I understood it as hinting at outcome B, but I might be wrong.
Outcome C is most naturally achieved using “direct democracy” TAI, i.e. one that collects inputs from everyone and aggregates them in a reasonable way. We can try emulating democratic AI via single user AI, but that’s hard because:
I’m not sure what’s most natural, but I do consider this a fairly unlikely way of achieving outcome C.
I think the best argument for this kind of outcome is from Wei Dai, but I don’t think it gets you close to the “direct democracy” outcome. (Even if you had state control and AI systems aligned with the state, it seems unlikely and probably undesirable for the state to be replaced with an aggregation procedure implemented by the AI itself.)
A lot depends on AI capability as a function of cost and time. On one extreme, there might be enough increasing returns to get a singleton: some combination of extreme investment and algorithmic advantage produces extremely powerful AI, whereas moderate investment or no algorithmic advantage doesn’t even produce moderately powerful AI. Whoever controls the singleton has all the power. On the other extreme, returns don’t rise much, resulting in personal AIs having as much collective power as corporate/government AIs, or more. In the middle, there are many powerful AIs but still not nearly as many as people.
In the first scenario (the singleton), to get outcome C we need the singleton to either be democratic by design, or have a very sophisticated and robust system of controlling access to it.
In the last scenario (the middle one), the free market would lead to outcome B. Corporate and government actors use their access to capital to gain power through AI until the rest of the population becomes irrelevant. Effectively, AI serves as an extreme amplifier of pre-existing power differentials. Arguably, the only way to get outcome C is enforcing democratization of AI through regulation. If this seems extreme, compare it to the way our society handles physical violence. The state has a monopoly on violence, and with good reason: without this monopoly, upholding the law would be impossible. But, in the age of superhuman AI, traditional means of violence are irrelevant. The only important weapon is AI.
In the second scenario (personal AIs), we can manage without multi-user alignment. However, we still need to have multi-AI alignment, i.e. make sure the AIs are good at coordination problems. It’s possible that any sufficiently capable AI is automatically good at coordination problems, but it’s not guaranteed. (Incidentally, if atomic alignment is flawed then it might actually be better for the AIs to be bad at coordination.)
The OP’s conclusion seems to be that social AI alignment should be the main focus. Personally, I’m less convinced. It would be interesting to see more detailed arguments about the above parameters that support or refute this thesis.
Thanks for the feedback, Vanessa. I’ve just written a follow-up post to better illustrate a class of societal-scale failure modes (“unsafe robust agent-agnostic processes”) that constitutes the majority of the probability mass I currently place on human extinction precipitated by transformative AI advancements (especially AGI, and/or high-level machine intelligence in the language of Grace et al.). Here it is:
https://www.alignmentforum.org/posts/LpM3EAakwYdS6aRKf/what-multipolar-failure-looks-like-and-robust-agent-agnostic
I’d be curious to see if it convinces you that what you call “social alignment” should be our main focus, or at least a much greater focus than currently.
with the exception of people who decided to gamble on being part of the elite in outcome B
Game-theoretically, there’s a better way. Assume that after winning the AI race, it is easy to figure out everyone else’s win probability, utility function and what they would do if they won. Human utility functions have diminishing returns, so there’s opportunity for acausal trade. Human ancestry gives a common notion of fairness, so the bargaining problem is easier than with aliens.
Most of us care at least somewhat even about those who would take all for themselves, so instead of giving them the choice between none and a lot, we can give them the choice between some and a lot: the smaller their win probability, the smaller the gap can be while still incentivizing cooperation.
Therefore, the AI race game is not all or nothing. The more win probability lands on parties that can bargain properly, the less multiversal utility is burned.
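The diminishing-returns step can be checked with a toy calculation. This is a sketch under assumptions that are mine, not the comment’s: utility is the square root of the fraction of resources controlled, the race is winner-takes-all, and the “deal” gives each party a guaranteed share equal to its win probability. Real bargaining solutions would differ; the sketch only shows why a concave utility makes a sure share preferable to the gamble.

```python
import math

def utility(share):
    # Assumed concave (diminishing-returns) utility over the fraction of resources held.
    return math.sqrt(share)

def expected_utility_of_gamble(win_probability):
    # Winner-takes-all race: share 1 with probability p, share 0 otherwise.
    return win_probability * utility(1.0) + (1 - win_probability) * utility(0.0)

def expected_utility_of_deal(win_probability):
    # Hypothetical deal: a guaranteed share equal to one's win probability.
    return utility(win_probability)

for p in (0.01, 0.1, 0.5):
    gamble = expected_utility_of_gamble(p)
    deal = expected_utility_of_deal(p)
    print(f"p={p}: gamble={gamble:.3f}, deal={deal:.3f}")
```

For any win probability strictly between 0 and 1, the deal beats the gamble under this utility, which is the sense in which there are gains from trade for every party.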
Good point, acausal trade can at least ameliorate the problem, pushing towards atomic alignment. However, we understand acausal trade too poorly to be highly confident it will work. And, “making acausal trade work” might in itself be considered outside of the desiderata of atomic alignment (since it involves multiple AIs). Moreover, there are also actors that have a very low probability of becoming TAI users but whose support is beneficial for TAI projects (e.g. small donors). Since they have no counterfactual AI to bargain on their behalf, it is less likely acausal trade works here.
Yeah, I basically hope that enough people care about enough other people that some of the wealth ends up trickling down to everyone. Win probability is basically interchangeable with other people caring about you and your resources across the multiverse. Good thing the cosmos is so large.
I don’t think making acausal trade work is that hard. All that is required is:
That the winner cares about the counterfactual versions of themselves that didn’t win, or equivalently, is unsure whether they’re being simulated by another winner. (Huh, one could actually impact this through memetic work today, though messing with people’s preferences like that doesn’t sound friendly.)
That they think to simulate alternate winners before they expand too far to be simulated.