Most importantly, currently proposed technical plans are necessary but not sufficient to stop this. Even if the technical side fully succeeds, no one knows what to do with that.
I don’t think that’s quite accurate. In particular, gradual disempowerment is exactly the sort of thing which corrigibility would solve. (At least for “corrigibility” in the sense David and I use the term, and probably Yudkowsky, but not Christiano’s sense; he uses the term to mean a very different thing.)
A general-purpose corrigible AI (in the sense we use the term) is pretty accurately thought of as an extension of the user. Building and using such an AI is much more like “uplifting” the user than like building an independent agent. It’s the cognitive equivalent of gaining prosthetic legs, as opposed to having someone carry you around on a sedan chair. Another way to state it: a corrigible subsystem acts like it’s a part of a larger agent, serving a particular purpose as a component of the larger agent, as opposed to acting like an agent in its own right.
… admittedly corrigibility is still very much in the “conceptual” stage, far from an actual technical plan. But it’s at least a technical research direction which would pretty directly address the disempowerment problem.
I agree, but it is important to note that the authors of the paper disagree here.
(It’s somewhat hard for me to tell whether the crux is that they don’t expect everyone would get AI aligned to them (at least as representatives) even if this were technically feasible with zero alignment tax, or whether the crux is that even if everyone had single-single aligned corrigible AIs representing their interests, with control over their assets and power, that would still result in disempowerment. I think it is more like the second thing.)
So Zvi is accurately representing the perspective of the authors; I just disagree with them.
Yes, Ryan is correct. Our claim is that even fully aligned personal AI representatives won’t necessarily be able to solve important collective action problems in our favor. However, I’m not certain about this. The empirical crux for me is: Do collective action problems get easier to solve as everyone gets smarter together, or harder?
As a concrete example, consider a bunch of local polities in a literal arms race. If each had their own AGI diplomats, would they be able to stop the arms race? Or would the more sophisticated diplomats end up participating in precommitment races or other exotic strategies that might still prevent a negotiated settlement? Perhaps the less sophisticated diplomats would fear that a complicated power-sharing agreement would eventually lead to their disempowerment anyway, and refuse to compromise?
As a less concrete example, our future situation might be analogous to a population of monkeys with uneven access to human representatives who earnestly advocate on their behalf. There is a giant, valuable forest that the monkeys live in, next to a city where all important economic activity and decision-making happens between humans. Some of the human population (or some organizations, or governments) end up not being monkey-aligned, instead focusing on their own growth and security. The humans advocating on behalf of monkeys can see this happening, but because they can’t always participate in wealth generation as well as independent humans can, they eventually become a small and relatively powerless constituency. The government and various private companies regularly bid or tax enormous amounts of money for forest land, and even the monkeys with index funds are eventually forced to sell, and then go broke from rent.
I admit that there are many moving parts of this scenario, but it’s the closest simple analogy to what I’m worried about that I’ve found so far. I’m happy for people to point out ways this analogy won’t match reality.
Zero alignment tax seems less than 50% likely to me.