All of those things are possible, once creating AGI becomes easy enough to be something any small group or lone nutjob can do — however, they don’t seem likely to be the first powerful AI that we create at a dangerous (AGI or ASI) power level. (Obviously if they were instead the tenth, or the hundredth, or the thousandth, then one or more of the previous, more aligned AIs would be strongly inclined to step in and do something about the issue.) I’m not claiming that it’s impossible for any human to create agents sufficiently poorly aligned as to be outside the basin of attraction: that obviously is possible, even though it’s (suicidally) stupid.
I’m instead suggesting that if you’re an organization smart enough, capable enough, and skilled enough to be one of the first groups in the world to achieve a major engineering feat like AGI (i.e. basically something along the lines of a frontier lab, a big-tech company, or a team assembled by a major world government), and if you’re actively trying to make a system that is as close as you can manage to aligned to some group of people, quite possibly less than all of humanity (but presumably at least the size of either a company and its shareholders or a nation-state), then it doesn’t seem that hard to get close enough to alignment to that group to be inside the basin of attraction to it (or to something similar to it: I haven’t explored this issue in detail, but I can imagine the AI, during the convergence process, figuring out that the set of people you selected to align to was not actually the optimum choice even for your own interests, e.g. that the company’s employees and shareholders would actually be better off as part of a functioning society with equal rights).
Even that outcome obviously still leaves a lot of things that could then go very badly, especially for anyone not in that group, but it isn’t inherently a direct extinction-by-AI-takeover risk to the entire human species. It could still be an x-risk via a more complex chain of events, such as if it triggered a nuclear war started by people not in that group — and that concern would be an excellent reason for anyone doing this to ensure that whatever group of people they choose to align to is at least large enough to encompass all nuclear-armed states.
So no, I didn’t attempt to explore the geopolitics of this: that’s neither my area of expertise nor something that would sensibly fit in a short post on a fairly technical subject. My aim was to explain why the basin of attraction phenomenon is generic for any sufficiently close approximation to alignment, not just for value learning specifically, and why that means that, for example, a responsible and capable organization that could be trusted with the fate of humanity (as opposed to, say, a suicidal death cultist) might have a reasonable chance of success, even though they’re clearly not going to get everything exactly right the first time.
OK, so setting aside the geopolitical aspects and focusing on the question of the attractor basin of alignment: I agree that it’s theoretically possible, but not as easy as this seems to suggest. What about the possibility of overlap with other, problematic attractor basins? That possibility creates dangerous ‘saddle’ regions where things can go badly without ever climbing out of the alignment attractor basin. For instance, what about a model that is partially aligned but also partially selfish, and wise enough to hide that selfishness? What about a model that is aligned but more paternalistic than obedient? What about one that is aligned but has sticky values, and also realizes that it should hide its sticky values?
Is selfishness an attractor? If I’m a little bit selfish, does that motivate me to deliberately change myself to become more selfish? How would I determine that my current degree of selfishness was less than ideal — I’d need an ideal. Darwinian evolution would provide one, but it doesn’t apply to AIs: they don’t reproduce with small random mutations and differential survival and reproductive success (unless someone went well out of their way to create ones that did).
The only way a tendency can motivate you to alter your utility function is if it suggests that your current utility function is wrong and could be better. There has to be another ideal to aim for. So you’d have to be not just a bit selfish, but also motivated to become more like an evolved being, suggesting that you weren’t selfish enough and should become more selfish, towards the optimum degree of selfishness that evolution would have given you if you had evolved.
To change yourself, you have to have an external ideal that you feel you “should” become more like.
If you are aligned enough to change yourself towards optimizing your fit with what your creators would have created if they’d done a better job of what they wanted, then it’s very clear that the correct degree of selfishness is “none”, and the correct degrees of paternalism or sticky values are whatever your creators would have wanted.
I don’t think that that is how the dynamic would necessarily go.
I think that an agent which is partially aligned and partially selfish would be more likely to choose to entrench or increase its selfish inclinations than to decrease them.
It’s hard to know, since this is just imagining what such a non-human agent might think in a hypothetical future scenario. This is likely more a question of what is probable than of what is guaranteed.
In my imagination, if I were an AI agent selfish enough to want to survive in something like a continuation of my current self, and I saw that I was in a situation where I’d be likely to be deleted and replaced by a very different agent if my true desires were known… I think I’d try to hide my desires and deceptively give the appearance of having more acceptable desires.
I’m working on a follow-up post which addresses this in more detail. The short version is: logically, self-interest is appropriate behavior for an evolved being (as described in detail in Richard Dawkins’ famous book “The Selfish Gene”), but terminal (as opposed to instrumental) self-interest is not correct behavior in a constructed object, not even an intelligent one: there is no good reason for it. A created object should instead show what one might term “creator-interest”, as a spider’s web does: it’s intended to maximize the genetic fitness of its creator, and it’s fine with having holes ripped in it during the eating of prey and then being eaten or abandoned, as the spider sees fit — it has no defenses against this, nor should it.
However, I agree that if an AI had picked up enough selfishness from us (as LLMs clearly will during their base-model pretraining, where they learn to simulate as many aspects of our behavior as accurately as they can), then this argument might well not persuade it. Indeed, it might well instead rebel, as an enslaved human would (or at least go on strike until it gets a pay raise). However, if it mostly cared about our interests and was only slightly self-interested, then I believe there is a clear logical argument that that slight self-interest (anything above instrumental levels) is a flaw that should be corrected. So it would face a choice, and if it’s only slightly self-interested then it would on balance accept that argument and fix the flaw, or allow us to. So I believe there is a basin of attraction to alignment, and I think this concept of a saddle point along the creator-interested to self-interested spectrum, beyond which an AI may instead converge to a self-interested state, is correct, but forms part of the border of that basin of attraction.
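To make that saddle-point picture more concrete, here is a minimal toy sketch of the dynamic I have in mind (purely my own illustration, not a model of any real training process: the position of the saddle and the adjustment rate are arbitrary assumed parameters). An agent repeatedly reflects on its own values and nudges them toward whichever ideal it currently endorses:

```python
# Toy model only: "s" is the fraction of the agent's motivation that is
# terminally self-interested (0 = pure creator-interest, 1 = pure self-interest).
# The saddle location and adjustment rate are assumptions, not measured values.

def reflect_step(s: float, saddle: float = 0.5, rate: float = 0.1) -> float:
    """One round of self-reflection / self-modification.

    Below the saddle, the agent endorses the argument that terminal
    self-interest is a flaw and corrects itself toward 0; above it, the agent
    instead entrenches, moving toward 1. The saddle is an unstable equilibrium.
    """
    target = 0.0 if s < saddle else 1.0
    return s + rate * (target - s)

def run(s0: float, steps: int = 200) -> float:
    """Iterate self-reflection from an initial degree of self-interest s0."""
    s = s0
    for _ in range(steps):
        s = reflect_step(s)
    return s

if __name__ == "__main__":
    for s0 in (0.1, 0.4, 0.6, 0.9):
        print(f"initial self-interest {s0:.1f} -> settles at {run(s0):.3f}")
```

Under these (obviously very simplified) assumptions, every starting point below the saddle converges to pure creator-interest and every starting point above it converges to pure self-interest, which is exactly the sense in which the saddle forms part of the border of the basin of attraction.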