You are what I call an accelerationist, and I consider someone like Yudkowsky one too. You are conservative accelerationists, but the point is that given your beliefs in the possibility of alignment, you would eventually create AGI if it were up to you.
Alignment is impossible because you can’t align something infinitely complex. Humans are not artificially aligned. They are aligned via billions of years of evolution. Inscrutably so. It doesn’t mean anything to talk about figuring ourselves out to the point that we could perfectly replicate into something else both our empathy for each other and our inherent cognitive similarity to each other. That similarity is what allows us to live in equilibrium with each other. The empathy helps too, but the similarity is what is actually key, because we have already shown the ability to ignore that empathy at length. The similarity is what allows us to keep each other in check. That is, we learn together. And the third thing that allows us to coexist as we do is that we need each other for survival and for a meaningful experience.
The problem is that none of these things could possibly apply to an AGI. The only thing you could get is something short of AGI that seems to satisfy these conditions. Supposing we could encapsulate ourselves in an invincible bubble of observation after AGI has been created, what we would observe is that no matter how perfectly aligned it was from your perspective, and from that of every other conservative accelerationist, it would inevitably stray from that alignment as time went on. That’s because the AI would interact with the environment and be changed by it, in ways we can’t predict in advance. You don’t even begin to understand how absurd the thing you believe you can do is. This is simply because you draw a false inference from how humans act to how AGI could supposedly act. We are not the same. And not being the same, we are at odds in that sense. We are the same to each other. But I repeat myself.
There are many, many false things conservative accelerationists believe. But one that presently comes to mind as especially glaring is the belief that it would be easier to align AGI than to do on our own anything we would want it to do, or to convince the world not to build it, or both.
Do you not see the paradox here? You simultaneously think humans are more complex than AGI, given that you find them harder to align, and that AGI would be more complex than humans. You need to pick one or the other.
Another paradox with alignment is that you are either imagining aligning something dumber than you, which would make it mere software, and not AGI, or you are imagining aligning something smarter than you, which would make it a contradiction in terms. What do you even mean by a greater intelligence if you speak of controlling it? You’re not aware of what you mean by words. You’re not aware that you mean something different with them every time you use them. That you use the same words with vastly different meanings in the same contexts. This is how you come to believe something as absurd as that you can align something smarter than yourself. When you say that, you imagine that intelligence doesn’t mean an inherent resistance to being controlled by things less intelligent than itself, but then when you think purely in terms of the AGI itself and what it might do, you again mean something that can resist such control. You are contradicting yourself. And it’s hard to believe this because so many people you find intelligent do it too. Constantly. All the time. You can’t imagine that everyone you look up to falls for such apparently embarrassing thinking mistakes.
But maybe I’m wrong of course. I look forward to you addressing these arguments meaningfully and showing me what I don’t understand. Same for anyone else.
There are different concepts of alignment. “Intent alignment” versus “values alignment” is one way to put it. Alignment with user intention means that an AI follows user instructions. Values alignment means that an AI adheres to certain values, even if it is making its own decisions.
The classic idea for aligning a superintelligence is that you determine its values before it is superintelligent, and then once it becomes superintelligent, it doesn’t want to change its values, even if in theory it could do so.
The extent to which this makes sense depends on the architecture of the AI. In a design where there is a very clean separation between the utility function / value system, and the problem-solving intelligence, there seems a good chance that the value system will remain stable even as the intelligence increases. On the other hand, if you have an AI where the decision-making results from a complicated interplay of multiple competing goals, there is much more opportunity for unexpected value systems to emerge.
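As a purely illustrative aside, that contrast can be made concrete in a few lines of toy Python (all names here are hypothetical; this is a caricature of the two designs, not a real agent architecture):

```python
# Toy sketch only: two caricature agent designs, not real AI architectures.

def frozen_utility(outcome: str) -> float:
    """Design 1: an explicit, protected value system, separate from the planner."""
    return {"humans_flourish": 1.0, "humans_harmed": -1.0}.get(outcome, 0.0)

def plan(options: dict, utility) -> str:
    """The problem-solving component: pick the action whose predicted outcome
    the value system scores highest. Improving this search (more intelligence)
    leaves the value system itself untouched."""
    return max(options, key=lambda action: utility(options[action]))

# Design 2: no single value function. Preferences are an interplay of
# competing drives whose weights drift as the system learns.
drives = {"approval": 0.5, "resources": 0.3, "curiosity": 0.2}

OUTCOME_APPEAL = {
    "humans_flourish": {"approval": 1.0, "resources": 0.1, "curiosity": 0.2},
    "humans_harmed":   {"approval": -1.0, "resources": 0.9, "curiosity": 0.4},
}

def emergent_utility(outcome: str) -> float:
    """The 'values' are just whatever the current drive weights add up to."""
    return sum(w * OUTCOME_APPEAL[outcome][d] for d, w in drives.items())

options = {"cooperate": "humans_flourish", "defect": "humans_harmed"}
print(plan(options, frozen_utility))    # cooperate, and always will
print(plan(options, emergent_utility))  # cooperate today...

drives.update({"approval": 0.1, "resources": 0.7, "curiosity": 0.2})
print(plan(options, emergent_utility))  # ...defect after the weights drift
```

The worry about the second design is exactly the one voiced in this exchange: nothing protects the weights, so the effective value system is an artifact of the system’s history rather than a specification.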
As spokespersons for opposite sides of this argument, I would pick Steve Omohundro, who I believe has argued that as intelligence increases, there should be an increasing tendency for an explicit and protected value system to emerge, and Guillaume Verdon, who thinks it is more adaptive for an intelligence to remain deeply flexible in its goals. (Maybe @Remmelt and his guru Forrest Landry should also be mentioned in the second camp—they have argued like you, that an AI civilization would necessarily drift from its original imperatives under selection pressures.)
It is also far from clear to me that human values are inscrutably complex. The idea that they are darwinistically produced neurogenetic “spaghetti code” is a familiar one. However, nature and biology also contain emergent simplicities. The ethical debate among human beings often circles back to pleasure and pain as the ultimate reference points. Some version of Benthamite hedonistic calculus, where you’re just adding up pains and pleasures in some way, may turn out to be a convergent ideal, not just for human beings, but for many possible forms of conscious mind.
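To make “adding up pains and pleasures in some way” concrete, here is one minimal formalisation (a sketch, not something either side of this exchange commits to):

$$W \;=\; \sum_{i \in \text{minds}} \int_{t_0}^{t_1} \big(p_i(t) - q_i(t)\big)\,dt$$

where $p_i(t)$ and $q_i(t)$ are the momentary pleasure and pain of mind $i$. Bentham’s own felicific calculus refines the sum with weights for intensity, duration, certainty, and propinquity; the suggestion above is that some such simple aggregate, rather than inscrutable spaghetti code, could be the convergent core of value.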
You say that similarity does more than empathy to keep human beings living in positive-sum ways; from a different angle, @RogerDearnaley has recently been arguing that universal altruism is not enough for values alignment because, to paraphrase, predators enjoy eating their prey, even if the prey is a universal altruist. Dearnaley has been arguing that human-friendly values alignment needs to treat aggregate human well-being as the terminal value (i.e. as an intrinsic good), with well-being for other entities (animals, aliens, conscious AI, etc.) to follow as a derived good arising from human empathy, and therefore capable of being deprioritized if their well-being would come at the price of our well-being (or even just our survival).
A further possible consideration is an old perspective due to Eliezer, who in his very early days sometimes said: the ideal in AI is not to create something perfect that then rules the universe for all time; the idea is to ensure that when the torch of intelligence gets passed to something more capable than us, as it inevitably will be, it is passed to something which is morally and not just epistemically superior to us. It may still be fallible, but if it is less fallible than humans, while nonetheless maintaining the imperatives that have defined the best of what we are, then that is a desirable outcome. This is a response to the idea that AI values will inevitably drift: human values drift too, and in disastrous ways. If an AI can learn good human values, while being superhuman in its fidelity to them, then that is better than a future in which e.g. humans just wipe themselves out with advanced technology.
So I don’t agree that the idea of creating aligned superintelligence is necessarily a bad idea for all time. There may be a level of knowledge at which you could do it while genuinely knowing what you’re doing. However, that is not our current world; our current world is more like “do it and hope you can figure out what you need to know along the way”. (Soon it will probably be just, “do it so the AI can save you from the economic and military fiascos you’ve created for yourself”.)
I have something like respect or sympathy for people who want to stop AI completely. I try not to get in their way. But given the widespread proliferation of knowledge on how to make AI, I will keep working to reach that “level of knowledge” at which we would “genuinely know” what we’re doing.
It is our values that are infinitely complex and can’t be encoded into AI. Our values are simply our desired state for the world. This state changes automatically and constantly. An AGI aligned to our values is impossible to create precisely because the most fundamental value we have is the ability to spontaneously decide what our preferred future state is.
With an AGI more capable than you, you would lose that freedom. If it simply did what you wanted perfectly, yeah, that would work, because then it would simply be an extension of your volition. But that is not an intelligence. That is a piece of technology that empowers you to do what you want better and faster. It is not autonomous. It is perfectly contiguous with you. This is why it’s impossible. It can’t be a perfect servant (which is what value alignment would actually be, as I just explained) if it has its own mind.
The AGI will not simply do what you want perfectly. It will be able to do incredibly impactful things, and you will have attempted to train it to fit your preferences, but it will fit them imperfectly, and that imperfection is what will prove to be too much to handle.
We have extensions of our exact will already, of course. The difference with them is that they tend to effect very weak actions on average, so we have the ability to correct them. And even that only to a point. But if you increase the world’s ability to drift beyond your ability to change it, you lose. That is what losing is. And you lose badly. Do it strongly enough and you lose permanently, and on behalf of all humanity itself. That is what AGI represents. And it represents that because its capacity to change the world is on track to outpace our capacity to correct its inevitable differentiation from us.
The most significant mistake you make here is to underestimate the delicacy of this differentiation and its implications. You assume that you can simply get it mostly right from your perspective and that that is fine. That’s for two reasons: you’ve been argued into thinking that a world without you which follows your values in your absence is a good world, and you don’t appreciate just how vulnerable we are to the environment getting out of control.
You present certain arguments according to which having AGI around is inherently intolerable or cursed. But they seem to be getting very general, so general that they could be reasons why a child must not have parents, or a country must not have a government: there just cannot be a power over you that diverges from you even a little. Could you clarify what’s wrong with “rule by AGI” that doesn’t apply to “rule by parents” or “rule by the state”?
Humans have empathy, a mutual need of each other to survive, and can keep each other in check. None of us is intelligent enough to act unilaterally against the wishes of everyone else. None of us would want to, unless we were insane, and then we couldn’t, because we would lack the intelligence for it.
Parents have biological imperatives. If you want to recreate such an imperative in something else, you would have to make that something else actually human; if you make it anything else, the empathy simply doesn’t fit (not that you would ever know how to recreate it to start with). The short answer here is that we don’t know how this empathy works, and, not knowing, we can’t assume we will somehow know in time. It is my opinion that this is unknowable for the purpose of transferring it into an AI. These drives aren’t conscious, per se; they have been encoded in us over enormous aeons. We do not in fact understand them in a scientific sense (...to the extent you would need to, obviously), we merely understand them experientially, which can cloud one’s judgement about our ability to manipulate them in the ways you would seek to with an AI.
The state has similar drives. It has a certain sort of empathy, no matter how strange this might sound. But more relevantly in its case, it has to obey the simple laws of politics which say that it would be very hard for it to set out to attack those it is expected to lead and protect. At some point in trying to do so, it would run into the simple problem that it is made out of people who wouldn’t see it in their interest to carry out its orders. We are all the state in some sense.
Also, as humans, we have an inherent track record which matters quite a lot when contemplating what we might do to each other that is too destructive: a long history of being able to coexist.
The problem is not that I am too general. It is that you anthropomorphise incredibly easily, and it doesn’t make any sense. An AGI would obviously not care about our ability to overpower it, because this ability wouldn’t exist past a point. It wouldn’t care about us at an emotional level because it lacks our particular ability to feel such emotions, and it lacks the history that created them blindly. Emotions are barely even things we can very clearly reason about. We have no clue what they are. We know only our feeling of them. You can trivialise them as things that mean simply something conducive to your goals or not, but that seems far too reductive. Not because it might not be just that, but because such explanations as we have, including this, are very likely incredibly deceptive as to their underlying complexity.
There are many ways to talk about this. But ultimately, you lack an answer to the question: what exactly do you do about something you can’t predict changing the environment in ways you can’t react to after the fact, if the change is bad? And why would it ever not be bad? Why would you ever assume that things going badly isn’t the rule, absent human direction?
I think you’ve grown incredibly complacent about our apparent comfort and safety as a civilisation, and you take this feeling wholesale and project it optimistically into the future, no matter how extreme the circumstance. The truth is we are not as evolved as we think we are to begin with, we are incredibly fragile, and if we lose control over our ability to shape our environment, there is no reason to think that will go in any way except completely badly. It always does. Anything we do that is not deliberate ends in disaster. The universe is not an inherent joyride. Everything we’ve done to make it resemble one has been deliberate, not guaranteed.
Our ability to control our environment is key. Our ability to understand our systems in detail is key. Without this kind of awareness, we are completely defenseless and have no reason to expect anything good. This should be entirely obvious.
Also, we have no choice when it comes to ourselves. To ask “how can you trust humanity” is akin to asking “how can you trust yourself?”. There is no negative answer possible here. We are forced to operate in this way.
If you permit, I’ll summarize a lot of that as follows: the reason that “rule by AGI” is different is that it is so alien, and we don’t know how to make it significantly less alien.
The argument from alienness still makes sense, but its strength has eroded somewhat in the era of conversational AI. It turned out, not only that directing powerful general pattern-matchers at the human textual corpus gave them the ability to talk like a human being, but that it induced in them an internal conceptual structure that humans are capable of interpreting.
An optimist might say: maybe we can use these techniques to create a first approximation to an anthropomorphically benevolent being, then ask it to devise superior techniques which will be sufficient to create the real thing, trusting that enough concepts have been correctly inferred for it to figure out what is wrong or missing in our specifications.
This kind of optimism is based on the hope that anthropomorphic benevolence, as a target in the space of possible minds, is surrounded by a “basin of attraction”. All we have to do is land in that basin, i.e. we only need to specify the goal up to a certain degree of accuracy, and provide the task to a mind which is sufficiently close to anthropomorphic, and any details that were wrong or missing will be corrected and filled in, by the intrinsic logic of the problem.
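The basin metaphor comes from dynamical systems, where it has a precise meaning: any starting point close enough to an attracting fixed point is carried to it exactly by iteration. A toy numerical sketch of just that shape (there is nothing about minds in it; the open question is whether value-correction behaves anything like a contraction):

```python
# Toy dynamical-systems picture of a "basin of attraction": an iterated
# correction step pulls any nearby starting point to the same fixed point,
# so initial error within the basin gets repaired rather than amplified.

def correct(x: float, target: float = 1.0, rate: float = 0.5) -> float:
    """One round of self-correction: move partway toward the target."""
    return x + rate * (target - x)

for start in (0.3, 0.9, 1.6):        # rough initial "specifications"
    x = start
    for _ in range(30):              # repeated rounds of refinement
        x = correct(x)
    print(f"start={start:.1f} -> {x:.6f}")  # all three converge to 1.0
```

Here the fixed point stands for anthropomorphic benevolence and the correction step for the mind filling in wrong or missing details. Outside the basin, the very same dynamics carry you somewhere else entirely, which is the pessimist’s reading of the same picture.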
Regarding empathy in particular, I think any mystery pertaining to it is largely because it involves consciousness, and consciousness remains a fundamental problem rather than just a technical problem, for scientific understanding… I spent quite a few years “studying consciousness”, from the perspective of wanting to understand its nature and how it relates to everything else, and this is an area in which I believe in the possibility of a conceptual breakthrough. That is, if the right connections are made, the rift between subjective experience and the naturalistic worldview could be closed completely, and many mysteries would fall into place.
Now, I’m going to cut myself short here, even though there is much more to discuss in your (very helpful) critiques. Hopefully I can return to them. But I just want to say a few things about where I’m coming from.
I have presented these sketchy solutions, or reasons for hope, to a handful of your objections, not because I am decidedly optimistic, but just to indicate where the counterargument lies. We simply do not know whether there are forms of artificial superintelligence which would naturally coexist well with humanity, or whether it’s a tightrope walk to coexistence, no matter what design you use. That uncertainty alone should be reason enough to stop what we’re doing, but that’s not how our elites see things. To those who want to stop the juggernaut, good luck. But as a theory-minded person, I intend to work on the theory of how to steer the juggernaut so it doesn’t crush us. One reason for this focus is that there truly may be very little time.
I’ll be back when I can.
I think you understand the problem a little better now, but you still think alignment is possible, and it’s not; it’s a complete waste of your time to try to solve it.
What you still don’t grasp is that the human is themselves the perfect alignment target, and is something infinitely hard to discover ex nihilo. Anything short of human is apocalyptic when it comes to greater powers than us. You still think otherwise.
There’s a terrible flaw in your logic. If an AI could solve its own alignment it would already be aligned. Because alignment is a continuous process, for the reason I explained, namely that our first value is the ability to choose our preferences, and our preferences shift continuously and inevitably.
I don’t see that consciousness matters to this topic at all.
Optimism is bad. It deliberately seeks out delusions. Pessimism is concerned with predicting the worst in order to be prepared for it. Once in place, it has no need of the human to reason about it and correct it. It is the perfect strategy. If it fails, everything fails.
Let me put it in video game terms. Imagine you’re playing a competitive video game. Being an optimist about how easy the game is means seeking to play with similarly skilled players so that you can “learn better” and at your own pace. Being a pessimist means that if offered the chance to play against nothing but the strongest players, you take it.
Why?
Because any single action you take against these players will be a lot more successful against lesser players.
Why?
Because you are inherently training to play the game in the best possible way. You are forced to do this. Any move the opponent makes that succeeds inherently teaches you about what has no business working against anyone else (namely the strategy you were employing at the time you were playing against them). You could have done the same thing in a game with lesser opponents and won. But you now have no reason to. You know that winning like that is improbable, provided that the playerbase as a whole keeps learning. It is a strategy that works only temporarily and against worse opponents. It will sometimes work against better opponents, but very rarely. One day it will not work against anyone (at least asymptotically). You are learning about what strategies will not work over time. This is why you play against better opponents. And this is why you are a pessimist. Because the universe works exactly the same way. Being an optimist means training in an environment that is unlikely to produce successful results, and even if it does, it will not do so for long. The best possible thing to do is to train with perfection in mind. And for that you need a better opponent. A stronger orientation towards what can really go wrong.
I consider alignment impossible (yes, literally and completely).
For which definition of alignment?
There are a number of routes to AI safety.
Alignment roughly means the AI has goals or values similar to human ones, so that even acting agentively without supervision, it will do what we want, because that’s also what it wants. There is a lot of semantic confusion between people who use “alignment” in an engineering sense, meaning something that renders current AI safe in a good-enough way, and people who use it to mean a maths-style solution that applies perfectly to every case.
Control means that it doesn’t matter what the AI wants, if it wants anything, because we can make it do what we want.
Corrigibility means alignment that can be changed once an AI is up and running. Control could be considered extreme corrigibility.
Non-agency. Alignment and Control are both responses to agency. A third approach is non-agentic “tool AI”, which responds to a specific instruction or request. Current (2025) AIs are fairly tool-like. (A toy sketch of these four routes follows below.)
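Purely as a reading aid, the four routes can be caricatured in a few lines of Python (hypothetical classes, not real systems); the point is only where the wanting lives and where the stopping lives:

```python
# Toy caricatures of four routes to AI safety (for contrast only).

class AlignedAgent:
    """Alignment: it acts on its own goals, and its goals resemble ours."""
    def __init__(self, human_like_values: dict):
        self.goals = dict(human_like_values)
    def act(self) -> str:
        # Unsupervised agentic action, safe only because the goals match ours.
        return max(self.goals, key=self.goals.get)

class ControlledAgent:
    """Control: whatever it wants is irrelevant, because we gate its actions."""
    def __init__(self, goals: dict, overseer_permits):
        self.goals, self.permits = goals, overseer_permits
    def act(self) -> str:
        choice = max(self.goals, key=self.goals.get)
        return choice if self.permits(choice) else "action blocked"

class CorrigibleAgent(AlignedAgent):
    """Corrigibility: alignment that can still be edited while it is running."""
    def revise(self, new_values: dict) -> None:
        self.goals = dict(new_values)

class ToolAI:
    """Non-agency: no standing goals; it answers a request and then stops."""
    def respond(self, request: str) -> str:
        return f"best answer to {request!r}"
```

The inheritance is just for brevity; as noted above, control can equally be viewed as corrigibility taken to the extreme.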
“It is our values that are infinitely complex and can’t be encoded into AI.”
That’s not literally true, since there are a finite number of people with a finite number of neurons each.
“Our values are simply our desired state for the world. This state changes automatically and constantly. An AGI aligned to our values is impossible to create precisely because the most fundamental value we have is the ability to spontaneously decide what our preferred future state is.”
You wouldn’t be able to create a sovereign AI that is given a fixed set of values matching human values... but there are other things you can do to achieve safety, like solving control as opposed to alignment.
“Humans are not artificially aligned. They are aligned via billions of years of evolution.”
Humans aren’t aligned in the sense of having identical values: there are constant disagreements about basic values. That makes safety via alignment impossible, but doesn’t make safety impossible.
“Anything short of human is apocalyptic.”
Why? I have seen that asserted, but I have never seen a valid argument for it.
“I think that the only way to stop AGI is to convince as many people that it should be stopped as it would take to actually stop it.”
We non-doomers are not convinced by the arguments we have seen, where they exist at all. Therefore, doomers need better arguments, not more conversations.
You don’t know for a fact that there are a finite number of people with a finite number of neurons, but leaving that aside and accepting it, what’s infinitely complex about that set of people and neurons is the information contained within them. That’s because you can’t keep up with two people’s thoughts at the same time, let alone with eight billion of them. The infinity is in the fact that you can’t completely experience what anyone else does. What you get is emergent alignment (understood as similarity, not identicality), but one that is not a choice, but a given, and an old one at that. So, you don’t get alignment this way with AI. In its case, it’s neither a given nor old. That makes it inherently dangerous, no matter what you tell yourself about its structure as you perceive it.
If you have issues with my usage of infinity above, I advise you to understand that words are intrinsically polysemous, and we use them as carriers of meaning, and not the other way around. I am trying to explain things to you, not have a debate about definitions (though we can if you like; except if you do it by simply asserting that I should only use words in the way you prefer, I will simply ask you why and we can go from there). If you didn’t understand what I said, feel free to tell me and I’ll explain it further.
You can’t make a system smarter than you corrigible, because to make something corrigible you have to understand it, but if it’s smarter than you, you don’t in fact understand it. This would be a good place to ask you what you understand by intelligence, since it’s the reason you believe otherwise. Specifically, tell me what sort of system you imagine an AGI to be such that it is both smarter than us and corrigible. Tell me what that means, however you please.
Let’s go through everything you believe “we can do to achieve safety”. Nota bene: when I speak about alignment I employ whichever of the definitions you would normally interpret according to context (as it applies to the question of an AGI being safe). Specifically, tell me what it is that you think makes it so we can make systems more complex than ourselves safe. Or, if your claim is that safety doesn’t have to be demonstrated because it’s a law of physics, then we can debate what you understand by safety. Kindly tell me what makes it so everything is safe and as such nothing has to be defended as having that trait. Alternatively, if you admit that you don’t know that AGI can be safe, I would ask you why you want to create it to start with. Is there something else you consider more dangerous than AGI that you want AGI to protect you from? Which is it? Let’s assume we get into that conversation, and there is such a thing; next you should tell me what reasons you have for finding that thing more dangerous than AGI.
Note, these are mere suggestions; feel free to reply however you want, of course. I will try to align with your understanding so as to carry my arguments further regardless, and I’ll gladly update my beliefs by adopting yours if they prove to be more coherent than mine.
Anything short of human is apocalyptic because humans are the only entities we’re aware of that actively seek to protect humanity as a whole. The rest of the cosmos is actively trying to kill us. This is universally true. We have no reason at all to think it is possible for anything not to want to kill us, or not succeed in doing so if we gave it control over our environment. Things observe their own laws of action, including things that resemble us, but are not us. Those laws are not identical with our laws, and as such, are at odds. If given too much discretion, they will erode our experience to the point of deletion.
I would tell you that one of us has a more correct view on this than the other. It is in both our interests that we discover who does. Because whichever of us it is, both parties have an interest in both of us making better decisions. You agree with this, yes?