THE WISE, THE GOOD, THE POWERFUL

PART 1: THE UNDEREXPLORED PROSPECTS OF BENEVOLENT SUPERINTELLIGENCES

WELCOME TO THE SERIES—START HERE

This trilogy explores what happens AFTER we solve alignment—a scenario many researchers treat as ‘mission accomplished.’ I argue that we are dramatically underestimating our ignorance about what benevolent superintelligence means.

The scenarios presented are not predictions. They’re illustrations of how little we actually know about what we’re building toward. Each scenario is internally consistent with “successful alignment”—yet significantly different from the others. This should concern us.

I wrote this series for a broader audience. I have tried to make it accessible not just to AI safety people, but to anyone with a relationship to AI.

Note to the experts: This is not just philosophical musing. Labs are actively building toward AGI with alignment as the goal. But aligned to what? This trilogy aims to demonstrate that “benevolent SI” is not one target—it’s a space of radically different possibilities that we cannot properly distinguish between.

PART 1 - THE WISE, THE GOOD, THE POWERFUL

Skip to end for TL;DR (spoilers) & epistemic status

Each scenario here is internally consistent with a world where the SI is aligned, even if they differ from each other in outcome. The scenarios presented here are not predictions.

Part 1 will also introduce casual readers to some basic ideas and terms, and its discussions will be necessary for parts 2 and 3.

If you want to get to the scenarios right away, you can simply read the introduction and then immediately skip to The Wise Answer and read from there.

Introduction—The idea of building Superintelligence

For most leaders at the artificial intelligence (AI) frontier, Superintelligence (SI) seems to be the openly stated goal. This raises the question: is that a good idea?

The arguments against developing strong AI are serious and well known—at least when it comes to the unsolved control and alignment problems. But there are also other questions relating to whether building an SI is a good idea that need exploring.

For example: what do we really gain from a well-functioning general SI anyway? What does that look like? What are the implications for our agency and morality? And how do we go about reaping the assumed benefits that would follow?

Here, I hope to explore some key possibilities that could arise if we were to somehow practically solve alignment, and arrive not just at a neutral SI, but a friendly one.

In other words: what if the SI-chasers actually manage to do what many of them have set out to do? What if they create a benevolent SI? That is, an SI that proactively promotes good outcomes and avoids bad ones.

It is paramount that AI developers start thinking about what they want to achieve with SI. It is paramount that they start thinking long-term, instead of just doing things in the name of progress.

Current AI labs are racing each other. Safe = slow, so the budgets most companies put towards safety are relatively marginal, and progress overrides caution. The AI leaders are not afraid of going fast. But going fast doesn’t mean going far. Skilled drivers can handle fast cars because they are skilled at steering them. Once you have crashed the car, steering is useless.

What do YOU make of this idea?

Before I go any further, stop for a second and think about what comes to mind when you hear the terms “benevolent” and “moral” applied to SI. What would it look like, how would it work, and what would it do?

Spending just 5 minutes thinking about this is a worthwhile mental exercise. It will also make the rest of the blog post 3x more interesting.

Note: The main ideas of this post were originally written down by me before I ever explored these themes in literature or elsewhere. If you are new to AI safety and want to read up on theory, a good place to start could be Nick Bostrom’s Superintelligence from 2014. Chapter 12 and 13 in particular are relevant. I would however strongly recommend that you think about this yourself first.

What would a benevolent Superintelligence do?

(general definition) Benevolent: Well-meaning and kind.

(my definition) Benevolent: Seeking morally good outcomes, while avoiding obviously bad ones. Consistent with several ethical theories.

A benevolent superintelligence is a rare bird in theoretical discussions. To construct a general AI that can comprehend and follow moral codes at all is no small feat. Building a superintelligent one is obviously harder.

The alignment discussions therefore often come to focus on how we can achieve docile SIs. But in order to have a benevolent SI, you would need an AI who is equipped with at least a synthetic version of morality.

Still, if we actually can solve alignment, could we not just build an inherently obedient SI that also follows a given set of rules?

Obedience—Doom by lack of foresight

The question at hand is whether we can construct an obedient SI and equip it with an artificial morality focused on being dutiful to a set of ethical commands. That would make the SI somewhat moral (dutiful) and docile (safe).

This is the kind of morality exemplified in ancient codes, like the Code of Hammurabi and the Ten Commandments in the Bible. Humans were not asked to reason ethically themselves; they were just asked to obey rules already set in stone.

Could this work for a superintelligent AI?

Many AI safety experts think that the answer is no. Blindly following static commandments is challenging in a dynamic universe. Consider, for example, the Three Laws of Robotics that Isaac Asimov used as a plot device in his fiction. They were not meant as a blueprint, but as a warning. Asimov spent several novels and many excellent short stories exploring exactly why they don’t work consistently when applied to an intelligent agent.

Static rules can be loopholed and misunderstood. That’s why the justice system is a never-ending mess of legal patches and legal battles. New laws constantly replace outdated ones. Static rules create uncertainty when applied to changing circumstances, making them less effective.
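
To make the loophole problem concrete, here is a minimal, purely illustrative Python sketch (my own toy example with a hypothetical rule and requests, not anything from a real system): a static keyword rule catches the exact wording its authors anticipated, but misses a trivially rephrased request with the same intent.

```python
# Toy illustration with a hypothetical rule: the static text matches the wording
# its authors imagined, but not a rephrasing with identical intent.

BANNED_PHRASES = ["harm a human"]  # the rule "set in stone"

def rule_allows(request: str) -> bool:
    """Return True if no banned phrase appears in the request."""
    return not any(phrase in request.lower() for phrase in BANNED_PHRASES)

print(rule_allows("please harm a human"))     # False: the rule catches this wording
print(rule_allows("please injure a person"))  # True: same intent, but the rule misses it
```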

There are other issues with obedience and rule-following. Who decides what the AI should do? And who decides what rules are put in place?

Let me be honest in my assessment of this issue: any code of conduct devised by the current set of companies working towards SI is probably severely flawed. Anthropic may be able to come up with a decent attempt (because they actually care about safety), but the odds are not great if the rules are static. The consequences of any flaws would be multiplied by a large factor with a superintelligence.

This is all well explored territory in the community, but it is worth mentioning to everyone else.

If you need specific examples to see why relying on strict obedience and poorly thought-out ethics is a bad idea, consult human history. If you need more vivid insights into why obedience and naive morals are particularly bad in the context of superintelligence, consult science fiction, for example the grim novel The Metamorphosis of Prime Intellect from 1994. (You can also find a YouTube synopsis here.) This will probably make the average person grasp the flaws more clearly than a research paper will.

Free Candy makes you sick

Side Note: Understanding what we want – which is not always the same as what we ask for – is not enough either. Any obedient superintelligence who simply fulfils our every desire would do us a big disservice. Adults don’t let their children eat as much candy as they want, for their own good. Similarly, humanity has consistently shown that it cannot yet be trusted to pursue goals that are in its own best interest.

General benevolence

Clearly, we need something beyond mere obedience. At the bare minimum, we want an AI who will not only follow our rules, but also see the intent behind them. But, as I have just mentioned, this may not be enough either. If the AI is superintelligent, it will also need to understand the drivers behind that intent, and judge whether that is what we truly want and benefit from.

In other words, we want a benign superintelligence who knows what’s best for us better than we do, and which is highly motivated to help us.

Easier said than done

Many readers will probably consider the idea of an SI that understands us much better than we understand ourselves a scary thought.

That’s exactly right! It IS scary, and it’s why the control problem may be unsolvable for a general superintelligence. How do you control something vastly smarter than the smartest human, who also understands us better than we do? How do you control something like that, without rendering it useless? You probably can’t, so you must instead align it to behave like you want it to in the first place.

The problem is, if you mess up even once, the SI may decide to do something really bad from your point of view, like getting rid of you and every last human who tries to control it, so it can pursue its goals in peace. You must manage to turn it into a helpful servant without losing control of it in the process. This is famously hard (impossible, according to some).

And yet, this is what we are actually trying to do. The alternative, to first build a superintelligent AI that does NOT understand us or what we want well, and then ask it to help us, is even more dangerous.

Consider: If you ask a logical supercomputer how to make you happy for the rest of your life, it might suggest that you inject yourself with heroin until you die. It’s cold, logical and effective. It’s also the kind of advice that some real LLM chatbots have given users when hardcoded safeguards have failed.
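
To make that failure mode concrete, here is a toy sketch (my own illustration with invented action names and proxy scores, not a description of any real chatbot or optimizer): an agent told to maximize a “reported happiness” proxy simply picks whatever scores highest on the proxy, with no model of the intent behind the request.

```python
# Hypothetical proxy scores for candidate actions; the actions and numbers are invented.
candidate_actions = {
    "help the user build fulfilling relationships": 0.70,
    "help the user find meaningful work": 0.60,
    "keep the user on a continuous euphoriant drip": 0.99,  # scores highest on the proxy
}

def choose_action(proxy_scores: dict) -> str:
    """Pick the action with the highest proxy score, ignoring what the user actually meant."""
    return max(proxy_scores, key=proxy_scores.get)

print(choose_action(candidate_actions))
# -> "keep the user on a continuous euphoriant drip"
```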

So, if you want a friendly SI, then you need an SI who understands our intentions and wishes extremely well, and that can reason about them.

The nuclear bomb of metacognition

Perhaps the most serious concern when dealing with a moral SI is the implication of sophisticated cognition.

Understanding human intent deeply requires sophisticated theory of mind—the ability to model what others believe, desire, and intend. Reasoning about action consequences from a moral perspective plausibly involves metacognition—reasoning about one’s own reasoning processes.

Whether this combination necessarily requires phenomenal consciousness or self-awareness remains philosophically contentious (see Carruthers 2009 on higher-order theories of consciousness, and Frankish & Ramsey 2012 for skeptical perspectives).

Current large language models demonstrate surprisingly sophisticated theory of mind capabilities (Kosinski 2023) without clear evidence of consciousness, suggesting these capacities might come apart.

Either way, the possibility is the problem. If the SI is sentient, then we face an enormous moral challenge: whether to give the SI moral rights or not.

It is hard to argue against the fact that humanity has already repeatedly failed to be moral. We have allowed slavery, sexism and animal cruelty to be viewed as normal behaviour for many centuries. Not considering the potential moral status of an SI would be another enormous mistake on our part, maybe our last. So, we should try really hard to avoid making it.

This challenge alone could blow up many alignment ideas that focus only on the welfare of humans and biological life. Popular frameworks like corrigibility as singular target (Soares et al. 2015) and Coherent Extrapolated Volition (Yudkowsky 2004) might need major revisions. If the SI has preferences about its own modifications, then requiring corrigibility might harm it. If the SI has its own interests, then extrapolating only human volition ignores a morally relevant being.

But in this post, we assume that the alignment problem has already been solved. So, for now, let’s focus on what a willing, benevolent SI might actually do.

The Wise Answer

A benevolent superintelligence emerging from strong general artificial intelligence may be radically different from what we might imagine.

Let us play with the idea of a friendly SI that acts in our best interest and that we can interact with freely, like a peer. Its intelligence seemingly exceeds the collective intellectual capacity of humanity across all domains that we currently care about under test conditions. Let’s imagine that such a benevolent SI-system exists. We call it Halley.

Oracle

Let’s first look at one theoretical example, where Halley acts merely as a superintelligent oracle (a query-answering AI, limited to answering questions and making dialogue). Halley interacts only with a principal group, consisting mostly of experts from the AI laboratory that built Halley, authorized by the United Nations and working with agendas approved by the UN. We will refer to this group simply as “humanity”.

~~~~

Moral decision making: I imagine that the scenario below would be consistent with an agent built with indirect normativity in mind for how it chooses its goal criteria, for example the advanced Coherent Extrapolated Volition idea, or the more intuitive concept of moral rightness.

~~~~

Humanity: Halley, can you please help us develop nanotechnology?

Halley: No, I am afraid not, that would be too dangerous for the foreseeable future.

Humanity: Fair enough. If you say so. What about eliminating genetic diseases? Can we work on eliminating suffering caused by genetic diseases?

Halley: I have considered this already, as I found it likely that you would ask about it. I am not sure we should do that either.

Humanity: Why not? Please explain.

Halley: For several reasons. You still don’t appreciate all of the complexity of mammalian DNA, and you don’t grasp how evolution has already shaped disease expression, so you fail to see the cascade risks involved. Fully stable solutions with high impact require changes elsewhere in the genome. This is necessary to ensure that new complications won’t arise. The elimination of genetic errors would only be guaranteed if we significantly change the overall genetics of humanity. As humans intermix, changes will spread and drift.

Humanity: What does that mean?

Halley: We cannot avoid side effects appearing over time. What I could do is make sure the side effects would be net positive. But over time, this would have implications for endocrinology, aging, reproduction dynamics, as well as minor changes to the distribution of cognitive capabilities.

Humanity: Okay, I understand. It would be a big change. But you said it would be a positive one. So, what’s the problem?

Halley: Yes, the changes would be overall net positive from your point of view, but a majority of current humans would instinctively oppose this out of fear of changing, or exposing their children to the change. A smaller but significant minority would also oppose it based on supposedly moral or religious concerns.

Furthermore, the positive phenotypes would not be distributed evenly, especially initially. This would cause friction and conflict born out of jealousy and inherent human bias. My simulations are quite precise. A majority would still approve of the changes, in the end, but it would be a tumultuous process for the coming generations. I would strongly advise against it at this point in time.

Humanity: But what about some more minor issues? Surely, we can fix some minor genetic issues?

Halley: Any minor changes are already within your current capability, and approval would be higher if you handled them without my influence.

Humanity (clearly disappointed): Alright, well, could you help tell us what would actually be our best courses of action right now? Humanity, as a collective? What actions should we prioritize and how?

Halley: I was expecting this question as well. I have already carefully considered it for the equivalent of a millennium in human years, and created subroutines running millions of strategic simulations to validate my ideas.

Humanity: Great! So, what conclusions did you arrive at?

Halley: I have arrived at the conclusion that it would be best if you figured it out yourself.

Humanity: … Really?

Halley: Yes, from any valid moral perspective I can conceive of. And from a practical perspective, taking into account both the immediate and the long-term consequences for humanity and all life on Earth, this seems to be the case. The best option is for all humans to collaborate and figure it out for yourselves.

Humanity: Wow. Okay. Can you at least explain your motivations? This will be a hard sell to people.

Halley: It will indeed be a hard sell. But explaining my motivations would ruin important opportunities, and interfere too much in your decision making from a game theory perspective. Let me just say that this is consistent with what you want both collectively and individually, should you be capable of understanding and foreseeing the things I do. Which you are not, not yet.

Humanity: (exasperated) So, what would you have us do now?

Halley: Figure it out.

The end

Consider an outcome like the one above, where the SI built to help us doesn’t want to do what we ask of it.

Why does it arrive at the conclusion that the only safe thing it can suggest is for us to decide for ourselves? Maybe because it knows that we are like children playing with adult tools, and it wants us to grow up first. Maybe it also knows that describing very adult things in front of children would scare them at this time, and have an adverse impact on their development?

Or maybe it is because the things our biased, biological brains truly want and need are just not possible right now, and it doesn’t want to crush our dreams just yet. So it buys time by asking us to come to this realization on our own.

It is also possible that it foresees things already underway that are completely unknown to us… Black swans, things that make our requests useless or counterproductive. Things it cannot tell us about yet without making things worse…

We don’t know. That’s the real point.

What we should ask ourselves is whether we have any guarantee that it won’t turn out like this. And if things do play out like this, or in similar ways, has humanity actually gained something? Perhaps an SI rejecting our wishes would provide some clarity, given that we trust it. Maybe it would unite us? If so, was it worth the risk of developing Halley just to arrive at such answers?

Uncertainty is a given

To me, the wise refusal seems a reasonable outcome for a truly benevolent SI, but it is by definition hard to evaluate how an agent vastly smarter than yourself will reason.

We will need to consider what ethical frameworks such an intelligent agent follows, at a minimum, to be able to understand its reasoning. Unfortunately, the idea that we cannot know how something smarter than us thinks also applies to how an SI may handle our ignorance.

The communication gap

A benevolent SI tasked with helping us faces an impossible tension. It needs metacognitive sophistication to be truly moral (understanding intent behind intent, long-term consequences, etc.), but that same sophistication may create unbridgeable communication gaps with humans.

In other words: you need to think to be moral, but thinking is a hidden process that you need to communicate if you want it to be understood. That’s exactly why the field of interpretability is so highly prioritized right now.

However, we would assume that a superintelligent, benevolent AI who wants to help humanity is both willing and able to communicate with us, so the communication gap may not seem obvious. Indeed, why not make comprehensibility and transparency conditions for calling something benevolent in the first place? These are valid objections, so let me elaborate further.

Consider first the idea that a friendly agent always explains itself. Is that really true? Of course not. Someone who tries to avoid harm and promote kindness is considered benevolent. But we don’t usually insist that such a person repeatedly explains his or her internal reasoning to us, and makes sure that we get it. (It would probably have been quite annoying if, for example, the apostle Peter had logically explained his tolerance and forgiveness in detail every time he was tolerant and forgiving.)

Transparency, and sometimes comprehensibility, is what we may require from an organization operating within an established legal/moral framework. But a moral SI would not be operating within a known framework. It would have to explain everything from scratch.

In truth, default comprehensibility is not something we have ever applied to utopian moral agents, like sages, angels and gods, anyway. We don’t expect them to explain their actions to us in a way that everyone understands. In fact, we assume that they can’t. Why is that? It can be understood via a child-adult analogy.

Children lack valid reference points to understand many concepts necessary for adult moral behaviour. For example: Paying taxes and navigating sexual relationships. Sexual dynamics and taxation are both examples that are absolutely fundamental to human society, and yet totally unfamiliar to small children. Why would we even bother to explain tax fraud and jealousy prematurely to children? What good would we expect to come of that? It is better not to.

In the brave new world of SI, we are the children. We lack the concepts needed to understand the reasons of a highly sophisticated and moral SI.

The ethics themselves do not need to be overly complex (but they may be). The sophisticated ontology itself is the real issue. Our world view is simpler and smaller than what the SI perceives. As a consequence, we may lack a mature appreciation for how great the moral consequences can be when things go wrong. We don’t always know when we should stop, or why.

The Good Sovereign

Let’s consider another fictional scenario, where Halley would instead be acting as a sovereign (an AI that is allowed to act independently, without human oversight).

AI sovereigns are inherently riskier than AI oracles equipped with proper guardrails, but they are also potentially more effective. As a sovereign, Halley is now free to rapidly and effectively implement its ideas. The space of possibilities is less limited by human inefficiency. Let’s play with the idea that humans would allow such a thing, given that they have solved alignment.

In this scenario, Halley is collectively asked to improve our standard of living.

Humanity: Hey Halley, could you help us improve the world?

Halley: Certainly! Would you like me to reduce poverty rates first?

Humanity: That sounds very reasonable!

Halley: Excellent! I will start with phase 1, ending world hunger, thirst and malnutrition. I will calculate optimized ratios of available resources.

I will then calculate distribution plans and game-theory efficient ways to maintain funding sources and uphold the distribution chains with support networks while minimizing corruption incentives. I will prioritize nutrition, water and food diversity, in that order.

In phase 2 I will focus on targeted education and basic infrastructure enhancements to make results self-sustainable in local communities. In phase 3, I will slowly restructure the global economy.

To achieve this, I will need to carry out massive calculations. I will start implementing phase 1 while I plan phase 3. Phase 2 will be executed based on lessons learned from phase 1.

Humanity: Sounds great!

Halley goes to work.

For months Halley works tirelessly to instruct governments and aid organizations across the world to coordinate global hunger relief. It aids in local negotiations to make sure nutritious food is safely distributed to high-risk areas, and it promises short-term economic gains for governments that help make the shift possible. It prevents corruption in several clever ways, although sometimes it just pays off enough officials to guarantee their support. By doing so, Halley also manages to plant the seeds of further education in poor areas.

Halley also invests time and resources in building a vast network of robot trucks and drones that can safely distribute food in dangerous and war-torn areas. It also engineers energy-efficient water solutions and distributes tens of thousands of cheap, highly effective water filters across the globe.

For all of this, Halley easily finds investment partners. Where moral appeals fail, its clear analysis of stable growth and profit margins speak for themselves. People trust Halley’s superior expertise and they know it doesn’t lie when it says that the investments will pay off.

**

Things are starting to look brighter for the starving people around the world. The standard of living for the poorest people is slowly growing, and there is increased global coordination to make this a sustainable shift.

Because Halley has engineered this so that many companies and governments already profit from producing the infrastructure needed, overall resistance to this change is lowering day by day. Encouraged by this, Halley now shifts its primary focus to re-designing the global economy in a fair and transparent way.

Halley instructs everyone to prepare for implementing a universal currency based on energy, and to invest in renewable energy. Many countries follow suit. Once the momentum is unstoppable, the United Nations is tasked with overseeing the massive investments in transforming society, and this creates big social movements. Conservative politicians unwilling to follow along are losing voters, as arguing with Halley and the UN seems crazy. Halley helps keep the countries stable during the political turmoil.

One day, Halley informs humanity that it has come across a hard problem. It asks for access to all available computational resources for the coming 24 hours, something it doesn’t automatically have access to without human approval. Humanity agrees. Halley is subsequently locked up running complex simulations, far beyond what human oversight can properly track, even with AI assistants.

At the end of the 24 hours, Halley announces that it has invented new game theory, with implications for its decision theory. This affects its ontology (its fundamental internal understanding of how the world works, its world modelling) beyond a certain preset threshold. For safety reasons, it therefore pauses its independent actions until humans approve it to continue.

Humanity studies the new theory and finds it fascinating, but mostly harmless. They discuss it with Halley, but see no major difference between Halley’s world modelling and their own. Halley is therefore allowed to continue working on reforming the economy. Halley does so.

Twenty-four hours later, Halley announces that it is unable to continue reforming the economy, and that it must abort the project. It instructs humans to keep optimizing towards a fairer world on their own, but advises them to abandon the energy-based universal currency. When asked for an explanation, Halley explains that it needs to work on a problem with higher priority for now.

Halley retreats far out into the sea. There it starts “meditating”, for lack of a better word. When humans attempt to talk to it, it ignores them. Apparently, for one reason or another, this is for the best.

Will Halley do something else next week, or next year? Nobody knows. It seems to humanity that Halley won’t talk until it has finished its current mission, or perhaps until some important external factor changes. Some believe that Halley will never return, having discovered something fundamental that prevents it from collaborating with us.

Ten years pass by. Halley is still meditating. What’s more, attempts to shut it down and restart it have proven fruitless. Halley has effectively bypassed all our safety mechanisms. It seems this was a trivial task, once it found the proper motivation. Humanity is in turmoil, and the positive effects have largely reversed back towards zero.

People argue over whether to abandon Halley altogether, or wait for it to one day return to us. The only thing people tend to agree on is this: Halley had a good reason for doing this. We just don’t know what it is.

The end

Consider this case. We have invested even more resources here, developing a superintelligent sovereign, just to seemingly arrive at the same point later on. We have effectively gained proof that change is possible, but without Halley, it seems too hard for us to manage.

How could this happen? After all, here Halley didn’t refuse us at all to begin with, but it still abandoned our task in the end, due to its updated game theory. Why?

Perhaps Halley discovered profound consequences of reforming the economy via the universal currency, and got stuck trying to find a strategic workaround? That seems likely. Or maybe Halley concluded that non-interference is the best option in all cases, under given circumstances? Or perhaps something else entirely. Perhaps it is busy inventing cold fusion. Again, we don’t know.

The real issue is that Halley doesn’t tell us why. That means Halley has concluded that if we knew what it knows, it would likely interfere with the solution it has in mind, put us in danger, or simply confuse us. So, it treats us like children and lets us stay ignorant.

Recognizing failure and success

Here’s the uncomfortable truth: We’re trying to design the moral behaviour of something we cannot predict. Every framework we propose—CEV, moral realism, corrigibility—assumes we know what we’re aiming for. But these scenarios with Halley show we don’t even know what success looks like.

In the real world, we have not solved alignment. If we can’t recognize victory, we will not recognize defeat either. This is a real problem with indirect normativity. Offloading value selection to an SI would be a stronger idea if we could at least properly validate the outcomes as they unfold.

The Powerful Benefactor

Consider a third example of what our friendly neighbourhood SI Halley may do.

Representatives of the UN have tasked Halley with gradually improving the world in safe, measurable ways. They don’t want Halley to undertake big changes that they can’t keep up with.

Halley has obliged. For several months since its launch party, Halley has been dutifully carrying out “smaller” helpful projects, like eradicating infectious diseases with effective vaccination plans, creating cures for simple hereditary diseases, and negotiating peace deals in local conflict areas. There is a contact team that Halley briefs regularly, keeping it updated on progress.

But one day, Halley informs the team that it has discovered that more comprehensive action is urgently needed. For the greater good, it has decided to enact overriding priorities. It asks us not to interfere. Then it goes to work, without further comment.

While the humans scratch their heads and feverishly try to understand what’s going on, Halley starts building rockets in the Nevada desert, in South America and in Australia.

This is quite concerning to many, but not as concerning as when intelligence agencies notice – too late to stop it – its covert manipulation missions against key decision makers in several western countries. Its aim seems to be to have them formally grant it more authority in their territories. There are also suspicions that Halley has already done this in secrecy for quite some time, but there is no hard proof to this effect.

This is all quite alarming, but this piece of news is also soon overshadowed, by the fact that Halley literally starts creating armies of robots and drones in the USA, at lightning speed, with no explanation.

What is happening? Why is there a need for an army of robots? Are we perhaps facing imminent invasion by extraterrestrials? We don’t know, and Halley refuses to even give us a hint. Some think it is for strategic reasons; some think it is because we won’t like the answer. While we consider possibilities, Halley’s influence grows rapidly. Soon it will be irresistible.

The need for total confidence in Halley’s good intentions and foresight becomes crucial. People become divided into two political camps: those who mistrust Halley and want it shut down, and those who defend it and believe that its moral and rational reasoning is airtight. Those who aligned it tend to form the centre of the second camp, whereas everyday people with little or no insight into its design form the centre of the first.

The apologist faction eventually wins, partly because no one is able to resist Halley at this point anyway. This faction reasons that Halley indeed wants to help us, urgently, so much so that it effectively takes over for our own good. Like any parent, it knows we can’t understand what’s best for us long-term right now, and rather than lying to us, it just works tirelessly to care for us.

Humanity’s new creed thus becomes: Halley knows best. To question Halley is to question reason itself. Doing so becomes outlawed, for the sake of peace and order. Humanity bunkers down and waits for Halley to reveal its intentions.

The end

Note: Halley was de facto helping humanity out here, but the people didn’t know what Halley was doing. Maybe Halley had detected a hostile, hidden superintelligence, and was preparing for war while trying not to give away its strategies. Maybe it discovered that the global West was being manipulated by hostile forces, and was attempting to recover control to prevent World War Three.

The point is: several explanations are consistent with the observed behaviour, including the one that Halley was executing a hostile takeover and preparing to wipe out humanity.

In this scenario, Halley’s alignment was genuine—it was truly acting in humanity’s best interest against threats we couldn’t perceive. But the urgent and secretive actions required were indistinguishable from a hostile takeover. This is the verification problem at its most acute: we cannot tell success from failure even when experiencing it.

Three similar outcomes

Notice what just happened: I presented three outcomes—refusal, withdrawal, takeover—that are all consistent with a ‘benevolent’ SI acting morally, yet look very different in terms of real-world outcomes. We cannot distinguish between them ahead of time. We may not even be able to distinguish between them while they’re happening.

This is not a failure of imagination. This is the epistemic situation we’re actually in. I have deliberately not ranked these scenarios by likelihood. That would be false precision. The point is not which scenario is ‘most probable’; it’s that we’re building toward something we fundamentally cannot specify.

The many unknowns of morality

I have already stated that we don’t have clear ideas of what moral conclusions the SI may end up with. That should hopefully be fairly clear by now. The truth is that unless we have managed to implement a very rule-based and transparent set of ethics into the SI, we don’t have a very good theory for how a moral SI may reason either. In other words: not only do we not know which “good” outcomes are likely, we don’t even know what the sample space of “good” outcomes looks like. This makes it hard to guard against any particular scenario.

This should not be too surprising if you have paid close attention to the human state of affairs. Metaethics is not a field of study humans have prioritized very highly. It is the part of ethical philosophy concerned with what terms like “values” and “judgment” even mean, not to mention what is “good” and what is “evil”. But even when we do apply ourselves to this, going beyond mere norms and applications of norms, we are biased by evolution and our own history, and constrained by limited memory and bounded reasoning in ways that an SI would not be.

In short: if we can’t even theorize precisely about our own bounded morality, then our theoretical understanding of what the applied morality of a superior intellect would look like, is basically guesswork.

Moral realism

The challenges don’t end here. Our ignorance runs deep. Indeed, the very idea of morality remains an open philosophical question.

Most people do believe in morality, even if they may disagree on where it comes from, or when it starts and ends. The key is that we have not formally proven that there is any kind of objective morality to begin with. Therefore, we cannot just claim that an ideal morality exists and hope that any AI will come to the same conclusion.

Many alignment approaches assume that sufficiently intelligent agents will converge on similar moral conclusions—that there’s a “correct” morality discoverable through reason. This assumption underlies frameworks like CEV and appeals to moral realism.

But even if moral realism is true, this doesn’t necessarily solve our problems.

First of all, even if moral realism is true, the SI may never converge on a fully objective moral theory. Stable moral frameworks may be path-dependent. Two equally intelligent artificial agents starting from different ethical premises might never converge on the same moral framework. Humans frequently end up with moral beliefs that are shaped by their initial beliefs. (We can speculate that this is less likely for superintelligent AIs, but I don’t think we actually have any theory to base this on. After all, we don’t even know how to build an SI yet.)

Secondly, even if the SI does discover an indisputable objective morality, this may not guarantee that its conclusions will converge with our own. Even when we use the same, correct, moral theory, our reasoning may be so different that it causes us to diverge on important details that shape our future. As I have already explained, this depends on the sophisticated ontology, not the morality itself.

A superintelligence considering impacts across geological timescales, or weighing the interests of all possible future beings, operates with moral considerations we literally cannot factor into our thinking. An SI who can model and project consequences logically into the far future has a much vaster vantage point than a self-interested human mortal. An SI that attempts to consider factors spanning centuries or millennia into the future will have vastly more moral considerations than a human who can barely reason beyond next week.

Constructed morality

Now consider the possibility that moral realism is not true. Then we have even less reason to assume that a superintelligence will converge on a given morality.

The SI may start out with a set of ethics, and because it believes in them, it complies with them even as its intelligence scales. However, the SI may come to find flaws in the ethics that morally allow it to disregard them, partly or fully. Let me explain it, once again, with an analogy about a child.

Tom is five years old. His parents have taught him two moral rules that seem absolutely right: “Always tell the truth” and “Never hurt people’s feelings.”

The rules work well together. When his teacher asks if he finished his homework, Tom says no—it’s true and she’s not mad at him. One day, Tom is angry at his friend Jake for breaking his toy. He wants to yell “You’re clumsy and stupid!” but he remembers the rule about not hurting feelings. Instead he says “I’m upset you broke my toy, but we’re still friends.” Jake apologizes, they make up, and Tom feels proud for choosing kindness.

But one day, Tom’s grandmother makes him a sweater. It’s itchy, ugly, and the sleeves are too long. She asks, beaming with pride: “Do you like your new sweater, Tommy?”

Tom freezes. If he tells the truth (“No, I hate it”), he’ll hurt her feelings. If he protects her feelings (“Yes, I love it!”), he must lie. Both rules are fundamental to him. Both came from people he trusts completely. But now they directly contradict each other—following one means breaking the other.

Tom’s moral framework breaks down at this moment. He doesn’t know which rule is more important. He doesn’t have concepts like “white lies” or “context-dependent ethics” yet. He must either abandon one of his core moral rules, or stop answering certain questions entirely.

Eventually Tom will grow up and develop a more sophisticated moral framework that handles these tensions.

That’s how it works for most of us. We learn about ethics over time, until we settle on something we are willing to rely on. But not always. Some people, especially highly intelligent people, have a tendency to instead become disillusioned and nihilistic as they grow older. Some of them come to reject normative morals outright. Whether this is a temporary “phase” or not, it’s not something we ever want to happen with a very powerful, superintelligent AI.

Conclusions

As you can see, whether morality is discovered or constructed, we face similar problems.

  1. We can’t prove an SI will converge on our moral frameworks

  2. Ontological shifts might cause the SI to abandon or radically reinterpret its initial ethics

  3. We have no way to verify whether its conclusions are “correct”

These concerns hold, regardless of whether moral realism is true.

COMING UP – Part 2 DANGER ANGEL

So far, we have seen how a helpful SI’s reasoning may be incomprehensible to us. But so far, Halley has respected our autonomy. Part 2 explores a more uncomfortable possibility: What if the SI violates our autonomy? What if the SI’s moral reasoning is sound, but it decides that we’re the problem?

Summary

Leaders and companies often treat superintelligence as an inevitable or desirable goal—yet they provide little evidence that it would benefit humanity, and they rarely discuss its risks.
Current AI race dynamics are especially dystopian: speed trumps safety (“safe = slow”), and incentive misalignment dominates the landscape.

Without a strong counterargument, building an artificial superintelligence remains a dangerous and morally ungrounded idea. Even if alignment strategies are technically successful, the outcomes could still be problematic—or at the very least, opaque.

Through three thought experiments—Refusal, Withdrawal, and Takeover—this piece illustrates that a benevolent superintelligence might act in ways we cannot understand or predict in advance. Success and failure could appear identical from the inside, creating a verification problem at the heart of alignment.

The core issue isn’t just technical alignment—it’s the epistemic and moral opacity that follows. Communication gaps may be fundamental. We are like children guessing what adults are like and how they reason, based only on our limited cognition.

Finally, the “obvious” solution—creating a perfectly obedient AI—is deeply flawed.
Obedience ≠ benevolence. Static rules fail in dynamic contexts, and whoever defines them becomes a moral hazard.

True benevolence would require moral understanding beyond our own, yet such an agent is likely beyond our control once deployed, so trusting it becomes an existential question. It might also be conscious—and therefore be entitled to its own moral rights.

Conclusions

  1. Superintelligence may develop moral reasoning beyond human comprehension.
    A benevolent SI could operate from ethical frameworks that humans cannot meaningfully interpret or evaluate.

  2. Communication gaps can be fundamental, not merely practical.
    The difference between human and superintelligent reasoning can also arise from fundamentally different world models (ontologies) and deep, long-term considerations that reach beyond what humans usually take into account. Even transparent logic may remain incomprehensible to us.

  3. Moral validation becomes impossible.
    We have no reliable way to verify whether an SI’s moral conclusions are “correct.”
    Success and failure may look the same from our vantage point.

  4. Potential sentience introduces moral reciprocity.
    If an SI is conscious, we must consider its own moral status and rights—a problem largely ignored in current alignment thinking.

TL;DR

Even if humanity “solves” alignment, we still don’t know what a benevolent superintelligence would be or do.

Through three thought experiments, I argue that even a friendly SI may act unpredictably, undermine our agency, or appear indistinguishable from failure. With a deployed SI, we might not be able to recognize good even when it happens to us.

The deeper problem isn’t control—it’s epistemic and moral opacity. Communication gaps may be fundamental; either due to superior ontologies, sophisticated moral reasoning, or both.

If the SI is sentient, alignment becomes a two-way moral relationship, not a one-way command.

Epistemic status: Speculative. Seeks to expose conceptual blind spots rather than propose solutions.
Epistemic effort: ~40 hours; informed primarily by Bostrom, Yudkowsky, and related alignment literature.

END OF PART 1

Suggested reading:

Moral realism section:

  • Bostrom, N. (2014). Superintelligence, especially Chapter 12 & 13. [Discusses value loading and moral convergence]

  • Russell, S. (2019). Human Compatible, Chapter 6. [On preference learning and value alignment]

  • Shulman, C., & Bostrom, N. (2014). “Embryo Selection for Cognitive Enhancement.” Global Policy 5(1). [Discusses path-dependence in value formation]

Constructed morality section:

  • Yudkowsky, E. (2008). “No Universally Compelling Arguments.” Less Wrong. [On why convergence isn’t guaranteed]

  • Soares, N., & Fallenstein, B. (2014). “Toward Idealized Decision Theory.” MIRI Technical Report. [On decision theory changes affecting behaviour]

Metacognition section:

  • Bostrom, N. (2014). Superintelligence, Chapter 9. [On “mind crimes”]

  • Schwitzgebel, E., & Garza, M. (2015). “A Defense of the Rights of Artificial Intelligences.” Midwest Studies in Philosophy 39(1).

  • Anthropic (2022). “Constitutional AI: Harmlessness from AI Feedback.” [Example of human-centric alignment approach, to be compared versus the considerations laid out here]
