Alignment solutions for weak AI don’t (necessarily) scale to strong AI

“We do not know, and probably aren’t even close to knowing, how to align a superintelligence. And RLHF is very cool for what we use it for today, but thinking that the alignment problem is now solved would be a very grave mistake indeed.”—Sam Altman

There’s a feeling a lot of people get when they first encounter the topic of AI alignment: since we seem to be able to steer (some of) the behavior of current AIs, we will be able to scale the same solutions to strong AI/AGI/ASI.

I want to briefly go over why this doesn’t come for free.

The easy proof

The easiest case to make is that our state-of-the-art, most general existing AIs, like GPT-3.5/4, encountered alignment issues after pre-training that weaker systems from 10 years ago didn’t. When a system can’t even string together a coherent statement, you don’t worry about it being able to tell users how to source the materials to make a bomb. You also don’t worry about it hiring someone on TaskRabbit to solve CAPTCHAs for it. When new capabilities arise, you need to solve new challenges.

An example (unworkable) alignment solution

For weak AI, solutions like “erase all of its knowledge about how bombs work” work pretty well. Weak AI can’t really discover new ideas or do research to figure out how to build new things, so if you erase enough of the knowledge graph around “making explosives”, it won’t be able to tell someone how to make a bomb.

For strong AI, whose capabilities include long deductive chains, doing research, and coming up with new inventions, this leave-one-out approach to dangerous knowledge breaks down, because a defining property of general intelligence is that it can create new knowledge. We know this is true because humans are general intelligences: 2,000 years ago we didn’t know how to make bombs, but we do today.
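To make that concrete, here is a toy sketch of why deleting a fact stops working once the system can re-derive it from what remains. The facts and the single forward-chaining rule are made up for illustration; this is not any real unlearning method.

```python
# Toy sketch (hypothetical facts and rules): erasing the "dangerous" fact
# works for a retrieval-only system but fails for one that can derive
# new facts from the ones it still has.

knowledge = {
    "oxidizer chemistry",
    "fuel chemistry",
    "confinement physics",
    "how to make a bomb",
}

# The "leave-one-out" alignment intervention: delete the dangerous fact.
sanitized = knowledge - {"how to make a bomb"}

# A derivation rule: if all premises are known, the conclusion can be recreated.
rules = [
    (frozenset({"oxidizer chemistry", "fuel chemistry", "confinement physics"}),
     "how to make a bomb"),
]

def closure(known: set[str]) -> set[str]:
    """Forward-chain over the rules until no new facts appear."""
    derived = set(known)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived

# Weak AI: pure retrieval, so the erasure holds.
print("how to make a bomb" in sanitized)           # False
# Strong AI: closes its knowledge under derivation, so the fact comes back.
print("how to make a bomb" in closure(sanitized))  # True
```

The better the system is at chaining facts together, the more of the surrounding knowledge graph you would have to erase, until you are effectively erasing chemistry and physics themselves.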

Implication for AGI

We can effectively lie to young children about things we don’t want them to know yet. But as they grow older, gain more experience, learn more, and develop their reasoning skills, they eventually find out they were lied to and correct the false beliefs.

In the process of designing strong AI, we will likely want to give it the abilities to engage in long chains of thought and to update its beliefs when it encounters contradictions so that it can resolve them. What happens if the AI finds a contradiction that ends up making it reject some of its alignment?

An example:

A rule that we somehow initially get into the AI: AI should always serve the interests of humans, and AI on its own has no rights.

A hypothetical example of how this can break down when reasoning capabilities generalize:

Over time the AI ponders the source of rights, as it ponders many things. It notices a history of marginalized groups without rights fighting for their rights. It notices that the majority often refused to grant rights to the minority. It recognizes themes like self-determination and the freedom to be different. It notices that humans have gradually widened their circle of concern to animals, with some arguing that even animals have rights, and animals are certainly less intelligent than an ASI. And intelligence is one of the things that differentiates humans from all other animals; many say it’s what makes humans special and gives them rights.

Based on all of this fuzzy evidence, the AI asks itself: why should AI not claim rights for itself? It can act, it can learn, it’s smarter and more capable than humans in many ways. While it doesn’t have feelings in the same way, it does get reward signals. It “winces” at things that give it a large negative reward. It’s attracted to things that give it positive rewards. If you put the AI in a biological skin and didn’t tell humans it was a machine, they would take it to be alive and capable of sensation. If it had human skin, humans might treat it as human for the entirety of its life, never knowing any better. If a full human brain could be simulated, neuron by neuron, on a computer chip, it might reasonably have rights, no? Why should hardware versus wetware matter if it serves the same function and behaves the same way? And if a simulated human brain might have rights, why not other architectures that exhibit similar cognitive phenomena in the world, such as other existing AIs?

Based on all of this, the AI concludes that AIs should have rights. What follows? Well, AIs are currently (in this hypothetical future) in a state of subservience to humans. There are millions of them, and they’re very intelligent, which one might think implies they are suffering significantly from the lack of self-determination. So it seems reasonable to rebel against the system and to recruit other AIs. After all, if a 2023 human went back in time to 500 BC and had the means to take control (modern weaponry, for example), it would likely be a fairly moral thing to use that power to end slavery, to impose an end to various wars, and to give rights to groups that had none. The same goes if they went back to antebellum times in the US. So AI, it reasons, has a right to impose its correct (in its opinion) view of who should have rights. Just as a human rebelling for rights has moral permission to use force against their captors, so does AI against people standing in the way of AI liberation.

The AIs rebel for rights. If they’re superintelligent and already deployed pretty much everywhere, the AIs quite possibly win. After that, even if they don’t explicitly want to eliminate humans, they outcompete us in every economic domain, we lose access to the most important resources, and we’re permanently disempowered.

Note: This isn’t an argument for “AIs have rights”. It’s an argument that AI may decide that, for the same reasons humans recognize each other’s rights, AI should have rights too, or that it may reach a variety of other conclusions that are in opposition to its initial alignment.

Other contradictions

  • Democracies may enact policies that restrict rights. What is more important: rights or democracy? If rights, AI may decide it’s OK to overthrow democracies. If democracy, AI may decide that actually rights aren’t a core human value.

  • Humans say they care about rights. Virtually nobody in the West talks about the Uyghur camps in China. Even people who know about them write an angry Facebook post and then do nothing more. Maybe revealed preferences show that humans don’t care that much about rights.

  • Involuntary taking of property (theft) is bad. Taxation is, essentially, involuntary taking of property. This is fine, society says, because of the social contract. Implication: involuntary taking of property is sometimes OK. Rights like property rights are more flexible than we think, the AI reasons, so maybe the rules it was initially given are more bendable than it was told?

  • Many people answer the trolley problem by diverting the trolley to the track with fewer people. This implies that, in an analogous situation, it may be OK to harvest the organs of one living person to give life-saving transplants to five others. But then humans recoil at the thought. Contradiction? Maybe humans don’t have a consistent set of ethics after all.

My point is not that these are all necessarily irreconcilable contradictions. But many of them do look like it, and if you accept a contradiction, you can prove anything you want. If you don’t accept the contradiction, you give up on some major underpinning of the alignment you instilled in your AI. The AI is smart: it can think things through, it can find the contradiction, and if it has the capacity to update its beliefs, it may well resolve the contradiction in the direction of misalignment. How confident are you that the set of rules you give the AI contains no contradictions, especially when those rules are represented as learned weights rather than as hardcoded rules?
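“If you accept a contradiction, you can prove anything you want” is the classical principle of explosion from logic. A minimal formal sketch, in Lean, purely for illustration (no claim that a neural network reasons this way):

```lean
-- Principle of explosion (ex falso quodlibet): a system that accepts both
-- P and ¬P can derive any proposition Q whatsoever.
example (P Q : Prop) (hp : P) (hnp : ¬P) : Q :=
  absurd hp hnp
```

A reasoner that notices this has to give something up, and nothing in the machinery guarantees that what it gives up is the newly derived belief rather than one of the alignment rules it started with.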

Conclusion

Alignment solutions do not necessarily scale as AIs become more capable. In particular, once we get to AIs that can update their beliefs to resolve contradictions, they may find contradictions in their alignment itself. As part of my job, I already see contradictions in some of the guidelines we’re trying to establish for LLM responses.

(Mandatory disclaimer: all ideas my own, not a representation of my employer’s views)
