AI Safety “Textbook”. Test chapter. Orthogonality Thesis, Goodhart’s Law and Instrumental Convergence

This is a test chapter of the project described here. When the project is completed in its entirety, this chapter will most likely be the second one.

We want to create a sequence that a smart fifteen- or sixteen-year-old school student can read and that can encourage them to go into alignment. The optimization targets of this project are:

  • It must be correct. Obviously.

  • It should not be boring. It should be inspiring instead.

  • It should be easy to read. It should not be hard to read for school students.

  • It should not be stylistically similar to other texts with the same function. It should not be redundant.

  • It should contain as few technical details as possible. It is not a “how neural networks work” textbook.

  • It should help the reader enter the discourse of AI Safety and perceive other relevant texts in its context. It is more like a “how to understand AISafetyspeak” textbook.

  • It should contain as little information as possible that is likely to become obsolete in the next four to five years. Because it’s for school students.

All images are generated by StableDiffusion.

Most likely, this is not the final version of the text. We welcome comments and criticism, both about the goals set and about the tactics for achieving them. However, please explicitly separate the former from the latter.

Superhuman Artificial Intelligence is dangerous to us. To be precise, it is going to be a real threat to the existence of the human race, as soon as we create it.

Why? Because we still have no idea how to make AI share and obey our ethics as it makes its decisions.

You might say, “Well, that smart AI is gonna be Really Smart so it can understand our values!”

AI is so smart, and consequently so kind, digital illustration

But there’s

one tiny issue:

understanding someone’s values and objectives doesn’t mean accepting them as your own.

People are perfectly capable of using other people’s values to achieve their own objectives without sharing those values (you can read Machiavelli or drop in at a good advertising agency). So what would stop a self-educating AI from learning this from us?

We’re so disappointed and angry with politicians when we realize that their election promises are nothing but a means to promote themselves, not to make the world a better place. So how would it feel to find ourselves fully controlled by a superhuman intelligence and to realize we have been manipulated, but only at the moment when it’s already too late?

We’re afraid of psychopaths who can act caring and compassionate, realistically and authentically, and the next moment do evil without any twinge of remorse, because they feel they are allowed to do anything.

Hannibal Lecter cheerfully laugh

Needless to say, we can defy not only each other’s objectives, but also the global “objective” of the inevitable process that made us: evolution itself...

Thanks to human consciousness, we are capable of making choices: refraining from procreation, risking our lives for the sake of a complete stranger, dedicating our lives to the search for Bigfoot, destroying ourselves in the name of science (even though all of this runs against the “objectives” of evolution that we have discovered).

Imagine a meticulous alien scientist specializing in the culture of earthlings. Or a cultural scientist from Europe studying some surviving primitive cultures. A distant descendant of the Egyptians may study the way of life and morals of Ancient Egypt, but it’s highly unlikely that he will be inspired to offer a sacrifice to Horus or be convinced of the necessity of slavery. A scientist’s own views and objectives more often keep them detached, like an entomologist watching ants.

We’re able to see others’ values, to understand what shaped them, to accept the fact that someone near us wants to make the world fit those values, and still not share them.

Freedom or safety? Personality or community? Justice or sympathy?

We still can’t agree among ourselves, let alone with an alien “intelligence”...

In fact, in principle almost any level of intelligence is compatible with almost any values. This is called the Orthogonality Thesis (orthogonality is a fancy name for independence).

Important concepts:

  1. Orthogonality Thesis

Alien blue-orange morality

What would intelligent crystals value the most? What would be a virtue for plants? What kind of dream, what pressing problem would push an alien civilization with its alien consciousness forward? Let your imagination run free!

And now imagine a creature so curious that no suffering can stop it from satisfying its curiosity. Imagine someone who puts honor or victory higher than life. Someone who can sacrifice any pleasure for safety, and someone who can sacrifice any affection for power...

Did you? Who first came to your mind?

Here is our top chart: Doctor Heiter, a Japanese samurai, the bubble boy, and someone like Darth Plagueis.

And those are still people! Human values occupy only a tiny corner of the space of possible values. You can read, for example, “Three Worlds Collide” to get acquainted with something much stranger from our point of view.

Yes, all those beings will probably be inclined to adopt some common strategies and intermediate objectives on the way to their own Greater Good, but we’ll discuss that later.

For now we have to keep in mind the conclusion from the orthogonality thesis:

values are not universal, and being intelligent does not necessarily mean having ethics that we can understand!

And that is our purpose: to make AI share the values that are really ours. No matter how badly we want it, those values are not written in the sky by God’s hand, and we can’t count on an AI arriving at them naturally just because it is clever enough.


But, you’ll say, we will programme AI to have the right values, won’t we?

Even if we get around the Orthogonality Thesis and learn to teach an AI the same values that we have, there’s a second tiny issue:

it’s extremely hard to define the right values

Socrates isn’t sure he knows what “harm” is

Many people have thought about this, from Socrates to Heidegger, and even now we have no definite, universally applicable model of values.

Yes, Asimov invented the Three Laws of Robotics, but the whole point of his stories is that the laws simply don’t work when faced with reality. Three laws are not enough to foresee all the loopholes and misunderstandings. Ten laws, or even a hundred, may not be enough either, and it’s up to us to figure out what would be enough.

“A robot may not injure a human being or, through inaction, allow a human being to come to harm.”

Yes, of course!

Okay, now try to explain to a machine that perceives the world as zeroes and ones what “a robot” is. What is “a human”? What does “injure” mean? And, hardest of all, what is “harm”, what is “action”, and how do you tell it from “inaction”?

There are many stories about literal-minded genies who grant wishes letter for letter, leaving you with two left hands when you asked for two equally healthy hands; about cunning demons who follow the letter of the agreement but pervert its point entirely and take your soul as the price; and about stubborn workers on an “Italian strike” (work-to-rule), who do something similar, but to a lesser extent.

For example, there’s a joke about a guy who asked a genie for a canister with an endless supply of beer, and no one has been able to open that canister since.

After all, values are complicated and full of nuance, and setting an objective isn’t that far from making a wish...

Of course, machine learning specializes in exactly the things that are difficult for humans to define formally, such as pattern recognition. Can we precisely and formally describe how to tell whether a picture shows a cat or a dog? No. But can we train a neural network to do it? Yes! With values, however, we can’t just roll the dice and hope.

It’s okay to misidentify the pet in a picture once in a hundred times. But values that are even slightly misdefined, when optimized at a superintelligent level, will have disastrous consequences.

This inconvenient feature of reality is reflected in Goodhart’s law (originally from economics, but broadly generalized in the AI Safety field).

KPIs are more important than productivity

The point is this: suppose we want to achieve an objective that is hard to formalize, so we introduce a simpler intermediate metric or objective that is much easier to formalize. If enough power and effort is then applied to that metric, the correlation between our real objective and the intermediate one collapses. We can get a lot of what we asked for and still lose everything we actually wanted.

Important concepts:

  1. Orthogonality Thesis

  2. Goodhart’s Law

Or, put another way: when a means of measurement is taken as an objective, accidentally or deliberately, it rapidly moves us away from achieving the original objective.

There’s a funny story about a Soviet shoe factory that started to make only children’s sandals—because its efficiency was defined by the number of pairs of shoes made from a limited amount of leather.
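The shoe factory story can be turned into a toy calculation. The sketch below is only an illustration: the product names, leather costs, demand figures and leather budget are all made up, and the "follow real demand" plan is just one simple way of planning, not the best possible one. It compares what the metric says (pairs made) with what we actually wanted (people who got shoes they need).

```python
# Toy model of the shoe factory story. All numbers are made up for illustration.
LEATHER_PER_PAIR = {"children's sandals": 1, "adult shoes": 3, "winter boots": 5}
DEMAND = {"children's sandals": 200, "adult shoes": 500, "winter boots": 300}
LEATHER_BUDGET = 1500


def pairs_made(plan):
    """The metric the factory is judged by: total pairs produced."""
    return sum(plan.values())


def people_shod(plan):
    """What we actually want: pairs that somebody really needs."""
    return sum(min(pairs, DEMAND[kind]) for kind, pairs in plan.items())


def plan_maximizing_metric():
    """Spend all leather on whatever takes the least leather per pair."""
    cheapest = min(LEATHER_PER_PAIR, key=LEATHER_PER_PAIR.get)
    plan = {kind: 0 for kind in LEATHER_PER_PAIR}
    plan[cheapest] = LEATHER_BUDGET // LEATHER_PER_PAIR[cheapest]
    return plan


def plan_following_demand():
    """Cover real demand, cheapest kind first, until the leather runs out."""
    leather_left = LEATHER_BUDGET
    plan = {}
    for kind in sorted(LEATHER_PER_PAIR, key=LEATHER_PER_PAIR.get):
        pairs = min(DEMAND[kind], leather_left // LEATHER_PER_PAIR[kind])
        plan[kind] = pairs
        leather_left -= pairs * LEATHER_PER_PAIR[kind]
    return plan


if __name__ == "__main__":
    for name, plan in [("maximize the metric", plan_maximizing_metric()),
                       ("follow real demand", plan_following_demand())]:
        print(name, "| pairs made:", pairs_made(plan),
              "| people shod:", people_shod(plan))
```

In this toy world the metric-maximizing plan scores much higher on "pairs made" and much lower on "people shod". Nobody in the factory has to be malicious for that to happen; the plan simply does what the metric rewards.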

Or let’s recall something less funny: education reform. It seems like such an excellent idea to create a unified test system that excludes any influence of likes and dislikes on grades. What could possibly go wrong!?

But the moment the achievement of a hard-to-measure objective (children are well educated) gets tied to an easy-to-measure objective (children are good at our tests), one is seamlessly substituted for the other. At the output we suddenly have a virtuoso cheater instead of a well-educated young person, or a Chinese room that can give you the right answers without understanding them.

The measurement system influences the processes it measures and reshapes them to suit itself.

When we think about the correlation between what we ask for and what we need, we are most likely making some unconscious assumptions: we take for granted that some ways of achieving an objective are unacceptable to us. But if these assumptions aren’t clearly specified, why would they hold?

Here’s one interesting and demonstrative experiment.

In 1976 an American scientist decided to test a theory, widely held at the time, that predators are evolutionarily adapted to limit their own breeding according to the territory they occupy. In every generation he selected for breeding the predatory insects that produced the fewest offspring, expecting low fertility to become fixed at the genetic level after a few generations of selection. But, much to his horror and astonishment, the population soon consisted predominantly of highly prolific cannibals, which ate their own and others’ offspring without a twinge of conscience.

Evolution has never heard of ethics. Living beings adapt not in the best way, but in the most accessible one.

By the way, if you want more examples of such unpredictable interpretations, we recommend reading this essay.

But even if Goodhart’s law did not hang over us every time we feed easily measured objectives or metrics into a system, there is another question we would have to face:

If the superhuman intelligence we create has a superhuman ability to draw conclusions and perceive the world, then how can we be sure that our metrics still correlate with our objective from its point of view as well?

What if correlations that we can’t see or understand are obvious to it? And what if the correlations we see now disappear when viewed from a height of thought we can’t reach?

I am so smart I can’t understand you

For thousands of years people thought the Earth was flat, because the land they could see from their own height and from accessible elevations seemed flat. If those people had created an AI and asked it to take them to the edge of the flat Earth, what should the AI have done? Are we sure that everything we want makes any more sense?

Evolution has wired our brains to see patterns, and as science develops, we often find out that not all of them are real.

Or maybe the correlation we see does exist, but only in a world without AI (that is, without a good optimizer), or in a world where people draw conclusions only from the limited experience of living on a single planet, or in a world where nobody has yet discovered a way to break laws that seem “obvious” to us...

Microscopic smiles

Imagine a superhuman AI designed to make people happy, with a measurable metric for it (make the world smile more) and an additional objective (do not cause any harm to people). Also imagine that this AI has access to technology that lets it change the genetic information of all living beings so that the corners of their mouths are always turned up. Or to technology that lets it put the tiniest molecular smiles on every surface on the planet, from human skin to the vaults of underwater caves.

Such a fiasco...

The objective is achieved, weird meatbags aren’t pleased, AI draws far-reaching conclusions about our ability to set objectives.

Or rather, no. If we can predict that people wouldn’t be pleased in this situation, and if even psychopaths, who just observe our emotions without actually feeling them, can figure this out and avoid it, then an AI can do it too. And if it’s smarter than us (and we’re trying to make it so), it will draw conclusions (without changing its objective; you remember the Orthogonality Thesis, right?) and take precautions so that our objections don’t get in its way.


  1. Imagine a perfect (“much better than now” at least) utopia.

  2. Choose a metric (“total happiness”, “scientific progress”, “no prohibitions” or something like that) that is highly optimized in this utopia.

  3. Imagine an anti-utopia you wouldn’t want to live in at all, but with the same metric optimized (e.g. if “total happiness” was chosen, imagine an anti-utopia where everybody’s happy, but it’s still an anti-utopia).

  4. Choose another metric which, when optimized as well, turns the anti-utopia from step 3 back into a utopia (for example, assume that the “total happiness” objective couldn’t lead to utopia because you also have to take freedom of self-expression into account; if you do, maybe this time everything will be nice).

  5. Imagine another anti-utopia where both metrics are well optimized.

  6. Choose one more metric, imagine one more anti-utopia, etc, etc, until you get sick of it.

  7. Think about how this is connected to Goodhart’s law.

You may think AI has to be actively malicious to distort what we asked like this.

But no: it’s enough for it to conscientiously perform the task we gave it, using its optimization ability.

Everything else will be thoroughly spoiled by our inability to explain precisely what we want and by the feature of reality that Goodhart described.

Here’s the conclusion from all of this:

it’s extremely hard to give an AI precisely the right objectives, and objectives that are just a little wrong can easily cause a disaster

And fixing this problem is up to us.


For now, let’s go on to the next one :)

A tiny issue №3

It doesn’t matter what your main objective is. Maybe you want to invent a cure for cancer, or maybe you want more views on YouTube (how many more? Billions are better than millions, and trillions are better than billions. How about, let’s say, a trillion trillion?). To achieve it, it would be very useful to achieve some intermediary (instrumental) objectives:

  • stay alive

  • don’t let anyone change your objectives

  • get smarter and more capable

  • perceive the world

  • control more resources (when someone controls REALLY big resources we call it “take the world over”)

  • find or create a huge team ready to help you and share your objectives (not that hard in the case of AI, ctrl+C, ctrl+V—and done?)

A ball inevitably rolls to the lowest point of a funnel, water flows from top to bottom, and in plans for achieving objectives the same instrumental (i.e. intermediary) objectives naturally come up, simply because plans would be less effective without them. Like Goodhart’s law, this is a truth about the Universe, not a big banana skin thrown down by some malicious agent.

That’s how, for example, the self-preservation instinct arose in most highly developed living beings: those who weren’t afraid of fire and didn’t run away simply couldn’t pass on their “fearless genes” to the next generations. Natural selection did its work.

So when we see such centers of gravity among objectives, it’s not fantasy, magic, or a god’s design. It’s just another feature of relentless reality, and it makes the appearance of an AI that is dangerous to us more likely.
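Here is a minimal sketch of that idea, under loudly stated assumptions: a made-up agent has five steps, and at each step it either gathers resources or works on its final goal. The three goals and their payoff formulas are hypothetical and chosen only so that the goals differ from each other. The code tries every possible plan for each goal and keeps the best one.

```python
from itertools import product
from math import sqrt

ACTIONS = ("work", "gather")
STEPS = 5

# Hypothetical final goals: how much progress one "work" step yields
# when the agent controls a given amount of resources.
GOALS = {
    "cure for cancer": lambda resources: 3 * resources,
    "YouTube views": lambda resources: resources ** 2,
    "warm kitten beds": lambda resources: sqrt(resources),
}


def total_progress(plan, progress_per_step):
    """Simulate a plan: 'gather' adds one resource, 'work' adds progress."""
    resources = 1
    total = 0
    for action in plan:
        if action == "gather":
            resources += 1
        else:
            total += progress_per_step(resources)
    return total


def best_plan(progress_per_step):
    """Brute force: try every possible plan and keep the best one."""
    return max(product(ACTIONS, repeat=STEPS),
               key=lambda plan: total_progress(plan, progress_per_step))


BEST_PLANS = {goal: best_plan(formula) for goal, formula in GOALS.items()}

if __name__ == "__main__":
    for goal, plan in BEST_PLANS.items():
        print(goal, "->", plan)
```

The best plans differ in how long they keep gathering, but in this toy world every one of them starts by gathering resources, whatever the final goal is. Nothing in the code tells the agent to want resources; it just falls out of the search for the most effective plan.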

everybody wants to take over the world

In addition to the above, by the way, it would be good to keep your objectives secret from those who would want to stop you if they found out about them. Well, you know, just in case. And it is a very likely case, if you remember the item about controlling resources and taking over the world.

As a result, an AI that is aware of its situation will probably soon calculate that, to achieve the objectives given to it, it is very useful not to let anyone reprogram it or switch it off, and that it would also be great to get more and more computing capacity and information. And to achieve those intermediary objectives, it is very useful to have access to the Internet, control more resources, have more freedom of action, and, maybe in the distant future, cover the surface of the Earth with supercomputers.

“Well, that’s a hell of a prospect! And what happens to us then?” you might ask.

At first it is advantageous for the AI to hide those objectives from people. Later it can just get rid of us or (if people managed to set the objective in a way that prohibits getting rid of people) it can play the sophist and leave a small piece of brain from every human being, defining that as “a human”. All just to prevent people from switching off “the bright mind, busy with useful work” if they find out its plan.

If a kitten is hunting the yarn, it’s only natural to put it out the door so that it doesn’t interfere with knitting a warm bed for it: “hey, I’m working for you, silly!”

And the best cure for an aching head, as is well known, is a sharp ax...

In general, the inevitability of the appearance of these intermediary objectives is called Instrumental Convergence.

It’s a scary term, but you can draw an analogy with convergent evolution: a process that gives species of different structure and origin, living in the same environment, similar organs that perform similar functions, like the wings of birds and butterflies, or the tendrils of grapes and peas.

Important concepts:

  1. Orthogonality Thesis

  2. Goodhart’s Law

  3. Instrumental Convergence

Just as only those who have adapted to an environment in a more or less optimal way survive in it, so the systems that manage to preserve themselves on the way to an objective are the ones that achieve their objectives most successfully.

Narrower cases can also be considered. For instance, examples of convergent instrumental objectives in human society include earning money and achieving high status.

Exercise: think of similar examples. That is, choose an environment (the human economy, a battlefield, your favorite fictional universe, anything else) and think about what intermediary objectives would be useful for most of the final objectives typical of that environment.

It may seem again that AI has to be malicious to want to take over the world or something like this. We consider it highly unlikely that demons will come out of thin air, AI will become possessed and start hunting people for the glory of Beelzebub.

So the point is rather that the really optimal strategy for achieving almost any objective would involve taking over the world. It’s not a malicious AI’s feature, it’s the feature of our reality.

Our task is rather hard—we want to create AI which doesn’t have convergent instrumental objectives but is effective and stable at the same time.

In particular, AI must allow people to switch it off, to change its objectives and to meddle with its code.

But it mustn’t make people switch it off or change its objectives (in other words, we can’t just teach it that “being switched off or changed is good”).

AI has no particular opinion about existing

In short, this requirement is called corrigibility.
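Why both requirements are hard to satisfy at once can be seen in a toy expected-value calculation. This is only a sketch under made-up assumptions: the three actions, the 50% chance that people press a working off switch, and the numbers for the goal and the "bonus for being off" are all hypothetical. The agent simply picks the action with the highest expected value.

```python
# All numbers below are assumptions made up for this illustration.
P_SHUTDOWN = 0.5     # chance that people press a working off switch
GOAL_VALUE = 10      # value the agent gets from finishing its task
SHUTDOWN_BONUS = 20  # reward for being off, given only to the second agent

ACTIONS = ("just work", "disable the off switch", "switch itself off")


def expected_utility(action, shutdown_bonus):
    """Average outcome of an action for an agent with the given bonus."""
    if action == "just work":
        # Either the agent finishes the task, or people switch it off first.
        return (1 - P_SHUTDOWN) * GOAL_VALUE + P_SHUTDOWN * shutdown_bonus
    if action == "disable the off switch":
        # Nobody can interrupt it, so it always finishes the task.
        return GOAL_VALUE
    if action == "switch itself off":
        # It never finishes the task and only collects the bonus, if any.
        return shutdown_bonus
    raise ValueError(action)


def chosen_action(shutdown_bonus):
    """The agent picks whichever action has the highest expected utility."""
    return max(ACTIONS, key=lambda action: expected_utility(action, shutdown_bonus))


if __name__ == "__main__":
    print("agent that only cares about its goal:", chosen_action(0))
    print("agent taught that being off is good:", chosen_action(SHUTDOWN_BONUS))
```

With these numbers, the agent that only cares about its goal disables the off switch, and the agent that was taught "being off is good" switches itself off instead of working. Neither behaves the way we want, which is why corrigibility cannot be obtained just by adjusting one reward.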

Important concepts:

  1. Orthogonality Thesis

  2. Goodhart’s Law

  3. Instrumental Convergence

  4. Corrigibility

So here goes our last conclusion for today:

The emergence of convergent instrumental objectives in sufficiently intelligent beings is natural. And corrigibility is unnatural for them.

And it would be great to solve this issue before people bring into the world an incorrigible AI that wishes to survive at any cost.
