If Wentworth is right about natural abstractions, it would be bad for alignment

This post was written as part of the AI Safety Mentors and Mentees program. My mentor is Jacques Thibodeau.

In this post, I will distinguish between two hypotheses that are often conflated. To keep them apart, I first give each one its own name so I can talk about them separately:

The natural abstraction hypothesis (NAH):

There are natural ways to carve the world up into concepts, and many very different cognitive systems will naturally converge on these abstractions. So there is reason to believe that AIs will also form the abstractions that humans use (nails, persons, human values, ...).

The Wentworthian abstractions hypothesis (WAH):

There are natural abstractions, and they consist of exactly those properties of a system that are relevant for predicting how far-away objects behave.

Notice that the first might be true while the second is completely off, just as you can deny that Newtonian mechanics is correct without denying that massive objects attract each other.

Why natural abstractions are thought to be good for alignment

If NAH turns out to be correct, this would simplify two problems in alignment.

1. Interpretability

If the AI uses the same abstractions as us, it is probably way easier to read its mind.

2. Pointing at things

If the AI forms the abstraction “diamond” itself, we could just point at that abstraction in the AI’s mind and say “maximize that one”, instead of trying to rigorously formulate what a diamond is. This was proposed, in combination with shard theory, as an approach to the diamond-alignment problem. If the AI naturally forms an abstraction of human values, alignment might be easier than we thought (alignment by default): we could point at that abstraction by training the AI in such a way that it adheres to those values.

Wentworthian abstractions are about outer appearance, not inner structure

Wentworth hypothesizes that natural abstractions consist of the information that is relevant from afar. Let’s take the example of a nail. A particular nail consists of billions of atoms, and there is an overwhelming number of facts that could be stated about these particles. However, when you see a nail on the other side of the room, you only tend to think about a few of these facts: its elongated shape, its color, and its pointiness, for example. Let’s say these few pieces of abstract information about those billions of particles are what make you identify them as a nail (rather than the information that any particular atom is in any particular place).

This means that a certain kind of information is barred from being part of a Wentworthian abstraction: inner structure. Because Wentworthian abstractions only consist of the information that is relevant far away, what an object looks like up close is only part of a Wentworthian abstraction insofar as it influences the properties that are relevant far away. For example, the nail consists of iron atoms and not gold. If the nail were made of gold, it would have a different color and weight. However, this exact color and weight are not unique to iron. Presumably, you could mix other metals in such a way that the color and weight would be identical. This new metal would then share all Wentworthian abstractions with the iron nail, despite having a completely different inner structure.
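To make this concrete, here is a minimal toy sketch in Python (the objects, feature names, and numbers are all invented for illustration, not part of Wentworth’s formalism): the abstraction keeps only the features that matter at a distance, so two objects with different inner structure can map to exactly the same abstraction.

```python
from dataclasses import dataclass

@dataclass
class Microstate:
    """Full low-level description of an object (a stand-in for 'billions of atoms')."""
    composition: str      # inner structure: what the object is actually made of
    lattice: str          # inner structure: how the atoms are arranged
    shape: str            # relevant far away
    color: str            # relevant far away
    density_g_cm3: float  # relevant far away
    hardness_mohs: float  # relevant far away

def wentworthian_abstraction(obj: Microstate) -> tuple:
    """Keep only the information relevant for predicting how the object
    interacts with things far away; throw the inner structure away."""
    return (obj.shape, obj.color, round(obj.density_g_cm3, 1), round(obj.hardness_mohs, 1))

iron_nail = Microstate("pure iron", "body-centred cubic",
                       "elongated, pointy", "grey", 7.9, 4.0)
# A hypothetical alloy tuned so its far-away-visible properties match the iron nail:
alloy_nail = Microstate("mystery alloy", "amorphous",
                        "elongated, pointy", "grey", 7.9, 4.0)

# Different inner structure, identical Wentworthian abstraction:
assert wentworthian_abstraction(iron_nail) == wentworthian_abstraction(alloy_nail)
assert iron_nail.composition != alloy_nail.composition
```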

If WAH is true, this would be bad for alignment

I think that if the Wentworthian abstractions hypothesis is true, plans like “solving the diamond maximization problem” or “alignment by default” will fail. Here is why:

Let’s take the example of diamond maximization first:

Suppose the NAH and the WAH are both true, and we have adequate interpretability tools.

You train an AI in an environment that contains diamonds. Due to natural abstractions, it forms a diamond abstraction. You can use your interpretability tools to see that it has a diamond abstraction. You give the AI reward when it is around diamonds or acquires diamonds. It forms a diamond-shard (learns to value diamonds terminally). You use your interpretability tools to verify that it does actually value diamonds. You let it loose in the world, and it creates lots of diamonds.

If the WAH is true, the AI will recognize diamonds by the information about diamonds that is preserved over distance in a noisy environment: their hardness, their shininess, their density, and so on. It will value that abstraction, so it will steer the world toward containing things that, from afar, look hard, shiny, and dense. Importantly, the WAH predicts that the molecular structure of diamond (a carbon atom covalently bound to 4 other carbon atoms) will not be part of the diamond abstraction, because that is not information that is relevant over distance. Sure, the atomic structure determines the density, hardness, and shininess, but it might not be unique in those properties. The AI does not care about the atomic structure; it maximizes its abstraction of diamonds. So whether it will actually produce lots of diamonds depends on whether it finds some cheaper way to produce objects that are shiny, hard, and dense. Since the AI is smart, it will probably find a material with those properties that is easier to produce. So it will tile the universe with “diamonds” and not with diamonds.
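To see the failure mode concretely, here is the same argument as a toy calculation (a sketch with invented thresholds, costs, and materials, not a claim about any real training setup): if the learned value is a function of the far-away-relevant features only, a cheap look-alike scores exactly as well as a real diamond, and a cost-sensitive optimizer picks the look-alike.

```python
# Toy model of a "diamond shard": the learned value depends only on the
# far-away-relevant features of the diamond abstraction, not on atomic structure.
def diamond_abstraction(material: dict) -> tuple:
    return (material["hardness_mohs"] >= 9,
            material["shiny"],
            material["density_g_cm3"] > 3.0)

def shard_value(material: dict) -> int:
    # The shard "likes" anything whose abstraction matches that of a diamond.
    return 1 if diamond_abstraction(material) == (True, True, True) else 0

real_diamond = {
    "structure": "carbon, each atom covalently bound to 4 other carbon atoms",
    "hardness_mohs": 10, "shiny": True, "density_g_cm3": 3.5,
    "production_cost": 100.0,   # invented number
}
cheap_lookalike = {
    "structure": "some other hard, shiny, dense material",
    "hardness_mohs": 9, "shiny": True, "density_g_cm3": 3.5,
    "production_cost": 1.0,     # invented number
}

# Both score identically under the learned abstraction...
assert shard_value(real_diamond) == shard_value(cheap_lookalike) == 1

# ...so an optimizer that maximizes value per unit cost tiles the universe
# with whichever "diamond" is cheapest to produce.
best = max([real_diamond, cheap_lookalike],
           key=lambda m: shard_value(m) / m["production_cost"])
print(best["structure"])  # -> the cheap look-alike, not the real diamond
```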

Let’s see if the same thing happens in the human-value case. You expose the AI to a lot of data about humans. It forms an abstraction of human values. You then program the AI in such a way that it behaves in accordance with those values. You use your interpretability techniques to verify that the human values it identifies are plausible (something like “It is good when persons are happy, ...”). The AI is then released and acts in accordance with those values.

The critical piece here is that the human values the AI identifies probably refer to other abstractions (such as persons). If the value is “It is good when persons are happy”, then this value is only meaningfully executed when the AI has the proper notion of what “person” means. And what does the AI identify as a person? The WAH has an answer to that: anything that has the proper effect far away. So anything that looks like a person, reacts to stimuli like a person, and talks like a person. The WAH explicitly predicts that the inner computations that are typical of persons are not, in themselves, part of the AI’s person abstraction.

So the AI fills the universe with trillions of happy persons, living fulfilling lives full of Truth and Beauty. But if it ever finds something that looks like a person from the outside (and speaks like a person, ...) but is cheaper to produce and has a completely different inner structure (for example, much simpler, non-conscious GPT-style simulations of humans), it will happily tile the universe with those “persons”.

The WAH predicts that most minds will form abstractions according to outward appearance only, not according to inner structure. So whenever we value a concept because of its inner structure, the WAH predicts that an AI that wants to maximize that abstraction will be naturally misaligned: it will throw the inner structure under the bus to better maximize the outward appearance.

How can we fix this?

I see two ways in which it could be possible to build AIs that do care about inner structure.

  1. Wentworthian abstractions are not the correct framework for natural abstractions. It could be that most minds do not actually form concepts purely by outer appearance. It might be that I have misunderstood John Wentworth’s ideas, or that his approach is wrong. What makes this possibility plausible is that humans seem to care about inner structure (for example, persons vs. non-sentient “persons”).

  2. Mind designs where caring about inner structure is hard-coded in. An example of this could be infra-Bayesian agents. In infra-Bayesianism, the argument of the utility function is which computations are running in the universe. This is a function of inner structure only, not of outward appearance. Unfortunately, we do not know how to actually build an infra-Bayesian agent.