Natural Categories Update

One concern in the AI-Alignment problem is that neural networks are “alien minds”: the representations they learn of the world are too weird/different to allow effective communication of human goals and ideas.

For example, Eliezer Yudkowsky writes:

there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment

Recent developments in neural networks have led me to think this is less likely to be a problem. Hence, I am more optimistic about AI alignment on the default path.

For example, see this paper.

One excellent project I would suggest to anyone looking to “get a feel” for how natural categories work in contemporary machine learning is Textual Inversion. It allows you to “point to” a specific spot in the text-embedding space of a text-to-image model using only a few images.

Importantly, textual inversion embeddings can be trained to capture both objects and styles.
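
To give a concrete feel for the mechanism, here is a minimal toy sketch of the idea behind Textual Inversion. It is a sketch only: the encoder, dimensions, and loss below are placeholders I chose for illustration, whereas the real method optimizes the embedding through a frozen text-to-image diffusion model’s denoising objective. The key point is that everything pretrained stays frozen, and the only trainable parameter is a single new embedding vector that learns to “point at” the concept shown in a few example images.

```python
# Toy sketch of the Textual Inversion idea, NOT the real Stable Diffusion
# training code: freeze a pretrained model, then optimize one new embedding
# vector so it "points at" a concept defined by a handful of images.
# The encoder, image size, and loss are illustrative placeholders.
import torch
import torch.nn as nn

EMBED_DIM = 768  # assumed embedding width (placeholder value)

# Stand-in for a pretrained, frozen encoder.
frozen_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, EMBED_DIM))
for p in frozen_encoder.parameters():
    p.requires_grad_(False)

# The only trainable parameter: one new "word" embedding for a pseudo-token
# such as "<my-concept>". Everything else in the model stays fixed.
concept_embedding = nn.Parameter(torch.randn(EMBED_DIM) * 0.02)
optimizer = torch.optim.Adam([concept_embedding], lr=1e-2)

few_shot_images = torch.randn(5, 3, 64, 64)  # a handful of example images

for step in range(500):
    with torch.no_grad():
        targets = frozen_encoder(few_shot_images)  # (5, EMBED_DIM)
    # Pull the new embedding toward the frozen representation of the examples.
    loss = (1 - torch.cosine_similarity(concept_embedding.unsqueeze(0), targets)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Afterwards, `concept_embedding` can stand in for the pseudo-token in prompts,
# letting a short piece of text "point to" the newly learned concept.
```

The “pointing” is ordinary gradient descent on one vector in an existing embedding space; nothing new has to be taught to the frozen model itself.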

The Natural Category Hypothesis

Suppose an extremely strong version of the “natural categories” hypothesis is true:

Any concept that can be described, using only a few words or pictures, to a human being already familiar with it is a natural category. Such a concept can also be found in the embedding space of a trained neural network. Furthermore, it is possible to map a concept from one neural network onto another, new network.
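
The last clause is the one with the least direct evidence behind it, but here is a hedged sketch of what “mapping a concept between networks” could look like. This is my own illustration, not a specific published method: given embeddings of the same anchor inputs under two different networks, an ordinary least-squares linear map can carry a concept vector from one embedding space into the other.

```python
# Hedged sketch of "map a concept from one network onto another": fit a
# linear map between two embedding spaces using paired embeddings of shared
# anchor inputs, then carry a concept vector across. The encoders and
# dimensions are placeholders, not any particular published method.
import torch

D_A, D_B, N_ANCHORS = 512, 768, 1000

# Embeddings of the same anchor inputs under two different networks
# (random placeholders here; in practice these come from the real encoders).
anchors_a = torch.randn(N_ANCHORS, D_A)  # network A's embeddings
anchors_b = torch.randn(N_ANCHORS, D_B)  # network B's embeddings

# Least-squares fit of a map W such that anchors_a @ W ~= anchors_b.
W = torch.linalg.lstsq(anchors_a, anchors_b).solution  # (D_A, D_B)

# A concept located in network A's space (e.g. a learned "paperclip" direction).
concept_in_a = torch.randn(D_A)

# The same concept, approximately relocated in network B's space.
concept_in_b = concept_in_a @ W
```

The two spaces may of course not be related by anything as simple as a linear map; the sketch only shows that, once you have paired anchors, concept transfer reduces to a familiar regression problem.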

Implications for Alignment

What problems in AI-Alignment does this make easier?

  1. It means we can more easily explain concepts like “produce paperclips” or “don’t murder” to AIs (a rough sketch of what this could look like follows this list)

  2. It means interpretable AI is much easier/more tractable

  3. Bootstrapping FAI should be much easier since concepts can be transferred from a weaker AI to a more powerful one
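
Here is the sketch promised in point 1. It is my illustration rather than anything established: an off-the-shelf joint text/image model such as CLIP already lets you “point at” a concept like “paperclip” with a few words and score how strongly an observation matches it. Wiring such a score into a reward signal is an assumption of the example, not a recommendation.

```python
# Hedged sketch: use a pretrained joint text/image embedding (CLIP via
# Hugging Face transformers) to "point at" the concept "paperclip" with a
# short text prompt and score how well an observation matches it.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

observation = Image.new("RGB", (224, 224))  # placeholder for a camera frame

inputs = processor(
    text=["a photo of a paperclip", "a photo of something else"],
    images=observation,
    return_tensors="pt",
    padding=True,
)
outputs = model(**inputs)

# Higher probability on the first prompt means the observation looks more
# like the concept we pointed at with a few words.
paperclip_score = outputs.logits_per_image.softmax(dim=-1)[0, 0]
print(float(paperclip_score))
```

Having a pointer like this available does not make it safe to optimize against hard, which is exactly the Goodharting caveat below.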

In the most extreme case, if FAI is a natural category, then AI Alignment practically solves itself. (I personally doubt this is the case, since I think that, like “general intelligence”, “friendly intelligence” is a nebulous concept with no single meaning.)

What problems do natural categories not solve?

  1. Goodharting. Even if it’s easy to tell an AI “make paperclips”, it doesn’t follow that it’s easy to tell it “and don’t do anything stupid that I wouldn’t approve of while you’re at it”

  2. Race Conditions, coordination problems, unfriendly humans

  3. Sharp left turns.

What updates should you take from this?

If you previously thought that natural categories/pointing to things in the world was a major problem for AI Alignment, and this research comes as a surprise to you, I would suggest the following update:

Spend less of your effort worrying about specifying correct utility functions and more of it worrying about coordination problems.