Lately I’ve been spinning up on natural abstractions, and in particular on John Wentworth’s work on natural latents. As I’ve been studying, I’ve noticed some big gaps in the existing literature: some of my biggest questions have not been answered by the blog posts and writeups I’ve found.

One of my grumps about the existing body of work has to do with the typology of concepts, and the representative examples we’re using for that typology.

If we’re going to do a lot of work to talk about concepts using math, I’m going to want to work a bunch of concrete examples to some level of precision. So far I’m not happy with the list of examples, and I’m not happy with the level of hand-waving in tying the math back to the various kinds of examples.

It seems to me that there are a lot of different kinds of concepts. Some concepts are “more abstract” than others – or to put it another way, some concepts map back very clearly to the physics of our universe, while others seem more fuzzy, hard to pin down, and maybe not “natural” at all. Some concepts are big clusters containing lots of varying examples; some attempt to capture one instance of a thing. Some concepts have to do with relationships between other concepts. Some concepts are reflective. And so on.

I think it would be a mistake to try to build a full concept typology at this point. Ideally you want the structure of the environment you’re modeling to dictate the concept typology, not the other way around. That said, I do long to have a set of example concepts to draw from as I work through some of my questions about the natural latents math – and for that set to span a bunch of different types of concepts. So I’ve cheated and used my own experience as an agent thinking about concepts to guess at some important and interesting concept types.

In this post I’ll give some probably-familiar background about what we mean by concepts, and then I’ll gesture vaguely in the direction of what we need in our concept typology.

Concepts that Bind to Reality

This section is a brief foundational primer; there’s nothing new here. Readers already familiar with the existing literature on natural abstractions can skip to the next section.

Here we have two dudes looking at, and thinking about, a tree. (One of the dudes happens to be a human and one happens to be a robot.)

We want to know:

Do they each think about “the tree” as, like, a thing?
If they try to talk to each other about the tree, will that work? Will they be talking about the same thing?
Basically – do they pluck out the same concepts from their environment? How? Why? How reliable is that? What are the preconditions? etc.

Why do we care?

We care because:

We hope to understand how AIs work.
- How do they represent and manipulate concepts, including fairly sophisticated concepts?
- What are they thinking about at any given time?
- This is a fairly deep version of mechanistic interpretability. Done well, it would go way beyond locating the Eiffel Tower neuron in a neural net and let us capture much more complicated thought patterns.
We hope to communicate effectively with AIs.
- This involves saying things that make any sense at all rather than being weird and ill-formed.^[1]
- It also involves saying things that the AI understands the same way we do.^[2]

Now let’s define the terms in the section title.

“Reality”

By “reality” we mostly mean physics things. States of matter in the universe. Or at least, that’s where we start.

“Concepts”

By “concepts” we mean ideas that live in the mind of an agent (human, alien, AI).^[3]

We almost never specify a state of matter very precisely, so concepts are almost always higher-level and more abstract or categorical than that. Do you (think you) know approximately what a “dog” is, or could you at least pick one out of a lineup of various mammals? Then you have a “dog” concept.

There are some kinds of concepts we’re not talking about, at least not yet. We’re punting on parts of speech other than nouns and noun phrases, because nouns alone are going to be plenty of work. We’re also not going to get into concepts that don’t bind to physics-reality at all but are still interesting – for example, we won’t talk about mathematical concepts from group theory.

“Bind to”

By “bind to,” we mean creating reliable and consistent mappings between concepts and reality.

We don’t need to bind incredibly precisely – having some sloppiness here is, in one sense, the point; abstraction necessarily involves loss of precision.

And we expect different minds to have different concepts for lots of reasons. The most obvious is that they may have been exposed to different environments. But holding the environment constant, they may have had different experiences of that environment, different sensory apparatus, or they may just be bad at reasoning, inference, or generalization. Most agents are not ideal!

When we say that a concept binds to reality, we’re claiming that the agent can derive solid predictive power from that concept. Their idea of a “tree” captures some fundamental tree-ness that allows them to recognize other examples of trees and make correct predictions about the properties of those new trees.

We’re also saying that the agent has gone beyond memorizing individual examples and has generalized: it has captured some structure in the environment and encoded it. Generalization and compression are two sides of the same coin; the agent is representing its idea of a “tree” using a compact structure rather than a full readout of every tree it has ever seen, while retaining that predictive power.
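To make the memorization-versus-compression contrast concrete, here is a deliberately crude toy sketch. Everything in it is my own assumption for illustration: the tree heights are made-up numbers, and summarizing “tree” as a mean and a standard deviation of one measurement is a stand-in for a real concept, not a claim about how natural latents are constructed.

```python
from statistics import mean, pstdev

# Made-up observations (heights in meters) of trees the agent has seen.
observed_heights_m = [4.0, 6.0, 5.0, 7.0, 3.0]

# Memorization: keep a full readout of every example.
memorized = list(observed_heights_m)

# Compression: keep a compact summary (two numbers) instead.
summary = (mean(observed_heights_m), pstdev(observed_heights_m))


def plausible(height_m, summary, k=3.0):
    """Return True if a new height falls within k standard deviations
    of the summarized mean. `k` is an arbitrary illustrative threshold."""
    mu, sigma = summary
    return abs(height_m - mu) <= k * sigma


# A never-before-seen tree: absent from the lookup table, but the
# compact summary still says something useful about it.
print(5.5 in memorized)          # exact-match lookup on a new example
print(plausible(5.5, summary))   # judgment from the compact summary
print(plausible(50.0, summary))  # a wildly un-tree-like height
```

The lookup table has nothing to say about an example it has never seen, while the two-number summary still supports a (rough) prediction. That is the sense in which the compact encoding retains predictive power.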

The case for building a half-assed concept typology with representative examples

In their work on natural latents, John and David use a few examples repeatedly. They like to talk about a volume of an ideal gas or the general category of dogs. Sometimes they talk about teacups, biased coins, or Ising models. They like trees. (I guess I like trees too. I opened with a tree example.)

They very rarely talk about anything super abstract and fuzzy, like “friendship” or “loyalty” or “beauty” or “goodness.” And yet, a lot of the discourse^[4] is about these sorts of fuzzy human values, the sorts of things that might end up in a human’s CEV and be relevant to broader alignment questions.

I’d like to do a little bit – but not a lot – better with concept typology. I’m not looking to reinvent the field of semantics from scratch as a side gig, nor do I want to be so fiddly with this that I end up trying to choose the ontology myself; that never works. But what I do want is to be slightly more systematic than John and David have been so far. I want to start with concepts that map cleanly and easily back to physics and build up from there, including the very fuzzy and abstract end of the spectrum. I want to do a better job with the category vs. instance distinction. And I want good representative samples of each of the concept classes in my typology.

More importantly, when we get around to actually constructing (possibly-natural) latents for these concepts, I’d like to do that a little more slowly and carefully, with moderately less handwaving.

I want to do this mostly to prove to myself that I can, that I’ve actually understood how all of this machinery is supposed to work.

And as we build out new bits of machinery to work with (natural) latents, I want to have a sort of test suite of examples to run through, to make sure everything works, kind of like unit tests in software.

My initial brainstorm of example concepts

Here’s what I’ve got so far.

Well, first off, there’s a wide world of parts of speech. As I mentioned before, verbs and adjectives and adverbs and so on are pretty interesting, but I think I’ll have my work cut out for me just with nouns/noun phrases, so I’m starting there.

There’s also a wide world of relationships between concepts, like time, causality, locality, and so on. I’m ignoring all that for now too.

Within nouns, I’m definitely interested in objects – categories of objects and also specific individual objects. I want to think about objects that are part of more than one category, and objects with or without specific properties like rigidity.

I’m interested in biological entities with at least some agentic properties. Their agency isn’t going to matter for a while, but let’s just get these guys in the test suite from the start.

And yes, I want to spend at least a little time on very abstract concepts, perhaps ones dealing with how agentic beings interact with each other.

So that led me to the following list for my nascent concept typology test suite:

objects (categorical and individual)
- approximately rigid-body
  - the category of balls (as in, round objects good for throwing)
  - a specific ball affectionately known as Bluey
  - the category of oranges (the fruit)
- an enclosed volume of ideal gas
agentic beings (categorical and individual)
- dogs in general
- my (fictitious) beloved specific dog Fido
concepts with much less direct relationships to states of matter
- the pecking order in a flock of chickens
- monogamy
- consciousness
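Since I keep calling this a test suite, here is a minimal sketch of what the list might look like as actual data. The class name, field names, and the `run_suite` helper are all hypothetical, invented for illustration. Where the list above doesn’t say whether an example is a category or an instance, the sketch leaves that field empty rather than guessing.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class ExampleConcept:
    name: str
    group: str                     # top-level grouping from the list
    scope: Optional[str] = None    # "category" or "instance"; None if unspecified
    subtype: Optional[str] = None  # e.g. "approximately rigid-body"


OBJECTS = "objects"
AGENTIC = "agentic beings"
ABSTRACT = "less direct relationship to states of matter"
RIGID = "approximately rigid-body"

TEST_SUITE: List[ExampleConcept] = [
    ExampleConcept("balls", OBJECTS, "category", RIGID),
    ExampleConcept("Bluey (a specific ball)", OBJECTS, "instance", RIGID),
    ExampleConcept("oranges", OBJECTS, "category", RIGID),
    ExampleConcept("an enclosed volume of ideal gas", OBJECTS),
    ExampleConcept("dogs", AGENTIC, "category"),
    ExampleConcept("Fido (a specific dog)", AGENTIC, "instance"),
    ExampleConcept("pecking order in a flock of chickens", ABSTRACT),
    ExampleConcept("monogamy", ABSTRACT),
    ExampleConcept("consciousness", ABSTRACT),
]


def run_suite(check: Callable[[ExampleConcept], bool]) -> List[str]:
    """Apply a hypothetical check to every example; return names that fail."""
    return [c.name for c in TEST_SUITE if not check(c)]


# Example: a placeholder check that only "handles" non-abstract concepts,
# so the three abstract examples are reported as failures.
failures = run_suite(lambda c: c.group != ABSTRACT)
print(failures)
```

The idea is that each new bit of latent machinery would supply its own `check`, and the suite would report which example concepts it can’t yet handle.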

This will probably not be enough! Not even close! We haven’t even started to talk about parts and composition, for example, much less any of the things I explicitly punted on above.

But it’s a start, and these are the examples I’ll come back to in my next few posts about concepts.

Your nominations for additions to my list will be considered, and frankly, probably discarded, because wow there’s a lot of work to do as it is. But please go ahead and make ’em anyway, if you like.

^[1] This is one aspect of the Pointers Problem: “Some of the things I value may not actually exist—I may simply be wrong about which high-level things inhabit our world.”
^[2] See also: Interoperable Semantics.
^[3] What kind of agent? In brief, an embedded agent.
- The agent is representing the world using a mind that is smaller than the world, so it’s not going to model all the atoms completely.
- It doesn’t have clean I/O with the world, just various sensory data that is probably heavily filtered and aggregated, comes from a certain viewpoint, etc.
- And, while it’s not very cruxy at the moment, the agent also needs to model other agents in the world and communicate with them.
^[4] Like this comment, for example, in which Eliezer is concerned that an AI might not share any reflective concepts with humans at all, or this post, in which Charlie Steiner is concerned that concepts for human values will be too numerous.
