A Test Suite for Concepts
Lately I’ve been spinning up on natural abstractions, and in particular on John Wentworth’s work on natural latents. As I’ve been studying, I’ve noticed some big gaps in the existing literature. Some of my biggest questions have not been answered by existing blog posts and writeups.
One of my grumps about the existing body of work has to do with the typology of concepts, and the representative examples we’re using for that typology.
If we’re going to do a lot of work to talk about concepts using math, I’m going to want to work a bunch of concrete examples to some level of precision. So far I’m not happy with the list of examples, and I’m not happy with the level of hand-waving in tying the math back to the various kinds of examples.
It seems to me that there are a lot of different kinds of concepts. Some concepts are “more abstract” than others – or to put it another way, some concepts map back very clearly to the physics of our universe, while others seem more fuzzy, hard to pin down, and maybe not “natural” at all. Some concepts are big clusters containing lots of varying examples; some attempt to capture one instance of a thing. Some concepts have to do with relationships between other concepts. Some concepts are reflective. And so on.
I think it would be a mistake to try to build a full concept typology at this point. Ideally you want the structure of the environment you’re modeling to dictate the concept typology, not the other way around. That said, I do long to have a set of example concepts to draw from as I work through some of my questions about the natural latents math – and for that set to span a bunch of different types of concepts. So I’ve cheated and used my own experience as an agent thinking about concepts to guess at some important and interesting concept types.
In this post I’ll give some probably-familiar background about what we mean by concepts, and then I’ll gesture vaguely in the direction of what we need in our concept typology.
Concepts that Bind to Reality
This section is a brief foundational primer; there’s nothing new here. Readers already familiar with the existing literature on natural abstractions can skip to the next section.
Here we have two dudes looking at, and thinking about, a tree. (One of the dudes happens to be a human and one happens to be a robot.)

We want to know:
Do they each think about “the tree” as, like, a thing?
If they try to talk to each other about the tree, will that work? Will they be talking about the same thing?
Basically – do they pluck out the same concepts from their environment? How? Why? How reliable is that? What are the preconditions? etc.
Why do we care?
We care because:
We hope to understand how AIs work.
How do they represent and manipulate concepts, including fairly sophisticated concepts?
What are they thinking about at any given time?
This is a fairly deep version of mechanistic interpretability. Done well, it would go way beyond locating the Eiffel Tower neuron in a neural net and let us capture much more complicated thought patterns.
We hope to communicate effectively with AIs.
Now let’s define the terms in the section title.
“Reality”
By “reality” we mostly mean physics things. States of matter in the universe. Or at least, that’s where we start.
“Concepts”
By “concepts” we mean ideas that live in the mind of an agent (human, alien, AI).[3]
We are (almost) never specifying a state of matter very precisely, so concepts are (almost) always higher level, more abstract or categorical than that. Do you (think you) know approximately what a “dog” is, or could you at least pick one out of a lineup of various mammals? Then you have a “dog” concept.
There are some kinds of concepts we’re not talking about, at least not yet. We’re punting on parts of speech other than nouns and noun phrases, because nouns alone are going to be plenty of work. We’re also not going to get into concepts that don’t bind to physics-reality at all but are still interesting – for example, we won’t talk about mathematical concepts from group theory.
“Bind to”
By “bind to,” we mean creating reliable and consistent mappings between concepts and reality.
We don’t need to bind incredibly precisely – having some sloppiness here is, in one sense, the point; abstraction necessarily involves loss of precision.
And we expect different minds to have different concepts for lots of reasons. The most obvious is that they may have been exposed to different environments. But holding the environment constant, they may have had different experiences of that environment, different sensory apparatus, or they may just be bad at reasoning, inference, or generalization. Most agents are not ideal!
When we say that a concept binds to reality, we’re claiming that the agent can derive solid predictive power from that concept. Their idea of a “tree” captures some fundamental tree-ness that allows them to recognize other examples of trees and make correct predictions about the properties of those new trees.
We’re also saying that the agent has gone beyond memorizing multiple individual examples and has generalized: it has captured some structure in the environment and encoded it. Generalization and compression are two sides of the same coin; the agent represents its idea of a “tree” using a compact structure rather than a full readout of every tree it has ever seen, while retaining that predictive power.
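To make the compression point concrete, here is a toy sketch in Python. Everything in it is an assumption made up for illustration: the "trees" are just heights drawn from a Gaussian, and the "concept" is a two-number summary (mean and standard deviation). It is not how natural latents are constructed; it only shows a compact structure standing in for a full readout while still predicting new examples reasonably well.

```python
import random
import statistics

random.seed(0)

# Hypothetical data: heights (in metres) of 1000 trees the agent has seen.
# The Gaussian and its parameters are invented for this sketch.
seen_heights = [random.gauss(20.0, 3.0) for _ in range(1000)]

# "Memorization": keep a full readout of every tree ever seen.
memorized = list(seen_heights)

# "Concept": keep only a compact two-number summary of the same data.
tree_concept = (statistics.mean(seen_heights), statistics.stdev(seen_heights))


def predict_height(concept):
    """Predict the height of an unseen tree from the compact summary."""
    mean, _spread = concept
    return mean


# Check predictive power on trees the agent has never seen.
new_trees = [random.gauss(20.0, 3.0) for _ in range(200)]
errors = [abs(h - predict_height(tree_concept)) for h in new_trees]
mean_abs_error = statistics.mean(errors)

print(f"stored numbers: memorized={len(memorized)}, concept={len(tree_concept)}")
print(f"mean absolute error on new trees: {mean_abs_error:.2f} m")
```

The summary stores two numbers instead of a thousand, and its typical prediction error on new trees is on the order of the spread of the heights themselves, which is about the best any predictor could do here without more information about each tree.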
The case for building a half-assed concept typology with representative examples
In their work on natural latents, John and David use a few examples repeatedly. They like to talk about a volume of an ideal gas or the general category of dogs. Sometimes they talk about teacups, biased coins, or Ising models. They like trees. (I guess I like trees too. I opened with a tree example.)
They very rarely talk about anything super abstract and fuzzy, like “friendship” or “loyalty” or “beauty” or “goodness.” And yet, a lot of the discourse[4] is about these sorts of fuzzy human values, the sorts of things that might end up in a human’s CEV, and be relevant to broader alignment questions.
I’d like to do a little bit – but not a lot – better with concept typology. I’m not looking to reinvent the field of semantics from scratch as a side gig, nor do I want to be so fiddly with this that I end up trying to choose the ontology myself; that never works. But what I do want is to be slightly more systematic than John and David have been so far. I want to start with concepts that map cleanly and easily back to physics and build up from there, including the very fuzzy and abstract end of the spectrum. I want to do a better job with the category vs. instance distinction. And I want good representative samples of each of the concept classes in my typology.
More importantly, when we get around to actually constructing (possibly-natural) latents for these concepts, I’d like to do that a little more slowly and carefully, with moderately less handwaving.
I want to do this mostly to prove to myself that I can, that I’ve actually understood how all of this machinery is supposed to work.
And as we build out new bits of machinery to work with (natural) latents, I want to have a sort of test suite of examples to run through, to make sure everything works, kind of like unit tests in software.
My initial brainstorm of example concepts
Here’s what I’ve got so far.
Well, first off, there’s a wide world of parts of speech. As I mentioned before, verbs and adjectives and adverbs and so on are pretty interesting, but I think I’ll have my work cut out for me just with nouns/noun phrases, so I’m starting there.
There’s also a wide world of relationships between concepts, like time, causality, locality, and so on. I’m ignoring all that for now too.
Within nouns, I’m definitely interested in objects – categories of objects and also specific individual objects. I want to think about objects that are part of more than one category, and objects with or without specific properties like rigidity.
I’m interested in biological entities with at least some agentic properties. Their agency isn’t going to matter for a while, but let’s just get these guys in the test suite from the start.
And yes, I want to spend at least a little time on very abstract concepts, perhaps ones dealing with how agentic beings interact with each other.
So that led me to the following list for my nascent concept typology test suite:
objects (categorical and individual)
approximately rigid-body
the category of balls (as in, round objects good for throwing)
a specific ball affectionately known as Bluey
the category of oranges (the fruit)
an enclosed volume of ideal gas
agentic beings (categorical and individual)
dogs in general
my (fictitious) beloved specific dog Fido
concepts with much less direct relationships to states of matter
monogamy
consciousness
This will probably not be enough! Not even close! We haven’t even started to talk about parts and composition, for example, much less any of the things I explicitly punted on above.
But it’s a start, and these are the examples I’ll come back to in my next few posts about concepts.
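Since the framing is "unit tests for concepts," one possible way to hold the list above as data is sketched below in Python. The names `ConceptCase`, `TEST_SUITE`, and `run_suite` are hypothetical, as is the `family`/`kind` labeling, which is just one reading of the list (entries the list does not label as category or instance are marked "unspecified"). The checks passed to `run_suite` are placeholders for whatever piece of latent machinery is actually being exercised.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class ConceptCase:
    name: str
    family: str  # "object", "agentic being", or "abstract"
    kind: str    # "category", "instance", or "unspecified"


# The example concepts from the list above, in the same order.
TEST_SUITE: List[ConceptCase] = [
    ConceptCase("balls", "object", "category"),
    ConceptCase("Bluey", "object", "instance"),
    ConceptCase("oranges", "object", "category"),
    ConceptCase("enclosed volume of ideal gas", "object", "unspecified"),
    ConceptCase("dogs", "agentic being", "category"),
    ConceptCase("Fido", "agentic being", "instance"),
    ConceptCase("monogamy", "abstract", "unspecified"),
    ConceptCase("consciousness", "abstract", "unspecified"),
]


def run_suite(check: Callable[[ConceptCase], bool]) -> List[str]:
    """Run `check` on every concept and return the names that fail."""
    return [case.name for case in TEST_SUITE if not check(case)]


# Placeholder check: passes everything. A real check would ask something
# like "can we construct a latent for this concept?"
print(run_suite(lambda case: True))

# A check that only handles categories fails on the individual instances.
print(run_suite(lambda case: case.kind != "instance"))
```

The point of the structure is only that every new piece of machinery gets run against the whole list, so that a method which quietly works for categories but not for individuals (or for objects but not for abstract concepts) shows up as a named failure.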
Your nominations for additions to my list will be considered, and frankly, probably discarded, because wow there’s a lot of work to do as it is. But please go ahead and make ’em anyway, if you like.
[1]
This is one aspect of the Pointers Problem: “Some of the things I value may not actually exist—I may simply be wrong about which high-level things inhabit our world.”
[2]
See also: Interoperable Semantics.
[3]
What kind of agent? In brief, an embedded agent.
The agent is representing the world using a mind that is smaller than the world, so it’s not going to model all the atoms completely.
It doesn’t have clean I/O with the world, just various sensory data that is probably heavily filtered and aggregated, comes from a certain viewpoint, etc.
And, while it’s not very cruxy at the moment, the agent also needs to model other agents in the world and communicate with them.
[4]
Like this comment, for example, in which Eliezer is concerned that an AI might not share any reflective concepts with humans at all, or this post, in which Charlie Steiner is concerned that concepts for human values will be too numerous.
Me: “I finished my post!”
Eliezer: “What’s it about?”
Me: explains
Eliezer: “Oh, I’ve got some more concepts for you!”
Me: “Oh no.”
Eliezer: “The inside of a brick (from the Feynman story), incorrect meta-ethics, medianworld, threat, leftism, miracle, eugenics, pornography, sin, Chaotic Neutral, paradox, unicorn, magic, the least non-interesting number, …”
Me: “Please stop?”
Eliezer, grinning: “Non-nouns!”
Me: “Ahhhhh”
The world is complicated.
In the Platonic Representation Hypothesis, this actually makes identifying the homeomorphism between two approximations to it easier.
I propose “honesty”. Justification:
It just seems fundamental to lots of alignment work (deception, Claude being honest according to its constitution, also it’s one H in HHH)
It’s genuinely unclear to me whether “honesty” makes any sense from the POV of a goal-directed agent, especially a superintelligent one.
Example: consider ant traps. They make ants think they carry home tasty nutrients, while in fact they’re carrying poison, so the ants are deceived. Would we say that a human setting up a trap is “dishonest”?
I’d say—not really, because honesty happens in communication between agents, and we don’t consider setting up a trap as an “act of communication”.
But why do we consider this to not be communication? Probably because we don’t think of ants as “agents we communicate with”.
OK but why are ants not agents we communicate with? And why would a superintelligent AI treat humans differently?
So. I’m worried that if e.g. honesty makes sense only between agents-on-similar-level-who-trade-with-each-other, then all our efforts to make AIs honest and not deceptive are useless.
Slightly varied example: is laying ambushes for enemy humans dishonest, during war?
It’s certainly deceptive. But I feel hesitant to lump it and “normal dishonesty” together, because I think there is some qualitative difference between degrading the commons and Winning At Conflict.
It’s dishonest (and quite bad) to wave a white flag of surrender and then lure people into a trap (compared to leaking bad information to a spy to lure enemies into a trap), because Surrender is a mode of communication that both enemies agree is good to have open.
Honesty is in the same category as “trouble”, “roadblock”, “pest”, “flourishing”, “contamination”, etc. (and probably all human concepts to some nonzero extent), wherein a normative judgment of whether the thing is good or bad is effectively wrapped up into the definition.
In the case at hand, honesty has a connotation of the behavior being good, and dishonesty has a connotation of the behavior being bad, as those terms are used in practice. And laying-ambushes-during-war is at least sometimes good. So laying-ambushes-during-war is at most an edge case of dishonesty, for that reason alone, even if it matched dishonesty in every other respect.
See my discussion in Valence series §3.4.
I guess some emotive conjugations here might be: “I’m strategically deceptive, you’re dishonest”, and “I’m honest, you’re technically not exactly lying but…”.
I find ARC tasks are a good way to generate lots of simple, concrete example concepts.
Concepts where (mostly) everyone agrees that not everyone agrees about whether the thing exists or not, but if it exists, it probably has agency, like divinities.
Orders that people seem to agree exist, even though it’s unclear how to judge them, like “the best band in the world”
Objects that when imprecisely called can be either rigid body or ideal gas, like “water”
Studies show that humans have shape bias compared to some other animals (e.g. dogs). Dogs tend to categorize more based on size and texture. I’m not sure what this tells us. Maybe it just points to useful concepts being goal-dependent?
It may be interesting to compare languages and look for concepts only present in some. The examples I could find seem to be from categories other than nouns (e.g. left/right, specific colors, numbers, time).
Yeah, there’s no doubt in my mind that the concept representation, concept relationships/nesting, etc. are going to have a lot to do with the agent’s actual exposure to the environment (both due to what lived experience they have and what sensory organs they have) and also with the relevance of parts of the environment to their causal model/goals/utility function. So different agents will end up with different concepts, differently fleshed out, because of all of those things.
AND YET we still hope that some concepts are just so useful that ~all agents will pick them up.
Luckily this is an empirical claim that we will eventually figure out how to test.
(Your initial brainstorm reminded me of John Nerst’s Big List of Existing Things post.)
This sounds very interesting. I would like to add the concept of chair to the list. Chairs are pretty physically varied: rocking chairs, stools, reclining chairs, office chairs, airplane seats…
I have a worry that “chair” isn’t actually a natural abstraction, that the thing that unifies these categories is that they are suitable for seating a single person, which is a different kind of unification than dog or ideal gas.