Humans provide an untapped wealth of evidence about alignment

This post has been recorded as part of the LessWrong Curated Podcast, and can be listened to on Spotify, Apple Podcasts, and Libsyn.

TL;DR: To even consciously consider an alignment research direction, you should have evidence to locate it as a promising lead. As best I can tell, many directions seem interesting but do not have strong evidence of being “entangled” with the alignment problem such that I expect them to yield significant insights.

For example, “we can solve an easier version of the alignment problem by first figuring out how to build an AI which maximizes the number of real-world diamonds” has intuitive appeal and plausibility, but this claim doesn’t have to be true and this problem does not necessarily have a natural, compact solution. In contrast, there do in fact exist humans who care about diamonds. Therefore, there are guaranteed-to-exist alignment insights concerning the way people come to care about e.g. real-world diamonds.

“Consider how humans navigate the alignment subproblem you’re worried about” is a habit which I (TurnTrout) picked up from Quintin Pope. I wrote the post, he originated the tactic.

A simplified but still very difficult open problem in AI alignment is to state an unbounded program implementing a diamond maximizer that will turn as much of the physical universe into diamond as possible. The goal of “making diamonds” was chosen to have a crisp-seeming definition for our universe (the amount of diamond is the number of carbon atoms covalently bound to four other carbon atoms). If we can crisply define exactly what a ‘diamond’ is, we can avert issues of trying to convey complex values into the agent.

Ontology identification problem, Arbital
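
To make the quoted definition concrete, here is a minimal Python sketch of the counting rule, assuming we were somehow handed a ground-truth molecular graph. The function name `count_diamond_carbons` and the `elements` / `bonds` input format are hypothetical, chosen only for illustration; nothing like them appears in the Arbital article.

```python
from collections import defaultdict


def count_diamond_carbons(elements, bonds):
    """Count carbon atoms covalently bound to four other carbon atoms.

    elements: dict mapping atom id -> element symbol (hypothetical format).
    bonds: iterable of (atom_id, atom_id) pairs, one per covalent bond.
    """
    # Build an undirected adjacency map from the bond list.
    neighbors = defaultdict(set)
    for a, b in bonds:
        if a != b:  # ignore malformed self-bonds
            neighbors[a].add(b)
            neighbors[b].add(a)

    def carbon_neighbor_count(atom):
        return sum(1 for n in neighbors[atom] if elements.get(n) == "C")

    # An atom counts toward "amount of diamond" only if it is carbon
    # and exactly four of its bonded neighbors are also carbon.
    return sum(
        1
        for atom, element in elements.items()
        if element == "C" and carbon_neighbor_count(atom) == 4
    )


# Toy carbon skeleton: one central carbon bonded to four outer carbons.
skeleton = {0: "C", 1: "C", 2: "C", 3: "C", 4: "C"}
skeleton_bonds = [(0, 1), (0, 2), (0, 3), (0, 4)]

# Methane: one carbon bonded to four hydrogens, so no "diamond" carbons.
methane = {0: "C", 1: "H", 2: "H", 3: "H", 4: "H"}
methane_bonds = [(0, 1), (0, 2), (0, 3), (0, 4)]

print(count_diamond_carbons(skeleton, skeleton_bonds))  # 1
print(count_diamond_carbons(methane, methane_bonds))    # 0
```

The rule itself is a few lines. What the sketch quietly assumes is the hard part: that the agent's world model hands us a labeled atom-and-bond graph to run the rule on.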

I find this problem interesting, both in terms of wanting to know how to solve a reframed version of it, and in terms of what I used to think about the problem. I used to[1] think, “yeah, ‘diamond’ is relatively easy to define. Nice problem relaxation.” It felt like the diamond maximizer problem let us focus on the challenge of making the AI’s values bind to something at all which we actually intended (e.g. diamonds), in a way that’s robust to ontological shifts and that doesn’t collapse into wireheading or tampering with e.g. the sensors used to estimate the number of diamonds.

Although the details are mostly irrelevant to the point of this blog post, the Arbital article suggests some solution ideas and directions for future research, including:

  1. Scan AIXI-tl’s Turing machines and locate diamonds within their implicit state representations.

  2. Given how inaccessible we expect AIXI-tl’s representations to be by default, have AIXI-tl just consider a Turing-complete hypothesis space which uses more interpretable representations.

  3. “Being able to describe, in purely theoretical principle, a prior over epistemic models that have at least two levels and can switch between them in some meaningful sense”

Do you notice anything strange about these three ideas? Sure, the ideas don’t seem workable, but they’re good initial thoughts, right?

The problem isn’t that the ideas aren’t clever enough. Eliezer is pretty dang clever, and these ideas are reasonable stabs given the premise of “get some AIXI variant to maximize diamond instead of reward.”

The problem isn’t that it’s impossible to specify a mind which cares about diamonds. We already know that there are intelligent minds who value diamonds. You might be dating one of them, or you might even be one of them! Clearly, the genome + environment jointly specify certain human beings who end up caring about diamonds.

One problem is this: where is the evidence required to locate these ideas? Why should I even find myself thinking about diamond maximization and AIXI and Turing machines and utility functions in this situation? It’s not that there’s no evidence. For example, utility functions ensure the agent can’t be exploited in some dumb ways. But I think that the supporting evidence is not commensurate with the specificity of these three ideas or with the specificity of the “ontology identification” problem framing.

Here’s an exaggeration of how these ideas feel to me when I read them:

“I lost my phone”, you tell your friend.

They ask, “Have you checked Latitude: -34.44006, Longitude: -64.61333?”

Uneasily, you respond: “Why would I check there?”

Your friend shrugs: “Just seemed promising. And it’s on land, it’s not in the ocean. Don’t worry, I incorporated evidence about where you probably lost it.”

I recently made a similar point about Cooperative Inverse Reinforcement Learning:

Against CIRL as a special case of against quickly jumping into highly specific speculation while ignoring empirical embodiments-of-the-desired-properties.

In the context of “how do we build AIs which help people?”, asking “does CIRL solve corrigibility?” is hilariously unjustified. By what evidence have we located such a specific question? We have assumed there is an achievable “corrigibility”-like property; we have assumed it is good to have in an AI; we have assumed it is good in a similar way as “helping people”; we have elevated CIRL in particular as a formalism worth inquiring after.

But this is not the first question to ask, when considering “sometimes people want to help each other, and it’d be great to build an AI which helps us in some way.” Much better to start with existing generally intelligent systems (humans) which already sometimes act in the way you want (they help each other) and ask after the guaranteed-to-exist reason why this empirical phenomenon happens.

Now, if you are confused about a problem, it can be better to explore some guesses than no guesses—perhaps it’s better to think about Turing machines than to stare helplessly at the wall (but perhaps not). Your best guess may be wrong (e.g. write a utility function which scans Turing machines for atomic representations of diamonds), but you sometimes still learn something by spelling out the implications of your best guess (e.g. the ontology identifier stops working when AIXI Bayes-updates to non-atomic physical theories). This can be productive, as long as you keep in mind the wrongness of the concrete guess, so as to not become anchored on that guess or on the framing which originated it (e.g. build a diamond maximizer).

However, in this situation, I want to look elsewhere. When I confront a confusing, difficult problem (e.g. how do you create a mind which cares about diamonds?), I often first look at reality (e.g. are there any existing minds which care about diamonds?). Even if I have no idea how to solve the problem, if I can find an existing mind which cares about diamonds, then since that mind is real, that mind has a guaranteed-to-exist causal mechanistic play-by-play origin story for why it cares about diamonds. I thereby anchor my thinking to reality; reality is sturdier than “what if” and “maybe this will work”; many human minds do care about diamonds.

In addition to “there’s a guaranteed causal story for humans valuing diamonds, and not one for AIXI valuing diamonds”, there’s a second benefit to understanding how human values bind to the human’s beliefs about real-world diamonds. This second benefit is practical: I’m pretty sure the way that humans come to care about diamonds has nearly nothing to do with the ways AIXI-tl might be motivated to maximize diamonds. This matters, because I expect that the first AGI’s value formation will be far more mechanistically similar to within-lifetime human value formation, than to AIXI-tl’s value alignment dynamics.

Next, it can be true that the existing minds are too hard for us to understand in ways relevant to alignment. One way this could be true is that human values are a “mess”, that “our brains are kludges slapped together by natural selection.” If human value formation were sufficiently complex, with sufficiently many load-bearing parts such that each part drastically affects human alignment properties, then we might instead want to design simpler human-comprehensible agents and study their alignment properties.

While I think that human values are complex, I think the evidence for human value formation’s essential complexity is surprisingly weak, all things reconsidered in light of modern, post-deep learning understanding. Still… maybe humans are too hard to understand in alignment-relevant ways!

But, I mean, come on. Imagine an alien[2] visited and told you:

Oh yeah, the AI alignment problem. We knocked that one out a while back. Information inaccessibility of the learned world model? No, I’m pretty sure we didn’t solve that, but we didn’t have to. We built this protein computer and trained it with, I forget actually, was it just what you would call “deep reinforcement learning”? Hm. Maybe it was more complicated, maybe not, I wasn’t involved.

We might have hardcoded relatively crude reward signals that are basically defined over sensory observables, like a circuit which activates when their sensors detect a certain kind of carbohydrate. Scanning you, it looks like some of the protein computers ended up with your values, even. Small universe, huh?

Actually, I forgot how we did it, sorry. And I can’t make guarantees that our approach scales beyond your intelligence level or across architectures, but maybe it does. I have to go, but here are a few billion of the trained protein computers if you want to check them out!

Ignoring the weird implications of the aliens existing and talking to you like this, and considering only the alignment implications: the absolute top priority of many alignment researchers should be figuring out how the hell the aliens got as far as they did.[3] Whether or not you know if their approach scales to further intelligence levels, whether or not their approach seems easy to understand, you have learned that these computers are physically possible, practically trainable entities. These computers have definite existence and guaranteed explanations. Next to these actually existent computers, speculation like “maybe attainable utility preservation leads to cautious behavior in AGIs” is dreamlike, unfounded, and untethered.

If it turns out to be currently too hard to understand the aligned protein computers, then I want to keep coming back to the problem with each major new insight I gain. When I learned about scaling laws, I should have rethought my picture of human value formation. Did the new insight knock anything loose? I should have checked back in when I heard about mesa optimizers, about the Bitter Lesson, about the feature universality hypothesis for neural networks, about natural abstractions.

Because, given my life’s present ambition (solve AI alignment), that’s what it makes sense for me to do—at each major new insight, to reconsider my models[4] of the single known empirical example of general intelligences with values, to scour the Earth for every possible scrap of evidence that humans provide about alignment. We may not get much time with human-level AI before we get to superhuman AI. But we get plenty of time with human-level humans, and we get plenty of time being a human-level intelligence.

The way I presently see it, the godshatter of human values—the rainbow of desires, from friendship to food—is only unpredictable relative to a class of hypotheses which fail to predict the shattering.[5] But confusion is in the map, not the territory. I do not consider human values to be “unpredictable” or “weird”, I do not view them as a “hack” or a “kludge.” Human value formation may or may not be messy (although I presently think not). Either way, human values are, of course, part of our lawful reality. Human values are reliably produced by within-lifetime processes within the brain. This has an explanation, though I may be ignorant of it. Humans usually bind their values to certain objects in reality, like dogs. This, too, has an explanation.

And, to be clear, I don’t want to black-box outside-view extrapolate from the “human datapoint”; I don’t want to focus on thoughts like “Since alignment ‘works well’ for dogs and people, maybe it will work well for slightly superhuman entities.” I aspire for the kind of alignment mastery which lets me build a diamond-producing AI, or if that didn’t suit my fancy, I’d turn around and tweak the process and the AI would press green buttons forever instead, or—if I were playing for real—I’d align that system of mere circuitry with humane purposes.

For that ambition, the inner workings of those generally intelligent apes are invaluable evidence about the mechanistic within-lifetime process by which those apes form their values, and, more generally, about how intelligent minds can form values at all. What factors matter for the learned values, what factors don’t, and what we should do for AI. Maybe humans have special inductive biases or architectural features, and without those, they’d grow totally different kinds of values. But if that were true, wouldn’t that be important to know?

If I knew how to interpret the available evidence, I probably would understand how I came to weakly care about diamonds, and what factors were important to that process (which reward circuitry had to fire at which frequencies, what concepts I had to have learned in order to grow a value around “diamonds”, how precisely activated the reward circuitry had to be in order for me to end up caring about diamonds).

Humans provide huge amounts of evidence, properly interpreted—and therein lies the grand challenge upon which I am presently fixated. In an upcoming post, I’ll discuss one particularly rich vein of evidence provided by humans.

Thanks to Logan Smith and Charles Foster for feedback. Spiritually related to but technically distinct from The First Sample Gives the Most Information.

EDIT: In this post, I wrote about the Arbital article’s unsupported jump from “Build an AI which cares about a simple object like diamonds” to “Let’s think about ontology identification for AIXI-tl.” The point is not that there is no valid reason to consider the latter, but that the jump, as written, seemed evidence-starved. For separate reasons, I currently think that ontology identification is unattractive in some ways, but this post isn’t meant to argue against that framing in general. The main point of the post is that humans provide tons of evidence about alignment, by virtue of containing guaranteed-to-exist mechanisms which produce e.g. their values around diamonds.

Appendix: One time I didn’t look for the human mechanism

Back in 2018, I had a clever-seeming idea. We don’t know how to build an aligned AI; we want multiple tries; it would be great if we could build an AI which “knows it may have been incorrectly designed”; so why not have the AI simulate its probable design environment over many misspecifications, and then avoid plans which tend to be horrible for most initial conditions? While I drew some inspiration from how I would want to reason in the AI’s place, I ultimately did not think thoughts like:

We know of a single group of intelligent minds who have ever wanted to be corrigible and helpful to each other. I wonder how that, in fact, happens?

Instead, I was trying out clever, off-the-cuff ideas in order to solve e.g. Eliezer’s formulation of the hard problem of corrigibility. However, my idea and his formulation suffered a few disadvantages, including:

  1. The formulation is not guaranteed to describe a probable or “natural” kind of mind,

  2. These kinds of “corrigible” AIs are not guaranteed to produce desirable behavior, but only imagined to produce good behavior,

  3. My clever-seeming idea was not at all constrained by reality to actually work in practice, as opposed to just sounding clever to me, and

  4. I didn’t have a concrete use case in mind for what to do with a “corrigible” AI.

I wrote this post as someone who previously needed to read it.

  1. ^

    I now think that diamond’s physically crisp definition is a red herring. More on that in future posts.

  2. ^

    This alien is written to communicate my current belief state about how human value formation works, so as to make it clear why, given my beliefs, this value formation process is so obviously important to understand.

  3. ^

    There is an additional implication present in the alien story, but not present in the evolutionary production of humans. The aliens are implied to have purposefully aligned some of their protein computers with human values, while evolution is not similarly “purposeful.” This implication is noncentral to the key point, which is that the human-values-having protein computers exist in reality.

  4. ^

    Well, I didn’t even have a detailed picture of human value formation back in 2021. I thought humans were hopelessly dumb and messy and we want a nice clean AI which actually is robustly aligned.

  5. ^

    Suppose we model humans as the “inner agent” and evolution as the “outer optimizer”—I think this is, in general, the wrong framing, but let’s roll with it for now. I would guess that Eliezer believes that human values are an unpredictable godshatter with respect to the outer criterion of inclusive genetic fitness. This means that if you reroll evolution many times with perturbed initial conditions, you get inner agents with dramatically different values each time—it means that human values are akin to a raindrop which happened to land in some location for no grand reason. I notice that I have medium-strength objections to this claim, but let’s just say that he is correct for now.

    I think this unpredictability-to-evolution doesn’t matter. We aren’t going to reroll evolution to get AGI. Thus, for a variety of reasons too expansive for this margin, I am little moved by analogy-based reasoning along the lines of “here’s the one time inner alignment was tried in reality, and evolution failed horribly.” I think that historical fact is mostly irrelevant, for reasons I will discuss later.