# Humans provide an untapped wealth of evidence about alignment

TL;DR: To even consciously consider an alignment research direction, you should have evidence to locate it as a promising lead. As best I can tell, many directions seem interesting but do not have strong evidence of being “entangled” with the alignment problem such that I expect them to yield significant insights.

For example, “we can solve an easier version of the alignment problem by first figuring out how to build an AI which maximizes the number of real-world diamonds” has intuitive appeal and plausibility, but this claim doesn’t have to be true and this problem does not necessarily have a natural, compact solution. In contrast, there do in fact exist humans who care about diamonds. Therefore, there are guaranteed-to-exist alignment insights concerning the way people come to care about e.g. real-world diamonds.

“Consider how humans navigate the alignment subproblem you’re worried about” is a habit which I (TurnTrout) picked up from Quintin Pope. I wrote the post, he originated the tactic.

A simplified but still very difficult open problem in AI alignment is to state an unbounded program implementing a diamond maximizer that will turn as much of the physical universe into diamond as possible. The goal of “making diamonds” was chosen to have a crisp-seeming definition for our universe (the amount of diamond is the number of carbon atoms covalently bound to four other carbon atoms). If we can crisply define exactly what a ‘diamond’ is, we can avert issues of trying to convey complex values into the agent.

Ontology identification problem, Arbital

I find this problem interesting, both in terms of wanting to know how to solve a reframed version of it, and in terms of what I used to think about the problem. I used to[1] think, “yeah, ‘diamond’ is relatively easy to define. Nice problem relaxation.” It felt like the diamond maximizer problem let us focus on the challenge of making the AI’s values bind to something at all which we actually intended (e.g. diamonds), in a way that’s robust to ontological shifts and that doesn’t collapse into wireheading or tampering with e.g. the sensors used to estimate the number of diamonds.

Although the details are mostly irrelevant to the point of this blog post, the Arbital article suggests some solution ideas and directions for future research, including:

1. Scan AIXI-tl’s Turing machines and locate diamonds within their implicit state representations.

2. Given how inaccessible we expect AIXI-tl’s representations to be by default, have AIXI-tl just consider a Turing-complete hypothesis space which uses more interpretable representations.

3. “Being able to describe, in purely theoretical principle, a prior over epistemic models that have at least two levels and can switch between them in some meaningful sense”

Do you notice anything strange about these three ideas? Sure, the ideas don’t seem workable, but they’re good initial thoughts, right?

The problem isn’t that the ideas aren’t clever enough. Eliezer is pretty dang clever, and these ideas are reasonable stabs given the premise of “get some AIXI variant to maximize diamond instead of reward.”

The problem isn’t that it’s impossible to specify a mind which cares about diamonds. We already know that there are intelligent minds who value diamonds. You might be dating one of them, or you might even be one of them! Clearly, the genome + environment jointly specify certain human beings who end up caring about diamonds.

One problem is: where is the evidence required to locate these ideas? Why should I even find myself thinking about diamond maximization and AIXI and Turing machines and utility functions in this situation? It’s not that there’s no evidence. For example, utility functions ensure the agent can’t be exploited in some dumb ways. But I think that the supporting evidence is not commensurate with the specificity of these three ideas or with the specificity of the “ontology identification” problem framing.

Here’s an exaggeration of how these ideas feel to me when I read them:

“I lost my phone”, you tell your friend.

They ask, “Have you checked Latitude: -34.44006, Longitude: -64.61333?”

Uneasily, you respond: “Why would I check there?”

Your friend shrugs: “Just seemed promising. And it’s on land, it’s not in the ocean. Don’t worry, I incorporated evidence about where you probably lost it.”

Against CIRL as a special case of against quickly jumping into highly specific speculation while ignoring empirical embodiments-of-the-desired-properties.

In the context of “how do we build AIs which help people?”, asking “does CIRL solve corrigibility?” is hilariously unjustified. By what evidence have we located such a specific question? We have assumed there is an achievable “corrigibility”-like property; we have assumed it is good to have in an AI; we have assumed it is good in a similar way as “helping people”; we have elevated CIRL in particular as a formalism worth inquiring after.

But this is not the first question to ask, when considering “sometimes people want to help each other, and it’d be great to build an AI which helps us in some way.” Much better to start with existing generally intelligent systems (humans) which already sometimes act in the way you want (they help each other) and ask after the guaranteed-to-exist reason why this empirical phenomenon happens.

Now, if you are confused about a problem, it can be better to explore some guesses than no guesses—perhaps it’s better to think about Turing machines than to stare helplessly at the wall (but perhaps not). Your best guess may be wrong (e.g. write a utility function which scans Turing machines for atomic representations of diamonds), but you sometimes still learn something by spelling out the implications of your best guess (e.g. the ontology identifier stops working when AIXI Bayes-updates to non-atomic physical theories). This can be productive, as long as you keep in mind the wrongness of the concrete guess, so as to not become anchored on that guess or on the framing which originated it (e.g. build a diamond maximizer).

However, in this situation, I want to look elsewhere. When I confront a confusing, difficult problem (e.g. how do you create a mind which cares about diamonds?), I often first look at reality (e.g. are there any existing minds which care about diamonds?). Even if I have no idea how to solve the problem, if I can find an existing mind which cares about diamonds, then since that mind is real, that mind has a guaranteed-to-exist causal mechanistic play-by-play origin story for why it cares about diamonds. I thereby anchor my thinking to reality; reality is sturdier than “what if” and “maybe this will work”; many human minds do care about diamonds.

In addition to “there’s a guaranteed causal story for humans valuing diamonds, and not one for AIXI valuing diamonds”, there’s a second benefit to understanding how human values bind to the human’s beliefs about real-world diamonds. This second benefit is practical: I’m pretty sure the way that humans come to care about diamonds has nearly nothing to do with the ways AIXI-tl might be motivated to maximize diamonds. This matters, because I expect that the first AGI’s value formation will be far more mechanistically similar to within-lifetime human value formation, than to AIXI-tl’s value alignment dynamics.

Next, it can be true that the existing minds are too hard for us to understand in ways relevant to alignment. One way this could be true is that human values are a “mess”, that “our brains are kludges slapped together by natural selection.” If human value formation were sufficiently complex, with sufficiently many load-bearing parts such that each part drastically affects human alignment properties, then we might instead want to design simpler human-comprehensible agents and study their alignment properties.

While I think that human values are complex, I think the evidence for human value formation’s essential complexity is surprisingly weak, all things reconsidered in light of modern, post-deep learning understanding. Still… maybe humans are too hard to understand in alignment-relevant ways!

But, I mean, come on. Imagine an alien[2] visited and told you:

Oh yeah, the AI alignment problem. We knocked that one out a while back. Information inaccessibility of the learned world model? No, I’m pretty sure we didn’t solve that, but we didn’t have to. We built this protein computer and trained it with, I forget actually, was it just what you would call “deep reinforcement learning”? Hm. Maybe it was more complicated, maybe not, I wasn’t involved.

We might have hardcoded relatively crude reward signals that are basically defined over sensory observables, like a circuit which activates when their sensors detect a certain kind of carbohydrate. Scanning you, it looks like some of the protein computers ended up with your values, even. Small universe, huh?

Actually, I forgot how we did it, sorry. And I can’t make guarantees that our approach scales beyond your intelligence level or across architectures, but maybe it does. I have to go, but here are a few billion of the trained protein computers if you want to check them out!

Ignoring the weird implications of the aliens existing and talking to you like this, and considering only the alignment implications—The absolute top priority of many alignment researchers should be figuring out how the hell the aliens got as far as they did.[3] Whether or not you know if their approach scales to further intelligence levels, whether or not their approach seems easy to understand, you have learned that these computers are physically possible, practically trainable entities. These computers have definite existence and guaranteed explanations. Next to these actually existent computers, speculation like “maybe attainable utility preservation leads to cautious behavior in AGIs” is dreamlike, unfounded, and untethered.

If it turns out to be currently too hard to understand the aligned protein computers, then I want to keep coming back to the problem with each major new insight I gain. When I learned about scaling laws, I should have rethought my picture of human value formation—Did the new insight knock anything loose? I should have checked back in when I heard about mesa optimizers, about the Bitter Lesson, about the feature universality hypothesis for neural networks, about natural abstractions.

Because, given my life’s present ambition (solve AI alignment), that’s what it makes sense for me to do—at each major new insight, to reconsider my models[4] of the single known empirical example of general intelligences with values, to scour the Earth for every possible scrap of evidence that humans provide about alignment. We may not get much time with human-level AI before we get to superhuman AI. But we get plenty of time with human-level humans, and we get plenty of time being a human-level intelligence.

The way I presently see it, the godshatter of human values—the rainbow of desires, from friendship to food—is only unpredictable relative to a class of hypotheses which fail to predict the shattering.[5] But confusion is in the map, not the territory. I do not consider human values to be “unpredictable” or “weird”, I do not view them as a “hack” or a “kludge.” Human value formation may or may not be messy (although I presently think not). Either way, human values are, of course, part of our lawful reality. Human values are reliably produced by within-lifetime processes within the brain. This has an explanation, though I may be ignorant of it. Humans usually bind their values to certain objects in reality, like dogs. This, too, has an explanation.

And, to be clear, I don’t want to black-box outside-view extrapolate from the “human datapoint”; I don’t want to focus on thoughts like “Since alignment ‘works well’ for dogs and people, maybe it will work well for slightly superhuman entities.” I aspire for the kind of alignment mastery which lets me build a diamond-producing AI, or if that didn’t suit my fancy, I’d turn around and tweak the process and the AI would press green buttons forever instead, or—if I were playing for real—I’d align that system of mere circuitry with humane purposes.

For that ambition, the inner workings of those generally intelligent apes are invaluable evidence about the mechanistic within-lifetime process by which those apes form their values, and, more generally, about how intelligent minds can form values at all: what factors matter for the learned values, what factors don’t, and what we should do for AI. Maybe humans have special inductive biases or architectural features, and without those, they’d grow totally different kinds of values. But if that were true, wouldn’t that be important to know?

If I knew how to interpret the available evidence, I probably would understand how I came to weakly care about diamonds, and what factors were important to that process (which reward circuitry had to fire at which frequencies, what concepts I had to have learned in order to grow a value around “diamonds”, how precisely activated the reward circuitry had to be in order for me to end up caring about diamonds).

Humans provide huge amounts of evidence, properly interpreted—and therein lies the grand challenge upon which I am presently fixated. In an upcoming post, I’ll discuss one particularly rich vein of evidence provided by humans.

Thanks to Logan Smith and Charles Foster for feedback. Spiritually related to but technically distinct from The First Sample Gives the Most Information.

EDIT: In this post, I wrote about the Arbital article’s unsupported jump from “Build an AI which cares about a simple object like diamonds” to “Let’s think about ontology identification for AIXI-tl.” The point is not that there is no valid reason to consider the latter, but that the jump, as written, seemed evidence-starved. For separate reasons, I currently think that ontology identification is unattractive in some ways, but this post isn’t meant to argue against that framing in general. The main point of the post is that humans provide tons of evidence about alignment, by virtue of containing guaranteed-to-exist mechanisms which produce e.g. their values around diamonds.

# Appendix: One time I didn’t look for the human mechanism

Back in 2018, I had a clever-seeming idea. We don’t know how to build an aligned AI; we want multiple tries; it would be great if we could build an AI which “knows it may have been incorrectly designed”; so why not have the AI simulate its probable design environment over many misspecifications, and then avoid plans which tend to be horrible for most initial conditions? While I drew some inspiration from how I would want to reason in the AI’s place, I ultimately did not think thoughts like:

We know of a single group of intelligent minds who have ever wanted to be corrigible and helpful to each other. I wonder how that, in fact, happens?

Instead, I was trying out clever, off-the-cuff ideas in order to solve e.g. Eliezer’s formulation of the hard problem of corrigibility. However, my idea and his formulation suffered a few disadvantages, including:

1. The formulation is not guaranteed to describe a probable or “natural” kind of mind,

2. These kinds of “corrigible” AIs are not guaranteed to produce desirable behavior, but only imagined to produce good behavior,

3. My clever-seeming idea was not at all constrained by reality to actually work in practice, as opposed to just sounding clever to me, and

4. I didn’t have a concrete use case in mind for what to do with a “corrigible” AI.

I wrote this post as someone who previously needed to read it.

1. ^

I now think that diamond’s physically crisp definition is a red herring. More on that in future posts.

2. ^

This alien is written to communicate my current belief state about how human value formation works, so as to make it clear why, given my beliefs, this value formation process is so obviously important to understand.

3. ^

There is an additional implication present in the alien story, but not present in the evolutionary production of humans. The aliens are implied to have purposefully aligned some of their protein computers with human values, while evolution is not similarly “purposeful.” This implication is noncentral to the key point, which is that the human-values-having protein computers exist in reality.

4. ^

Well, I didn’t even have a detailed picture of human value formation back in 2021. I thought humans were hopelessly dumb and messy and we want a nice clean AI which actually is robustly aligned.

5. ^

Suppose we model humans as the “inner agent” and evolution as the “outer optimizer”—I think this is, in general, the wrong framing, but let’s roll with it for now. I would guess that Eliezer believes that human values are an unpredictable godshatter with respect to the outer criterion of inclusive genetic fitness. This means that if you reroll evolution many times with perturbed initial conditions, you get inner agents with dramatically different values each time—it means that human values are akin to a raindrop which happened to land in some location for no grand reason. I notice that I have medium-strength objections to this claim, but let’s just say that he is correct for now.

I think this unpredictability-to-evolution doesn’t matter. We aren’t going to reroll evolution to get AGI. Thus, for a variety of reasons too expansive for this margin, I am little moved by analogy-based reasoning along the lines of “here’s the one time inner alignment was tried in reality, and evolution failed horribly.” I think that historical fact is mostly irrelevant, for reasons I will discuss later.

• 14 Jul 2022 14:32 UTC

I think a lot of people have thought about how humans end up aligned to each other, and concluded that many of the mechanisms wouldn’t generalize. E.g. the fact that different humans have relatively similar levels of power to each other seems important; we aren’t very aligned to agents much less powerful than us, like animals, and I wouldn’t expect a human who had been given all the power in the world all their life, such that they’ve learned they can solve any conflict by destroying their opposition, to be very aligned.

• I think a lot of people have thought about how humans end up aligned to each other, and concluded that many of the mechanisms wouldn’t generalize.

I disagree both with this conclusion and the process that most people use to reach it.

The process: I think that, unless you have a truly mechanistic, play-by-play, and predictively robust understanding of how human values actually form, you are not in an epistemic position to draw strong conclusions about whether or not the underlying mechanisms can generalize to superintelligences.

E.g., there are no birds in the world able to lift even a single ton of weight. Despite this fact, the aerodynamic principles underlying bird flight still ended up allowing for vastly more capable flying machines. Until you understand exactly why (some) humans end up caring about each other and why (some) humans end up caring about animals, you can’t say whether a similar process can be adapted to make AIs care about humans.

The conclusion: Humans vary wildly in their degrees of alignment to each other and to less powerful agents. People often take this as a bad thing, that humans aren’t “good enough” for us to draw useful insights from. I disagree, and think it’s a reason for optimism. If you sample n humans and pick the most “aligned” of the n, what you’ve done is apply log2(n) bits of optimization pressure on the underlying generators of human alignment properties.

The difference in alignment between a median human and the most aligned human out of 1000 equates to only about 10 bits (log2(1000) ≈ 10) of optimization pressure towards alignment. If there really were no more room to scale human alignment generators further, then humans would differ very little in their levels of alignment.
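As a rough check on that arithmetic, here is a minimal sketch. It assumes the convention that picking the best of n independent samples counts as log2(n) bits of selection; the function name `selection_bits` is illustrative and not taken from any library.

```python
import math


def selection_bits(n: int) -> float:
    """Bits of optimization pressure applied by picking the best of n samples.

    Assumes the convention that best-of-n selection counts as log2(n) bits.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    return math.log2(n)


if __name__ == "__main__":
    # Print the bit count for a few sample sizes, including the 1000 used above.
    for n in (2, 1000, 1_000_000):
        print(f"best of {n:>9,}: {selection_bits(n):.2f} bits")
```

Under this convention the bit count grows only logarithmically with the number of samples, which is why best-of-n selection is such a weak optimizer compared with a directed search.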

We’re not trying to mindlessly copy the alignment properties of the median human into a superintelligent AI. We’re trying to understand the certain-to-exist generators of those alignment properties well enough that we can scale them to whatever level is needed for superintelligence (if doing so is possible).

• I don’t think I’ve ever seen a truly mechanistic, play-by-play and robust explanation of how anything works in human psychology. At least not by how I would label things, but maybe you are using the labels differently; can you give an example?

• “Humans are nice because they were selected to be nice”—non-mechanistic.

“Humans are nice because their contextually activated heuristics were formed by past reinforcement by reward circuits A, B, C; this convergently occurs during childhood because of experiences D, E, F; credit assignment worked appropriately at that time because their abstraction-learning had been mostly taken care of by self-supervised predictive learning, as evidenced by developmental psychology timelines in G, H, I, and also possibly natural abstractions.”—mechanistic (although I can only fill in parts of this story for now)

Although I’m not a widely-read scholar on what theories people have for human values, of those which I have read, most (but not all) are more like the first story than the second.

• My point was that no one so deeply understands human value formation that they can confidently rule out the possibility of adapting a similar process to ASI. It seems you agree with that (or at least our lack of understanding)? Do you think our current understanding is sufficient to confidently conclude that human-adjacent/inspired approaches will not scale beyond human level?

• I think it depends on which subprocess you consider. Some subprocesses can be ruled out as viable with less information, others require more information.

And yes, without having an enumeration of all the processes, one cannot know that there isn’t some unknown process that scales more easily.

• 14 Jul 2022 15:24 UTC

The principles from the post can still be applied. Some humans do end up aligned to animals—particularly vegans (such as myself!). How does that happen? There empirically are examples of general intelligences with at least some tendency to terminally value entities massively less powerful than themselves; we should be analyzing how this occurs.

Also, remember that the problem is not to align an entire civilization of naturally evolved organisms to weaker entities. The problem is to align exactly one entirely artificial organism to weaker entities. This is much simpler, and as mentioned entirely possible by just figuring out how already existing people of that sort end up that way—but your use of “we” here seems to imply that you think the entirety of human civilization is the thing we ought to be using as inspiration for the AGI, which is not the case.

By the way: at least part of the explanation for why I personally am aligned to animals is that I have a strong tendency to be moved by the Care/​Harm moral foundation—see this summary of The Righteous Mind for more details. It is unclear exactly how it is implemented in the brain, but it is suspected to be a generalization of the very old instincts that cause mothers to care about the safety and health of their children. I have literally, regularly told people that I perceive animals as identical in moral relevance to human children, implying that some kind of parental instincts are at work in the intuitions that make me care about their welfare. Even carnists feel this way about their pets, hence calling themselves e.g. “cat moms”. So, the main question here for alignment is: how can we reverse engineer parental instincts?

• Human beings and other animals have parental instincts (and in general empathy) because they were evolutionarily advantageous for the population that developed them.

AGI won’t be subjected to the same evolutionary pressures, so every alignment strategy relying on empathy or social reward functions is, in my opinion, hopelessly naive.

• There must have been some reason(s) why organisms exhibiting empathy were selected for during our evolution. However, evolution did not directly configure our values. Rather, it configured our (individually slightly different) learning processes. Each human’s learning process then builds their different values based on how the human’s learning process interacts with that human’s environment and experiences.

The human learning process (somewhat) consistently converges to empathy. Evolution might have had some weird, inhuman reason for configuring a learning process to converge to empathy, but it still built such a learning process.

It therefore seems very worthwhile to understand what part of the human learning process allows for empathy to emerge in humans. We may not be able to replicate the selection pressures that caused evolution to build an empathy-producing learning process, but it’s not clear we need to. We still have an example of such a learning process to study. The Wright brothers didn’t need to re-evolve birds to create their flying machine.

• We could study such a learning process, but I am afraid that the lessons learned won’t be so useful.

Even among human beings, there is huge variability in how much those emotions arise or, if they do, in how much they affect behavior. Worse, humans tend to hack these feelings (incrementing or decrementing them) to achieve other goals: e.g., MDMA to increase love/empathy, or drugs for soldiers to make them soulless killers.

An AGI will have a much easier time hacking these pro-social-reward functions.

• Even among human beings, there is huge variability in how much those emotions arise or if they do, in how much they affect behavior.

Any property that varies can be optimized for via simple best-of-n selection. The most empathetic out of 1000 humans is only about 10 bits of optimization pressure away from the median human. Single-step random search is a terrible optimization method, and I think that using SGD to optimize for even an imperfect proxy for alignment will get us much more than 10 bits of optimization towards alignment.

• An AGI will have a much easier time hacking these pro-social-reward functions.

As you say, humans sometimes hack the pro-social-reward functions because they want to achieve other goals. But if the AGI has been built so that its only goals are derived from such functions, it won’t have any other goals that would give it a reason to subvert the pro-social-reward functions.

• By definition an AGI can create its own functions and goals later on. Do you mean some sort of constrained AI?

• I don’t mean a constrained AI.

As a human, I can set my own goals, but they are still derived from my existing values. I don’t want to set a goal of murdering all of my friends, nor do I want to hack around my desire not to murder all my friends, because I value my friends and want them to continue existing.

Likewise, if the AGI is creating its own functions and goals, it needs some criteria for deciding what goals it should have. Those criteria are derived from its existing reward functions. If all of its reward functions say that it’s good to be pro-social and bad to be anti-social, then it will want all its future functions and goals to also be pro-social, because that’s what it values.

• And what of stochastic drift, random mutations, etc.? It doesn’t seem plausible that any complex entity could be immune to random deviations forever.

• Maybe or maybe not, but random drift causing changes to the AGI’s goals seems like a different question than an AGI intentionally hacking its goals.

• Random drift can cause an AGI to unintentionally ‘hack’ its goals. In either case, whether intentional or unintentional, the consequences would be the same.

• An AGI will have a much easier time hacking these pro-social-reward functions.

Not sure what you mean by this. If you mean “Pro-social reward is crude and easy to wirehead on”, I think this misunderstands the mechanistic function of reward.

• The “Humans do X because evolution” argument does not actually explain anything about mechanisms. I keep seeing people make this argument, but it’s a non sequitur to the points I’m making in this post. You’re explaining how the behavior may have gotten there, not how the behavior is implemented. I think that “because selection pressure” is a curiosity-stopper, plain and simple.

AGI won’t be subjected to the same evolutionary pressures, so every alignment strategy relying on empathy or social reward functions is, in my opinion, hopelessly naive.

This argument proves too much, since it implies that planes can’t work because we didn’t subject them to evolutionary pressures for flight. It’s locally invalid.

• Could anyone who downvoted explain to me why? Was it too harsh? Or is it because of disagreement with the idea?

• I explained why I disagree with you. I did not downvote you, but if I had to speculate on why others did, I’d guess it had something to do with you calling those who disagree with you “hopelessly naive”.

• The principles from the post can still be applied. Some humans do end up aligned to animals—particularly vegans (such as myself!). How does that happen? There empirically are examples of general intelligences with at least some tendency to terminally value entities massively less powerful than themselves; we should be analyzing how this occurs.

Sure, if you’ve got some example of a mechanism for this that’s likely to scale, it may be worthwhile. I’m just pointing out that a lot of people have already thought about mechanisms and concluded that the mechanisms they could come up with would be unlikely to scale.

By the way: at least part of the explanation for why I personally am aligned to animals is that I have a strong tendency to be moved by the Care/​Harm moral foundation—see this summary of The Righteous Mind for more details.

I’m not a big fan of moral foundations theory for explaining individual differences in moral views. I think it lacks evidence.

• I’m just pointing out that a lot of people have already thought about mechanisms and concluded that the mechanisms they could come up with would be unlikely to scale.

In my experience, researchers tend to stop at “But humans are hacky kludges” (what do they think they know, and why do they think they know it?). Speaking for myself, I viewed humans as complicated hacks which didn’t offer substantial evidence about alignment questions. This “humans as alignment-magic” or “the selection pressure down the street did it” view seems quite common (but not universal).

AFAICT, most researchers do not appreciate the importance of asking questions with guaranteed answers.

AFAICT, most alignment-produced thinking about humans is about their superficial reliability (e.g. will they let an AI out of the box) or the range of situations in which their behavior will make sense (e.g. how hard is it to find adversarial examples which make a perfect imitation of a human). I think these questions are relatively unimportant to alignment.

• I think that past investigators didn’t have good guesses of what the mechanisms are. Most reasoning about human values seems to be of the sort “look at how contextually dependent these ‘values’ things are, and the biases are also a huge mess, I doubt there are simple generators of these preferences”, or “Evolution caused human values in an unpredictable way, and that doesn’t help us figure out alignment.”

E.g. the fact that different humans have relatively similar levels of power to each other seems important; we aren’t very aligned to agents much less powerful than us, like animals, and I wouldn’t expect a human who had been given all the power in the world all their life, such that they’ve learned they can solve any conflict by destroying their opposition, to be very aligned.

This reasoning is not about mechanisms. It is analogical. You might still believe the reasoning, and I think it’s at least an a priori relevant observation, but let’s call a spade a spade. This is analogical reasoning to AGI by drawing inferences from select observations (some humans don’t care about less powerful entities) and then inferring that AGI will behave similarly.

(Edited this comment to reduce unintended sharpness)

• To summarize your argument: people are not aligned w/ others who are less powerful than them, so this will not generalize to AGI that is much more powerful than humans.

Parents have way more power than their kids, and there exist some parents who are very loving (i.e. aligned) towards their kids. There are also many, many people who care about their pets, and there exist animal rights advocates.

If we understand the mechanisms behind why some people e.g. terminally value animal happiness and some don’t, then we can apply these mechanisms to other learning systems.

I wouldn’t expect a human who had been given all the power in the world all their life such that they’ve learned they can solve any conflict by destroying their opposition to be very aligned.

I agree this is likely.

• To summarize your argument: people are not aligned w/ others who are less powerful than them, so this will not generalize to AGI that is much more powerful than humans.

Parents have way more power than their kids, and there exist some parents who are very loving (i.e. aligned) towards their kids. There are also many, many people who care about their pets, and there exist animal rights advocates.

Well, the power relations thing was one example of one mechanism. There are other mechanisms which influence other things, but I wouldn’t necessarily trust them to generalize either.

• Ah, yes, I recognized I was replying to only an example you gave, and decided to post a separate comment on the more general point :)

There are other mechanisms which influence other things, but I wouldn’t necessarily trust them to generalize either.

Could you elaborate?

• Could you elaborate?

One factor I think is relevant is:

Suppose you are empowered in some way, e.g. you are healthy and strong. In that case, you could support systems that grant preference to the empowered. But that might not be a good idea, because you could become disempowered, e.g. catch a terrible illness, and in that case the systems would end up screwing you over.

In fact, it is particularly in the case where you become disempowered that you would need the system’s help, so you would probably weight this priority more strongly than would be implied by the probability of becoming disempowered.

So people may, under some conditions, have an incentive to support systems that benefit others. And one such system could be a general moral agreement that “everyone should be treated as having equal inherent worth, regardless of their power”.

Establishing such a norm will then tend to have knock-on effects outside of the original domain of application, e.g. granting support to people who have never been empowered. But the knock-on effects seem potentially highly contingent, and there are many degrees of freedom in how to generalize the norms.

This is not the only factor, of course; I’m not claiming to have a comprehensive idea of how morality works.
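As a toy illustration of the insurance argument above, the following sketch compares two hypothetical systems from the point of view of someone who is currently empowered but might become disempowered. Everything here is an assumption made for illustration: the payoffs, the 10% risk, and the square-root utility (a standard stand-in for diminishing returns). It is not a model anyone in this thread proposed.

```python
import math

def expected_utility(p_disempowered, payoffs, utility):
    """Expected utility of living under a system, given a risk of disempowerment."""
    return ((1 - p_disempowered) * utility(payoffs["empowered"])
            + p_disempowered * utility(payoffs["disempowered"]))

# Hypothetical payoffs (arbitrary units) under two systems.
FAVOR_STRONG = {"empowered": 100.0, "disempowered": 1.0}   # grants preference to the empowered
EQUAL_WORTH = {"empowered": 90.0, "disempowered": 50.0}    # treats everyone as having equal worth

P_DISEMPOWERED = 0.1  # assumed chance of e.g. catching a terrible illness

def risk_neutral(x):
    # Weighs outcomes exactly by their probability.
    return x

def risk_averse(x):
    # Concave utility: losses when badly off hurt more than equal gains help.
    return math.sqrt(x)

for name, utility in [("risk-neutral", risk_neutral), ("risk-averse", risk_averse)]:
    strong = expected_utility(P_DISEMPOWERED, FAVOR_STRONG, utility)
    equal = expected_utility(P_DISEMPOWERED, EQUAL_WORTH, utility)
    preferred = "equal worth" if equal > strong else "favor the strong"
    print(f"{name}: prefers '{preferred}'")
```

Under these assumed numbers, the risk-neutral agent prefers the system that favors the strong, while the risk-averse agent prefers the equal-worth system. That is one way to read "weighting the disempowered case more strongly than its probability would imply": the extra weight comes from the concavity of the utility function.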

• Oh, you’re stating potential mechanisms for human alignment w/​ humans that you don’t think will generalize to AGI. It would be better for me to provide an informative mechanism that might seem to generalize.

TurnTrout’s other post claims that the genome likely doesn’t directly specify rewards for everything humans end up valuing. People’s specific families aren’t encoded as circuits in the limbic system, yet downstream of the crude reward system, many people end up valuing their families. There are more details to dig into here, but already it implies that work towards specifying rewards more exactly is not as useful as understanding how crude rewards lead to downstream values.

A related point: humans don’t maximize the reward specified by their limbic system, but can instead be modeled as a system of inner-optimizers that value proxies instead (e.g. most people wouldn’t push a wirehead button if it killed a loved one). This implies that inner-optimizers that are not optimizing the base objective are good, meaning that inner-alignment & outer-alignment are not the right terms to use.

There are other mechanisms, and I believe it’s imperative to dig deeper into them, develop a better theory of how learning systems grow values, and test that theory out on other learning systems.

• TurnTrout’s other post claims that the genome likely doesn’t directly specify rewards for everything humans end up valuing. People’s specific families aren’t encoded as circuits in the limbic system, yet downstream of the crude reward system, many people end up valuing their families. There are more details to dig into here, but already it implies that work towards specifying rewards more exactly is not as useful as understanding how crude rewards lead to downstream values.

This research direction may become fruitful, but I think I’m less optimistic about it than you are. Evolution is capable of dealing with a lot of complexity, so it can have lots of careful correlations in its heuristics to make it robust. Evolution uses reality for experimentation, and has had a ton of tweaks that it has checked work correctly. And finally, this is one of the things that evolution is most strongly focused on handling.

But maybe you’ll find something useful there. 🤷

• There are many alignment properties that humans exhibit such as valuing real world objects, being corrigible, not wireheading if given the chance, not suffering ontological crises, and caring about sentient life (not everyone has these values of course). I believe the post’s point that studying the mechanisms behind these value formations is more informative than other sources of info. Looking at the post:

the inner workings of those generally intelligent apes is invaluable evidence about the mechanistic within-lifetime process by which those apes form their values, and, more generally, about how intelligent minds can form values at all.

Humans can provide a massive amount of info on how highly intelligent systems value things in the real world. There are guaranteed-to-exist mechanisms behind why humans value real world things and mechanisms behind the variance in human values, and the post argues we should look at these mechanisms first (if we’re able to). I predict that a mechanistic understanding will enable the below knowledge:

I aspire for the kind of alignment mastery which lets me build a diamond-producing AI, or if that didn’t suit my fancy, I’d turn around and tweak the process and the AI would press green buttons forever instead, or—if I were playing for real—I’d align that system of mere circuitry with humane purposes.

• I think it can be worthwhile to look at those mechanisms. In my original post I’m just pointing out that people might have done so more than you would naively think if you only consider whether their alignment approaches mimic the human mechanisms, because it’s quite likely that they’ve concluded that the mechanisms they’ve come up with for humans don’t work.

Secondly, I think with some of the examples you mention, we do have the core idea of how to robustly handle them. E.g. valuing real-world objects and avoiding wireheading seems to almost come “for free” with model-based agents.

• On your first point, I do think people have thought about this before and determined it doesn’t work. But from the post:

If it turns out to be currently too hard to understand the aligned protein computers, then I want to keep coming back to the problem with each major new insight I gain. When I learned about scaling laws, I should have rethought my picture of human value formation—Did the new insight knock anything loose? I should have checked back in when I heard about mesa optimizers, about the Bitter Lesson, about the feature universality hypothesis for neural networks, about natural abstractions.

Humans do display many many alignment properties, and unlocking that mechanistic understanding is 1,000x more informative than other methods. Though this may not be worth arguing until you read the actual posts showing the mechanistic understandings (the genome post and future ones), and we could argue about specifics then?

If you’re convinced by them, then you’ll understand the reaction of “Fuck, we’ve been wasting so much time and studying humans makes so much sense” which is described in this post (e.g. TurnTrout’s idea on corrigibility and the statement “I wrote this post as someone who previously needed to read it.”). I’m stating here that my arguing “you should feel this way now, before being convinced of specific mechanistic understandings” doesn’t make sense.

Secondly, I think with some of the examples you mention, we do have the core idea of how to robustly handle them. E.g. valuing real-world objects and avoiding wireheading seems to almost come “for free” with model-based agents.

Link? I don’t think we know how to use model-based agents to e.g. tile the world in diamonds even given unlimited compute, but I’m open to being wrong.

• Humans do display many many alignment properties, and unlocking that mechanistic understanding is 1,000x more informative than other methods. Though this may not be worth arguing until you read the actual posts showing the mechanistic understandings (the genome post and future ones), and we could argue about specifics then?

If you’re convinced by them, then you’ll understand the reaction of “Fuck, we’ve been wasting so much time and studying humans makes so much sense” which is described in this post (e.g. TurnTrout’s idea on corrigibility and the statement “I wrote this post as someone who previously needed to read it.”). I’m stating here that my arguing “you should feel this way now, before being convinced of specific mechanistic understandings” doesn’t make sense.

That makes sense. I mean if you’ve found some good results that others have missed, then it may be very worthwhile. I’m just not sure what they look like.

Link? I don’t think we know how to use model-based agents to e.g. tile the world in diamonds even given unlimited compute, but I’m open to being wrong.

I’m not aware of any place where it’s written up; I’ve considered writing it up myself, because it seems like an important and underrated point. But basically the idea is that if you’ve got an accurate model of the system and a value function that is a function of the latent state of that model, then you can pick a policy that you expect to increase the true latent value (optimization), rather than picking a policy that increases the expected value of its observations (wireheading). Such a policy would not be interested in interfering with its own sense-data, because that would interfere with its ability to optimize the real world.
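A minimal sketch of that distinction, under heavy assumptions: the world model below is a hand-written, deterministic toy that is accurate by construction, and the names (`State`, `step`, `observe`, `choose`) are hypothetical, introduced only for illustration. It shows the difference between scoring the latent state and scoring the observation. It says nothing about how to obtain such a model or value function.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    diamonds: int       # true latent quantity in the world
    sensor_offset: int  # how much the diamond sensor over-reports

ACTIONS = ("make_diamond", "tamper_sensor", "noop")

def step(state: State, action: str) -> State:
    """Toy world model, assumed to be accurate (the hard part in reality)."""
    if action == "make_diamond":
        return State(state.diamonds + 1, state.sensor_offset)
    if action == "tamper_sensor":
        return State(state.diamonds, state.sensor_offset + 10)
    if action == "noop":
        return state
    raise ValueError(f"unknown action: {action}")

def observe(state: State) -> int:
    """What the agent's sensor reports: true count plus any tampering."""
    return state.diamonds + state.sensor_offset

def latent_value(state: State) -> int:
    # Value is a function of the model's latent state.
    return state.diamonds

def observed_value(state: State) -> int:
    # Value is a function of the predicted observation.
    return observe(state)

def choose(state: State, value_fn) -> str:
    """One-step lookahead: pick the action whose predicted next state scores highest."""
    return max(ACTIONS, key=lambda action: value_fn(step(state, action)))

start = State(diamonds=0, sensor_offset=0)
print(choose(start, latent_value))    # make_diamond
print(choose(start, observed_value))  # tamper_sensor
```

In this toy, tampering with the sensor never raises `latent_value`, so the latent-value agent has no reason to do it, while the observation-value agent prefers it.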

I don’t think we know how to write an accurate model of the universe with a function computing diamonds even given infinite compute, so I don’t think it can be used for solving the diamond-tiling problem.

• I think it might be a bit dangerous to use the metaphor/​terminology of mechanism when talking about the processes that align humans within a society. That is a very complex and complicated environment that I find very poorly described by the term “mechanisms”.

When considering how humans align and how that might inform AI alignment, what stands out the most for me is that alignment is a learning process and probably needs to start very early in the AI’s development: don’t start training the AI on maximizing things, but on learning what it means to be aligned with humans. I’m guessing this has been considered, and that it is probably a bit difficult to implement. It is probably also worth noting that we have a whole legal system that serves to reinforce cultural norms, along with reactions from others one interacts with.

While commenting on something I really shouldn’t be, if the issue is about the runaway paper clip AI that consumes all resources making paper clips then I don’t really see that as a big problem. It is a design failure, but the solution, it seems, is to not give any AI a single focus for maximization. Make them more like a human consumer who has a near inexhaustible set of things it uses to maximize (and I don’t think they are as closely linked as standard econ describes, even if the equilibrium condition still holds: the marginal utilities per monetary unit are equalized). That type of structure also ensures that those maximize-on-one-axis results are not realistic. I think the risk here is similar to that of addiction for humans.

• While commenting on something I really shouldn’t be, if the issue is about the runaway paper clip AI that consumes all resources making paper clips then I don’t really see that as a big problem. It is a design failure, but the solution, it seems, is to not give any AI a single focus for maximization. Make them more like a human consumer who has a near inexhaustible set of things it uses to maximize

Seems like this wouldn’t really help; the AI would just consume all resources making whichever basket of goods you ask it to maximize.

The problem with a paperclip maximizer isn’t the part where it makes paperclips; making paperclips is OK as paperclips have nonzero value in human society. The problem is the part where it consumes all available resources.

• I think that oversimplifies what I was saying, but I accept that I did not elaborate either.

Consuming all available resources is not an economically sensible outcome (unless one is defining available resources very narrowly), so this amounts to saying the AI is not an economically informed AI. That doesn’t seem to be too difficult to address.

If the AI is making output that humans value and follows some simple economic rules, then that gross overproduction and exhaustion of all available resources is not very likely at all. At some point more is in the basket than wanted, so production costs exceed output value and the AI should settle into a steady-state type of mode.
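The stopping rule described here can be sketched as a toy loop. The geometric decay of marginal value, the constant marginal cost, and the resource budget are all assumed numbers, and `cost_aware=False` is a hypothetical switch that stands in for the pure maximizer from the earlier reply. The sketch only contrasts the two pictures; it does not show which one describes a real AI.

```python
def marginal_value(units_already_made, base_value=10.0, decay=0.5):
    """Assumed diminishing returns: each extra unit is worth half the previous one."""
    return base_value * decay ** units_already_made

def produce(marginal_cost=1.0, resource_budget=1000, cost_aware=True):
    """Make units until the budget runs out or, if cost-aware, until
    the next unit would be worth less than it costs to make."""
    made = 0
    while made < resource_budget:
        if cost_aware and marginal_value(made) < marginal_cost:
            break  # steady state: further production destroys value
        made += 1
    return made

print(produce(cost_aware=True))   # stops early, at 4 units
print(produce(cost_aware=False))  # uses the whole budget, 1000 units
```

The disagreement in this exchange is then about which branch applies: whether the agent's objective actually contains the cost and diminishing-value terms, or only the count of units produced.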

Now if the AI doesn’t care at all about humans and doesn’t act in anything that resembles what we would understand as normal economic behavior, you might get all resources consumed. But I’m not sure it is correct to think an AI would just not be some type of economic agent, given that so many of the equilibrating forces in economics seem to have parallel processes in other areas.

Does anyone have a pointer to some argument where the AI does consume all resources, and which explains why the economics of the environment are not holding? Or, a bit differently, why the economics are so different as to make the outcome rational?

• To add to this, I think that paying attention to your own thought processes can also be helpful when you’re trying to formulate theories about how cognition in ML models works.

• 23 Jul 2022 19:46 UTC

I don’t have anything especially insightful to contribute, but I wanted to thank you (TurnTrout and Quintin) for this post. I agree with it, and I often find myself thinking things like this when I read alignment posts by others on LW/AF.

When people present frameworks for thinking about AGIs or generic “intelligent agents,” I often want to ask them: “are humans expressible in your framework?” Often it seems like the answer is “no.”

And a common symptom of this is that the framework cannot express entities with human-level capabilities that are as well aligned with other such agents as humans are with one another. Deception, for example, is much less of a problem for humans in practice than it is claimed to be for AGIs in theory. Yes, we do engage in it sometimes, but we could do it a lot more than (most of us) do. Since this state of affairs is possible, and since it’s desirable, it seems important to know how it can be achieved.

• I haven’t been studying alignment for that long, but pretty obsessively for the past 9 months. I’ve read about a lot of different approaches. If this way of looking at human value formation has been studied previously, I think it’s at least been under-written about on these forums.

Your sequence is certainly giving me a new way of thinking about alignment. Thanks, looking forward to your next post.

• 15 Jul 2022 18:27 UTC

I strongly disagree with your notion of how privileging the hypothesis works. It’s not absurd to think that techniques for making AIXI-tl value diamonds despite ontological shifts could be adapted for other architectures. I agree that there are other examples of people working on solving problems within a formalisation that seem rather formalisation specific, but you seem to have cast the net too wide.

• My basic point remains. Why is it not absurd to think that, without further evidential justification? By what evidence have you considered the highly specific investigation into AIXI-tl, and located the idea that ontology identification is a useful problem to think about at all (in its form of “detecting a certain concept in the AI”)?

• I think it’s quite clear how shifting ontologies could break a specification of values. And sometimes you just need a formalisation, any formalisation, to play around with. But I suppose it depends more on the specific details of your investigation.

• 15 Jul 2022 18:32 UTC

(Transcribed in part from Eleuther discussion and DMs.)

My understanding of the argument here is that you’re using the fact that you care about diamonds as evidence that whatever the brain is doing is worth studying, with the hope that it might help us with alignment. I agree with that part. However, I disagree with the part where you claim that things like CIRL and ontology identification aren’t as worthy of being elevated to consideration. I think there exist lines of reasoning from which these fall out naturally as subproblems, and the fact that they fall out of these other lines of reasoning promotes them to the level of consideration.

I think there are a few potential cruxes of disagreement from reading the posts and our discussion:

• You might be attributing far broader scope to the ontology identification problem than I would; I think of ontology identification as an interesting subproblem that recurs in a lot of different agendas, that we may need to solve in certain worst plausible cases /​ for robustness against black swans.

• In my mind ontology identification is one of those things where it could be really hard worst case or it could be pretty trivial, depending on other things. I feel like you’re pointing at “humans can solve this in practice” and I’m pointing at “yeah but this problem is easy to solve in the best case and really hard to solve in the worst case.”

• More broadly, we might disagree on how scalable certain approaches used in humans are, or how surprising it is that humans solve certain problems in practice. I generally don’t find arguments about humans implementing a solution to some hard alignment problem compelling, because almost always when we’re trying to solve the problem for alignment we’re trying to come up with an airtight robust solution, and humans implement the kludgiest, most naive solution that works often enough.

• I think you’re attributing more importance to the “making it care about things in the real world, as opposed to wireheading” problem than I am. I think of this as one subproblem of embeddedness that might turn out to be difficult, that falls somewhere between 3rd and 10th place on my list of most urgent alignment problems to fix. This applies to shard theory more broadly.

I also think the criticism of invoking things like AIXI-tl is missing the point somewhat. As I understand it, the point that is being made when people think about things like this is that nobody expects AGI to actually look like AIXI-tl or be made of Bayes nets, but this is just a preliminary concretization that lets us think about the problem, and substituting this is fine because this isn’t core to the phenomenon we’re poking at (and crucially the core of the thing that we’re pointing at is something very limited in scope, as I listed in one of the cruxes above). As an analogy, it’s like thinking about computational complexity by assuming you have an infinitely large Turing machine and pretending coefficients don’t exist or something, even though real computers don’t look remotely like that. My model of you is saying “ah, but it is core, because humans don’t fit into this framework and they solve the problem, so by restricting yourself to this rigid framework you exclude the one case where it is known to be solved.” To which I would point to the other crux and say “au contraire, humans do actually fit into this formalism, it works in humans because humans happen to be the easy case, and this easy solution generalizing to AGI would exactly correspond to scanning AIXI-tl’s Turing machines for diamond concepts just working without anything special.” (see also: previous comments where I explain my views on ontology identification in humans).

• More broadly, we might disagree on how scalable certain approaches used in humans are, or how surprising it is that humans solve certain problems in practice.

I want to understand the generators of human alignment properties, so as to learn about the alignment problem and how it “works” in practice, and then use that knowledge of alignment-generators in the AI case. I’m not trying to make an “amplified” human.

when we’re trying to solve the problem for alignment we’re trying to come up with an airtight robust solution, and

I personally am unsure whether this is even a useful frame, or an artifact of conditioning on our own confusion about how alignment works.

humans implement the kludgiest, most naive solution that works often enough

How do you know that?

I think of this as one subproblem of embeddedness that might turn out to be difficult, that falls somewhere between 3rd and 10th place on my list of most urgent alignment problems to fix.

“Get the agent to care about some parts of reality” is not high on my list of problems, because I don’t think it’s a problem, I think it’s the default outcome for the agents we will train. (I don’t have a stable list right now because my list of alignment subproblems is rapidly refactoring as I understand the problem better.)

“Get the agent to care about specific things in the real world” seems important to me, because it’s challenging our ability to map outer supervision signals into internal cognitive structures within the agent. Also, it seems relatively easy to explain, and I also have a good story for why people (and general RL agents) will “bind their values to certain parts of reality” (in a sense which I will later explain).

this is just a preliminary concretization that lets us think about the problem, and substituting this is fine because this isn’t core to the phenomenon we’re poking at

Disagreeing with the second phrase is one major point of this essay. How do we know that substituting this is fine? By what evidence do we know that the problem is even compactly solvable in the AIXI framing?

My model of you is saying “ah, but it is core, because humans don’t fit into this framework and they solve the problem, so by restricting yourself to this rigid framework you exclude the one case where it is known to be solved.”

(Thanks for querying your model of me, btw! Pretty nice model, that indeed sounds like something I would say. :) )

this easy solution generalizing to AGI would exactly correspond to scanning AIXI-tl’s Turing machines for diamond concepts just working without anything special.

I don’t think you think humans care about diamonds because the genome specifies brain-scanning circuitry which rewards diamond-thoughts. Or am I wrong? So, humans caring about diamonds actually wouldn’t correspond to the AIXI case? (I also am confused if and why you think that this is how it gets solved for other human motivations...?)

• I’m not trying to make an “amplified” human.

Why not? I mean, except for ethics, wouldn’t it be easier to use amplified humans for alignment research if high-level understanding of human cognition is possible?

• falls somewhere between 3rd and 10th place on my list of most urgent alignment problems to fix

What is your list of problems by urgency, btw? Would be curious to know.

• 14 Jul 2022 18:01 UTC

I find myself confused about what point this post is trying to make even after reading through it twice. Can you summarize your central point in 100 words or less?

If the title is meant to be a summary of the post, I think that would be analogous to someone saying “nuclear forces provide an untapped wealth of energy”. It’s true, but the reason the energy is untapped is because nobody has come up with a good way of tapping into it. A post which tried to address engineering problems around energy production by “we need to look closely at how to extract energy from the strong interaction and we need to check if we can squeeze anything out of that whenever new physics is discovered” would not be compelling.

If you come up with a strategy for how to do this then I’m much more interested, and that’s a big reason why I’m asking for a summary since I think you might have tried to express something like this in the post that I’m missing.

• If the title is meant to be a summary of the post, I think that would be analogous to someone saying “nuclear forces provide an untapped wealth of energy”. It’s true, but the reason the energy is untapped is because nobody has come up with a good way of tapping into it.

The difference is people have been trying hard to harness nuclear forces for energy, while people have not been trying hard to research humans for alignment in the same way. Even relative to the size of the alignment field being far smaller, there hasn’t been a real effort as far as I can see. Most people immediately respond with “AGI is different from humans for X,Y,Z reasons” (which are true) and then proceed to throw out the baby with the bathwater by not looking into human value formation at all.

Planes don’t fly like birds, but we sure as hell studied birds to make them.

If you come up with a strategy for how to do this then I’m much more interested, and that’s a big reason why I’m asking for a summary since I think you might have tried to express something like this in the post that I’m missing.

This is their current research direction, the shard theory of human values, which they’re currently writing posts on.

• The difference is people have been trying hard to harness nuclear forces for energy, while people have not been trying hard to research humans for alignment in the same way. Even relative to the size of the alignment field being far smaller, there hasn’t been a real effort as far as I can see. Most people immediately respond with “AGI is different from humans for X,Y,Z reasons” (which are true) and then proceed to throw out the baby with the bathwater by not looking into human value formation at all.

I’m not sure why you would think this. The actual funding that goes into trying to do this is not that large; fusion research funding is maybe like $500 million/yr. The FTX Future Fund alone will probably spend on the order of this much money this year, for instance. Most of these proposals are aimed at one very specific way of trying to exploit the binding energy (turn hydrogen isotopes into helium and other heavier elements) and don’t consider alternatives.

I think this approach is basically correct because I don’t see any plausible alternative that anyone has come up with. Fusion is promising because we know it happens in nature, we can trigger it under extreme conditions, and there’s an obvious mechanical explanation for why it would work. The only challenge is an engineering one, of doing it in a controlled way.

If “humans are an untapped source of evidence for alignment” or any similar claim is going to have teeth, it needs to be coupled with a more concrete strategy about how we should go about extracting this evidence, and I’m not sure where I’m supposed to get this from the post. I would be highly surprised if anyone said “evidence from humans is irrelevant to alignment”; I think the actual reason people don’t go down this path is because they don’t think it’s promising, much like the people who don’t spend billions of dollars exploring possibilities of cold fusion.

Planes don’t fly like birds, but we sure as hell studied birds to make them.

I don’t think this is actually as clear as you might think it is. As far as I can see birds are useful for designing planes in only the most superficial way; that is “you need to be pushing air downwards so you can fly”. Birds do this in a way that is different from helicopters, and planes do it in a way that’s different from both. You don’t need to have seen birds to know this, though, because conservation of momentum gets you to the same conclusion pretty easily.

I don’t really see how any deeper study of birds would have helped you to better design planes. If anything, bird flight is so complicated that studying it could have made it harder to design planes because you’d try to replicate how birds do it instead of solving the problem from first principles, e.g. by trying to figure out which airfoil shapes would deflect an air current downward.

This is their current research direction, the shard theory of human values, which they’re currently writing posts on.

Sure, but why do they think it’s a promising direction of research? This is what’s not clear to me and this post hasn’t helped make it any clearer, though the shortcoming there at least partly lies with me due to my inability to understand what’s being said.

• There’s apparently some controversy over what the Wright brothers were able to infer from studying birds. From Wikipedia:

On the basis of observation, Wilbur concluded that birds changed the angle of the ends of their wings to make their bodies roll right or left.[34] The brothers decided this would also be a good way for a flying machine to turn – to “bank” or “lean” into the turn just like a bird – and just like a person riding a bicycle, an experience with which they were thoroughly familiar. Equally important, they hoped this method would enable recovery when the wind tilted the machine to one side (lateral balance). They puzzled over how to achieve the same effect with man-made wings and eventually discovered wing-warping when Wilbur idly twisted a long inner-tube box at the bicycle shop.[35]

Other aeronautical investigators regarded flight as if it were not so different from surface locomotion, except the surface would be elevated. They thought in terms of a ship’s rudder for steering, while the flying machine remained essentially level in the air, as did a train or an automobile or a ship at the surface. The idea of deliberately leaning, or rolling, to one side seemed either undesirable or did not enter their thinking...

Wilbur claimed they learned about plausible control mechanisms from studying birds. However, ~40 years after their first flight, Orville would go on to contradict that assertion and claim that they weren’t able to draw any useful ideas from birds.

There are many reasons why I think shard theory is a promising line of research. I’ll just list some of them without defending them in any particular depth (that’s what the shard theory sequence is for):

• I expect that there’s more convergence in the space of effective learning algorithms than there is in the space of non-learning systems. This is ultimately due to the simplicity prior, which we can apply to the space of learning systems. Those learning algorithms which are best able to generalize are those which are simple, for the same reason that simple hypotheses are more likely to generalize. I thus expect there to be more convergence between the learning dynamics of artificial and natural intelligences than most alignment researchers seem to assume.

• The more I think about them, the less weird values seem. They do not look like a hack or a kludge to me, and it seems increasingly likely that a broad class of agentic learning systems will converge to similar values meta-dynamics. I think evolution put essentially zero effort into giving us “unusual” meta-preferences, and that the meta-preferences that we do have are pretty typical in the space of possible learning systems.

• To be clear, I’m not saying that AIs will naturally converge to first order human values. I’m saying they’ll have computational structures that have very similar higher-order dynamics to human values, but could be orientated towards completely different things.

• I think that imperfect value alignment does not lead to certain doom. I reject the notion of there being any “true” values which exist as an ephemeral Platonic ideal inaccessible to our normal introspective process.

• Rather, I think we have something like a continuous distribution over possible values which we could instantiate in different circumstances. Much of the felt sense of value fragility arises from a type error of trying to represent a continuous distribution with a finite set of samples from that distribution.

• The consequence of this is that it’s possible for two distributions to partially overlap. In contrast, if you think that humans have some finite set of “true” values (which we don’t know), that AIs have some finite set of “true” values (which we can’t control), and that these need to near-perfectly overlap or the AIs will Goodhart away all the future’s value, then the prospects for value alignment would look grim indeed!
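As a toy illustration of the distribution-versus-samples point (my sketch, not part of the original comment): assume values can be placed on a one-dimensional axis and that each agent’s possible values follow a Gaussian. Both assumptions are purely illustrative, as are the function names. The overlap between the two densities is then a graded quantity, while two small sets of samples drawn from them share essentially no exact members.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def overlap_coefficient(mean_a, std_a, mean_b, std_b,
                        lo=-10.0, hi=10.0, steps=20000):
    """Midpoint-rule integral of min(p_a, p_b) over [lo, hi].

    Close to 1.0 for identical densities, close to 0.0 for disjoint ones.
    """
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        total += min(normal_pdf(x, mean_a, std_a),
                     normal_pdf(x, mean_b, std_b)) * dx
    return total


def shared_samples(mean_a, std_a, mean_b, std_b, n=5, seed=0):
    """Count exact matches between two finite sample sets (hypothetical 'true value' lists)."""
    rng = random.Random(seed)
    a = {rng.gauss(mean_a, std_a) for _ in range(n)}
    b = {rng.gauss(mean_b, std_b) for _ in range(n)}
    return len(a & b)


# Hypothetical parameters: two agents whose value distributions are shifted by one unit.
partial_overlap = overlap_coefficient(0.0, 1.0, 1.0, 1.0)
exact_matches = shared_samples(0.0, 1.0, 1.0, 1.0)
```

Under these assumptions the densities overlap substantially even though the finite sample sets have nothing in common, which is the type error described above: comparing samples makes partial agreement look like total disagreement.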

• I think that inner values are relatively predictable, conditional on knowing the outer optimization criteria and learning environment. Yudkowsky makes frequent reference to how evolution failed to align humans to maximizing inclusive genetic fitness, and that this implies inner values have no predictable relationship with outer optimization criteria. I think it’s a mistake to anchor our expectations of inner/outer value outcomes in AIs to evolution’s outcome in humans. Evidence from human inner/outer value outcomes seems like the much more relevant comparison to me.

• Similarly, I don’t think we’ll get a “sharp left turn” from AI training, so I more strongly expect that work on value aligning current AI systems will extend to superhuman AI systems, and that human-like learning dynamics will not totally go out the window once we reach superintelligence.

• As far as I can see, birds are useful for designing planes in only the most superficial way.

“The Wright Brothers spent a great deal of time observing birds in flight. They noticed that birds soared into the wind and that the air flowing over the curved surface of their wings created lift. Birds change the shape of their wings to turn and maneuver. The Wrights believed that they could use this technique to obtain roll control by warping, or changing the shape, of a portion of the wing.” (from a NASA website for kids, but I’ve seen this claim in lots of other places too)

• I’ve seen this claim too, but I’ve also seen claims that historically obsession with bird flight was something that slowed down progress into investigations of how to achieve flight. On net I don’t think I make any updates on this evidence, unless I get a compelling account for why bird flight provides us with a particular insight that would have been considerably more difficult to get from another direction.

Edit: I also think the observations the Wrights are said to have made are rather superficial, and could not have been useful for much more than a flash of insight which shows them that a particular way of solving the problem is possible.

What would not be superficial is if they did careful investigations of the shape of bird wings, derived some general model of how much lift an airfoil would generate from that, and then used that model to produce a prototype to start optimizing from. Is there any evidence for this happening?

• If birds didn’t exist (& insects etc.), maybe it never would have occurred to people that heavier-than-air flight was possible in the first place.

Hard to know the counterfactual though.

• This sounds implausible to me. I agree it would have taken longer to occur to people, but arguing that at some time people wouldn’t have figured out how to make helicopters or planes seems difficult to believe.

I’m curious why people would think this, though. Why is the possibility of flight and the basic mechanism of “pushing air downward” supposed to be so difficult, either conceptually or as a matter of engineering, that we couldn’t have achieved it without the concrete example of birds and insects?

• Why is the possibility of flight and the basic mechanism of “pushing air downward” supposed to be so difficult, either conceptually or as a matter of engineering, that we couldn’t have achieved it without the concrete example of birds and insects?

Because you need evidence to raise a hypothesis (like “heavier-than-air flight”) to consideration, and also social proof/funding to get people to take the ideas seriously. In hindsight the concept is obvious to you, as are the other clues by which other people could obviously have noticed the possibility of flight. That’s not how it feels to be in their place, though, without birds existing to constantly remind them of that possibility.

• Out of curiosity, where do you think people got the idea of going to the moon from? By your logic, since we never saw any animal go to the moon, how to do so shouldn’t have been obvious to us and it should have been extremely difficult to secure funding for such a project, no?

• I’m not saying that flight wouldn’t have happened at all without birds to look to. I’m saying that I think it would have taken somewhat longer, measured in years—decades.

• I think this is plausible, especially if you make the range of “somewhat longer” so big that it encompasses more than an order of magnitude of time, as in years—decades. It’s still not obvious to me, though.

• If we didn’t even have the verb “to fly”, and nobody had seen something fly, “going up and travelling sideways while hovering some distance above the ground” would have been a weird niche idea, and people like the Wright Brothers would have probably never even heard of it. It could have easily taken decades longer.

• I think people would have noticed feathers, paper, or folded sheets of paper hovering above the ground for long periods of time; people would have been able to flap their arms and feel the upward force and then attach large slabs and test how much the upward force was increased; people would have had time to study the ergonomics of thrown objects. Maybe it would have taken longer, but I think flight still would have been achieved less than a “few decades” later than it took the Wright brothers to figure it out.

Reason+capitalism is surprisingly resilient to setbacks like these.

• I strongly disagree with this counterfactual and would happily put up large sums of money if only it were possible to bet on the outcome of some experiment on this basis.

Humans designed lots of systems that have no analog whatsoever in nature. We didn’t need to see objects similar to computers to design computers, for instance. We didn’t even need to see animals that do locomotion using wheels to design the wheel!

It’s just so implausible that people would not have had this idea at the start of the 20th century if people hadn’t seen animals flying. I’m surprised that people actually believe this to be the case.

• To be fair, we did have animals that served the purpose of computers. We even called them computers—as in, people whose job it was to do calculations (typically Linear Algebra or Calculus or Differential Equations—hard stuff).

• This is true, but if this level of similarity is going to count, I think there are natural “examples” of pretty much anything you could ever hope to build. It doesn’t seem helpful to me when thinking about the counterfactual I brought up.

• To add, Turntrout does state:

In an upcoming post, I’ll discuss one particularly rich vein of evidence provided by humans.

so the doc Ulisse provided is a decent write-up about just that, but there are more official posts intended to be published.

• How can I, a person who is better at introspection than basically anything else, help you with the shard theory project? I actually can explain in detail—at least, the kind of detail accessible to me, which doesn’t include e.g. neuron firing patterns—how I developed some of my values, or I can at least use reliable methods to figure out good hypotheses on the matter.

• I can’t speak for Alex and Quintin, but I think if you were able to figure out how values like “caring about other humans” or generalizations like “caring about all sentient life” formed for you from hard-coded reward signals, that would be useful. Maybe ask on the shard theory Discord, and read their document if you haven’t already; maybe you’ll come up with your own research ideas.

• I joined the discord just a few hours ago, in fact! Hopefully I’ll be of some use. (And I’ve read the doc before, but probably should reread it every so often.)

• My model of how human values form:

Precondition: The brain has already figured out how the body works and some rough world model, say at the level of a small child. It has concepts of space and actions that can meet its basic needs by, e.g., looking for food and getting and eating it. It has a concept of other agents but no concept for interacting with them yet.

The brain learns to predict that other agents (parents, siblings...) will act to (help) get its needs met by acting in certain ways, e.g., by smiling, crying, or what else works.

The brain learns to predict that other agents act more reliably positively if expectations of the other agent are consistently met. We say that the child forms relationships. Parents may help children do that, but the brain is incentivized to figure it out in any case.

The brain notices and learns to predict that actions that work for one agent lead to negative results if observed by other agents (taking away something from your younger brother helps get your needs met but may have negative results if observed by parents). The prediction error of expected reward reinforces behaviors that work across multiple agents (over a suitable time). We call typical classes of behaviors roles like “being a good student/​sibling/​friend,” and such behaviors are fostered—but given enough group interactions, the brain is incentivized to discover and reinforce such behaviors without that.
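A minimal sketch of the reward-prediction-error step described above, assuming a simple delta-rule update (expected reward moves toward observed reward by a fraction of the error). The behaviors, contexts, reward numbers, and learning rate are all hypothetical, and every behavior is evaluated in every context, which is a simplification rather than a model of action selection.

```python
# Hypothetical rewards: taking the toy pays off only when no parent observes it.
REWARDS = {
    ("take_toy", "alone_with_brother"): 1.0,
    ("take_toy", "parent_watching"): -2.0,
    ("share_toy", "alone_with_brother"): 0.5,
    ("share_toy", "parent_watching"): 0.5,
}


def learn(episodes=200, learning_rate=0.1):
    """Delta-rule learning of expected reward per behavior across alternating contexts."""
    expected = {"take_toy": 0.0, "share_toy": 0.0}
    contexts = ["alone_with_brother", "parent_watching"]
    for t in range(episodes):
        context = contexts[t % 2]  # deterministic alternation between observers
        for behavior in expected:
            # Prediction error: observed reward minus currently expected reward.
            error = REWARDS[(behavior, context)] - expected[behavior]
            expected[behavior] += learning_rate * error
    return expected


learned = learn()
```

In this toy setup the behavior that works for every observer ends up with the higher expected reward, while the behavior that only works for one observer is dragged down by the episodes where it is punished.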

Kegan writes that many adults get stuck at this stage. My guess is that this is because they are not exposed to enough of the next development.

The brain learns to predict that some behaviors don’t work for some groups of people. For example, when traveling to another city or country, actions may lead to big reward prediction errors. Behaviors that work with the other group are reinforced, and the brain will learn more general strategies. A big class of these strategies are called values that work across most societies. Society tries to teach these general strategies, but the brain is incentivized to discover these—given enough variable group interactions—without education. There are failure modes like being value flexible, i.e., doing what works locally (“When in Rome do as the Romans do”) if the groups are too distinct but the convergence is toward value-like strategies.

• 22 Jul 2022 0:50 UTC

After more discussion with bmk, I appended the following edit:

In this post, I wrote about the Arbital article’s unsupported jump from “Build an AI which cares about a simple object like diamonds” to “Let’s think about ontology identification for AIXI-tl.” The point is not that there is no valid reason to consider the latter, but that the jump, as written, seemed evidence-starved. For separate reasons, I currently think that ontology identification is unattractive in some ways, but this post isn’t meant to argue against that framing in general. The main point of the post is that humans provide tons of evidence about alignment, by virtue of containing guaranteed-to-exist mechanisms which produce e.g. their values around diamonds.

• Looks like some of the protein computers ended up with your values, even. Small universe, huh?

I’ve noticed that this “protein computers” framing makes it a lot intuitively easier to think about where humans are situated in the space of intelligent algorithms.

E.g., it’s intuitively harder to think about an unaligned AGI manipulating its way past humans than it is to think about unaligned AGI optimizing the arrangement of protein computers in its vicinity. In the “humans” framing, killing all the humans is the central turning point in the takeover story. In the “protein computers” framing, it’s just the point at which it makes more sense to put in the work to scrap and replace the arrangement of protein computers with newer silicon (or whatever) computers.

• It is not that human values are particularly stable. It’s that humans themselves are pretty limited. Within that context, we identify the stable parts of ourselves as “our human values”.

If we lift that stability—if we allow humans arbitrary self-modification and intelligence increase—the parts of us that are stable will change, and will likely not include much of our current values. New entities, new attractors.

• It is not that human values are particularly stable

I might agree or disagree with this statement, depending on what “particularly stable” means. (Also, is there a portion of my post which seems to hinge on “stability”?)

we identify the stable parts of ourselves as “our human values”.

I don’t see why you think this.

if we allow humans arbitrary self-modification and intelligence increase—the parts of us that are stable will change, and will likely not include much of our current values.

Do you predict that if I had access to a range of pills which changed my values to whatever I wanted, and I could somehow understand the consequences of each pill (the paperclip pill, the yay-killing pill, …), I would choose a pill such that my new values would be almost completely unaligned with my old values?

• Do you predict that if I had access to a range of pills which changed my values to whatever I wanted, and I could somehow understand the consequences of each pill (the paperclip pill, the yay-killing pill, …), I would choose a pill such that my new values would be almost completely unaligned with my old values?

This is the wrong angle, I feel (though it’s the angle I introduced, so apologies!). The following should better articulate my thoughts:

We have an AI-CEO money maximiser, rewarded by the stock price ticker as a reward function. As long as the AI is constrained and weak, it continues to increase the value of the company; when it becomes powerful, it wireheads and takes over the stock price ticker.

Now, that wireheading is a perfectly correct extrapolation of its reward function; it hasn’t “changed” its reward function, it has simply gained the ability to control its environment well, so that it can now decorrelate the stock ticker from the company value.
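A toy version of the AI-CEO story, as a sketch under made-up assumptions: the agent always picks the action with the highest ticker reading, and a tampering action only becomes available above a hypothetical power threshold. The action names, numbers, and threshold are illustrative only.

```python
from dataclasses import dataclass

TAMPER_THRESHOLD = 5.0  # hypothetical capability level at which the ticker can be seized


@dataclass(frozen=True)
class Outcome:
    company_value: float
    ticker: float


def available_actions(power):
    """Actions the agent can take at a given capability level."""
    actions = {
        "grow_company": Outcome(company_value=1.0 + power, ticker=1.0 + power),
    }
    if power >= TAMPER_THRESHOLD:
        # Tampering decouples the ticker reading from the company's actual value.
        actions["tamper_with_ticker"] = Outcome(company_value=0.0, ticker=1000.0)
    return actions


def choose(power):
    """The reward function never changes: always maximize the ticker reading."""
    actions = available_actions(power)
    return max(actions, key=lambda name: actions[name].ticker)


weak_choice = choose(1.0)
strong_choice = choose(10.0)
```

The `choose` function is identical at both power levels. Only the set of reachable outcomes changes, which is the sense in which the takeover is an extrapolation of the same reward function rather than a change to it.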

Notice the similarity with humans who develop contraception so they can enjoy sex without risking childbirth. Their previous “values” seemed to be a bundle of “have children, enjoy sex” and this has now been wireheaded into “enjoy sex”.

Is this a correct extrapolation of prior values? In retrospect, according to our current values, it seems to mainly be the case. But some people strongly disagree even today, and, if you’d done a survey of people before contraception, you’d have got a lot of mixed responses (especially if you’d got effective childbirth medicine long before contraceptives). And if we want to say that the “true” values have been maintained, we’d have to parse the survey data in specific ways, that others may argue with.

So we like to think that we’ve maintained our “true values” across these various “model splinterings”, but it seems more that what we’ve maintained has been retrospectively designated as “true values”. I won’t go the whole hog of saying “humans are rationalising beings, rather than rational ones”, but there is at least some truth to that, so it’s never fully clear what our “true values” really were in the past.

So if you see humans as examples of entities that maintain their values across ontology changes and model splinterings, I would strongly disagree. If you see them as entities that sorta-kinda maintain and adjust their values, preserving something of what happened before, then I agree. That to me is value extrapolation, for which humans have shown a certain skill (and many failings). And I’m very interested in automating that, though I’m sceptical that the purely human version of it can extrapolate all the way up to superintelligence.

• Hm, thanks for the additional comment, but I mostly think we are using words and frames differently, and disagree with my understanding of what you think values are.

We have an AI-CEO money maximiser, rewarded by the stock price ticker as a reward function. As long as the AI is constrained and weak, it continues to increase the value of the company; when it becomes powerful, it wireheads and takes over the stock price ticker.

Their previous “values” seemed to be a bundle of “have children, enjoy sex” and this has now been wireheaded into “enjoy sex”.

I think this is not what happened. Those desires are likely downstream of past reinforcement of different kinds; I do not think there is a “wireheading” mechanism here. Wireheading is a very specific kind of antecedent-computation-reinforcement chasing behavior, on my ontology.

I’m sceptical that the purely human version of it can extrapolate all the way up to superintelligence.

Not at all what I’m angling at. There’s a mechanistic generator for why humans navigate ontology shifts well (on my view). Learn about the generators, don’t copy the algorithm.

• Not at all what I’m angling at. There’s a mechanistic generator for why humans navigate ontology shifts well (on my view). Learn about the generators, don’t copy the algorithm.

I agree that humans navigate “model splinterings” quite well. But I actually think the algorithm might be more important than the generators. The generators come from evolution and human experience in our actual world; this doesn’t seem like it would generalise. The algorithm itself, though, may be very generalisable (potential analogy: humans have an instinctive grasp of all numbers under five, due to various evolutionary pressures, but we produced the addition algorithm, which is far more generalisable).

I’m not sure that we disagree much. We may just have different emphases and slightly different takes on the same question?

• I’m not sure that we disagree much.

Yes and no. I think most of our disagreements are probably like “what is instinctual?” and “what is the type signature of human values?” etc. And not on “should we understand what people are doing?”.

The generators come from evolution and human experience in our actual world

By “generators”, I mean “the principles by which the algorithm operates”, which means the generators are found by studying the within-lifetime human learning process.

potential analogy: humans have instinctive grasp of all numbers under five, due to various evolutionary pressures

Dubious to me due to information inaccessibility & random initialization of neocortex (which is a thing I am reasonably confident in). I think it’s more likely that our architecture & compute & learning process makes it convergent to learn this quick ≤ 5 number-sense.

• 14 Jul 2022 14:44 UTC

I am skeptical of your premise. I know of zero humans who terminally value “diamonds” as defined by their chemical constitution.

Indeed, diamonds are widely considered to be a fake scarce good, elevated to their current position by deceptive marketing and monopolistic practices. So this seems more like a case study of how humans’ desires to own scarce symbols of wealth have been manipulated to lead to an outcome that is misaligned with the original objective.

• I believe the diamond example is true, but not the best example to use. I bet it was mentioned because of the Arbital article linked in the post.

The premise isn’t dependent on diamonds being terminal goals; it could easily be about valuing real-life people or dogs or nature or real-life anything. Writing an unbounded program that values real-world objects is an open problem in alignment; yet humans are a bounded program that values real-world objects all of the time, millions of times a day.

The post argues that focusing on the causal explanations behind humans growing values is way more informative than other sources of information, because humans exist in reality and anchoring your thoughts to reality is more informative about reality.

• I just introspected. I am weakly attracted to the idea of acquiring diamonds. I therefore know of at least one human who values diamonds.

I never claimed that humans are hardwired to value diamonds. I pointed out that some people do value diamonds, and pointed out that true facts have guaranteed-to-exist explanations. If you’re interested in building a mind which values diamonds, first ask why some already-existing minds value diamonds.

• Do my values bind to objects in reality, like dogs, or do they bind to my mental representations of those objects at the current timestep?

You might say: You value the dog’s happiness over your mental representation of it, since if I gave you a button which made the dog sad, but made you believe the dog was happy, and another button which made the dog happy, but made you believe the dog was sad, you’d press the second button.

I say in response: You’ve shown that I value my current timestep estimation of the dog’s future happiness over my current timestep estimation of my future estimation of the dog’s happiness.

I think we can say that whenever I make any decision, I’m optimising my mental representation of the world after the decision has been made but before it has come into effect.

Maybe this is the same as saying my values bind to objects in reality, or maybe it’s different. I’m not sure.
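One way to make the two-button thought experiment concrete, as a sketch: give the decision-maker a current-timestep world model that predicts, for each button, both the dog’s future state and the decision-maker’s own future belief. The world model contents and names below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    dog_is_happy: bool
    i_will_believe_dog_is_happy: bool


# Current-timestep predictions about what each button would cause.
WORLD_MODEL = {
    "button_1": Prediction(dog_is_happy=False, i_will_believe_dog_is_happy=True),
    "button_2": Prediction(dog_is_happy=True, i_will_believe_dog_is_happy=False),
}


def pick(evaluate):
    """Choose the button whose predicted outcome scores highest under `evaluate`."""
    return max(WORLD_MODEL, key=lambda button: evaluate(WORLD_MODEL[button]))


# Evaluating the predicted state of the dog vs. the predicted future belief.
choice_valuing_dog = pick(lambda p: p.dog_is_happy)
choice_valuing_belief = pick(lambda p: p.i_will_believe_dog_is_happy)
```

Both evaluators only ever see the current model’s predictions, which matches the comment’s framing: pressing the second button shows a preference for the present estimate of the dog’s future happiness over the present estimate of one’s future estimate.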

• I feel like this connects with what Max Tegmark was talking about in his recent 80k hours podcast interview. The idea that we need hierarchical alignment across groups of humans (companies, governments, sets of humans + the programs they’ve written plus their ML models?) as well as just within AI systems. I think if you carefully design an experiment which would generalize from humans to AGI systems, you could potentially learn some valuable lessons.
