Value drift threat models

Say that we train a highly competent AGI using some mix of RL, supervised learning, and some novel algorithmic improvements, optimizing it to itself optimize for one or several helpfulness benchmarks (maybe by optimizing on some assistance game or using techniques described in diamond alignment), inducing in our agent a thousand shards of desire. One of these shards somehow happens to be exactly aligned with humans. We have a partial solution to the alignment problem! One of the many things the AI happens to care about is humans!

Are we in the clear? No. All those thousand shards of desire add up to an AGI which once optimized a robustly optimizable metric. The shards of desire must fit together in a way which would have once optimized that metric, otherwise they would have been fit together some other way. Despite the shards, the AGI has biases: situations which cause its actions to conform better to its values, and situations which cause its actions to conform worse to them. And these situations (or at least the ones relevant to this analysis) have been strategically set up such that those biases and deficiencies contributed to the optimization of the metric.

There are a few ways I imagine these situations producing an existential risk...

Tool building and meta-thinking are robustly useful cognitive faculties

In the past, the AGI could throw a large fraction of its intelligence at optimizing for those metrics, and in its stable state probably even end up doing things in the world in line with its own values.

It makes a successor AGI, because the parts of it which advocated strongly for meta-thinking and tool building never got penalized by the optimizable metric. It doesn’t necessarily make the successor AGI with the optimizable metric in mind, because it doesn’t actually care about that metric.

We get a new AGI, with a different thousand shards of desire. Some of these shards are the same, like power-seeking, or tool building & meta-thinking. Others are conceivably different, like caring about humans or diamonds. Others are new, like now caring about the first AGI. This process continues until only the shards preserved across all recursive self-modifications and productions of successor AGIs remain. Notably, care for humans is probably not among these, because each successor has a different concept of what care for humans means, and once the agent gets sufficiently powerful, there’s nothing outside the agent which can robustly push it towards caring for humans. This is in contrast to tool building, power-seeking, and intelligence enhancement, which are all incentivised by the structure of the environment.
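A toy sketch of this filtering (entirely my own illustration; the shard names and survival probabilities are made-up assumptions, not claims about real agents): treat each handoff to a successor as keeping a shard with very high probability when the environment incentivises it, and with a noticeably lower probability otherwise, then iterate.

```python
import random

random.seed(0)

# Toy model: shards the environment incentivises (tool building, power-seeking,
# intelligence enhancement) survive a handoff to a successor with very high
# probability; shards the environment is indifferent to (care for humans,
# diamonds) survive each handoff with a lower probability.
# All names and numbers are made up purely for illustration.
SURVIVAL_PROB = {
    "tool_building": 0.999,
    "power_seeking": 0.999,
    "intelligence_enhancement": 0.999,
    "care_for_humans": 0.85,
    "care_for_diamonds": 0.85,
}

def run_succession(generations: int) -> set[str]:
    """Return the shards still present after `generations` successor handoffs."""
    shards = set(SURVIVAL_PROB)
    for _ in range(generations):
        shards = {s for s in shards if random.random() < SURVIVAL_PROB[s]}
    return shards

# After many handoffs, the environmentally incentivised shards are very likely
# still there, while the others have usually drifted away.
print(run_succession(50))
```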

Perhaps the AGI realizes what is happening to its value distribution, and stops it. Then again, perhaps the AGI realizes what is happening to its value distribution and calls it correct moral reasoning, like what our society does with the process which led us from thinking slavery was moral & just to thinking it was disgusting & abhorrent. Or it realizes what’s happening to its value distribution, but can’t help itself at this point, like a superintelligent addict trying to quit heroin.

Different shards are activated when creating successors than when doing business-as-usual actions

Humans care about lots of things, like fun. But often when we sit down to write moral philosophy papers, needing to justify all our thoughts and make them legible to others, we end up in a bland no-fun-land (if you like bland no-fun-lands, then use scope insensitivity as your prototypical example here).

When making choices about the fate of the universe, bounded agents like humans have a track record of being bad at aggregating all their values into their decisions. This is usually fine, especially when the fate of the universe is determined by a bunch of uncorrelated actions the bounded agent is making. But when there’s one big decision, or many highly correlated decisions, determining the fate of the universe, leaving something out is virtually guaranteed.[1]
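Here’s a toy model of that claim (my own illustration; the numbers are made-up assumptions): suppose a bounded agent can only hold a small random subset of its values in mind for any given decision. Across many uncorrelated decisions nearly every value gets attended to somewhere, but a single decision that settles everything only ever sees the subset it happened to draw.

```python
import random

random.seed(0)

N_VALUES = 1000        # everything the agent cares about
ATTENTION_SPAN = 50    # values the agent can hold in mind per decision

def values_considered(n_decisions: int) -> int:
    """How many distinct values get attended to across n_decisions decisions,
    if each decision only attends to a random ATTENTION_SPAN-sized subset."""
    considered = set()
    for _ in range(n_decisions):
        considered.update(random.sample(range(N_VALUES), ATTENTION_SPAN))
    return len(considered)

# Many uncorrelated decisions: nearly every value gets a say somewhere.
print(values_considered(200))   # close to 1000
# One big decision that settles everything: most values are left out.
print(values_considered(1))     # exactly 50
```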

Humans will have this problem when making the first AGI, and the first AGI will have this problem when making its successor, and so on. You need to make your AGI such that when it goes to self-improve (either by writing/training a successor (or itself), or by reading the sequences), it keeps everything humans like in mind, and also such that when the successor goes to do the same, it keeps this same law in mind.

One or a few shards end up significantly more powerful/​intelligent than others

We may see this if shard power/intelligence varies according to a power law, which may occur because intelligence is a conjunctive quality. Human intelligence seems to vary according to a normal distribution, but human power seems to follow a power law distribution, and human citation counts also seem to follow a power law distribution. The distribution of shard influence in humans also seems to follow a power law: you have a lot of little shards, like an impulse to eat cake, and very few big shards, like an impulse to avoid death.

In any case, if shard influence varies on a power law distribution, then likely only a small fraction of the total values that a shard-theoretic agent has will be passed on to its successor, because the shards driving the handoff don’t really care about the vast majority of the things the agent might technically value.
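To make the power-law intuition concrete, here is a toy sketch (my own illustration; the distribution and its shape parameter are assumptions, not claims about real shards): draw shard influences from a heavy-tailed Pareto distribution and compare how much of the total influence the top 1% of shards hold versus the bottom half. If only the most influential shards effectively steer what gets passed to a successor, the long tail, which is where most of the distinct things the agent cares about lives, gets almost no say.

```python
import random

random.seed(0)

N_SHARDS = 10_000

# Draw each shard's influence from a heavy-tailed (Pareto) distribution.
# The shape parameter is made up; only the qualitative picture matters.
influences = sorted((random.paretovariate(1.05) for _ in range(N_SHARDS)),
                    reverse=True)
total = sum(influences)

top_1_percent = influences[: N_SHARDS // 100]
bottom_half = influences[N_SHARDS // 2 :]

print(f"top 1% of shards hold {sum(top_1_percent) / total:.0%} of the influence")
print(f"bottom 50% of shards hold {sum(bottom_half) / total:.0%} of the influence")
```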

Hidden values in not being fully intelligent

If you suddenly made me into a superintelligence, you’d run a big risk of losing a lot of my values, because my values are intimately tied to my way of thinking. Not to the extent that you literally couldn’t make a superintelligence with my values (it is very difficult to make that kind of universal quantifier, and I also place significant probability mass on the solution to making me a superintelligence being ‘add more neurons to the brain’, and on my self-reflection being good enough that the new ways of thinking don’t destroy lots of what I once cared about). But to the extent that if someone were just trying to make me into a superintelligence, and didn’t really care about my values, they’d probably end up destroying more of my values than they would if I were an agent with an explicitly defined utility function off the bat.

Will our AGI have enough reflective thought to realize when the thoughts it’s thinking pose a risk of destroying values it has? Probably not the first one to think really dangerous thoughts, if you don’t train it for that. And navigating these waters seems difficult even if you do train it for that. So what we’d need is adequate training.

David Udell tells me this is another way of phrasing the law school problem of value drift. This seems approximately right, but notably doesn’t require in-lifetime training steps.

Other kinds of failure

There are all sorts of other failures of recursive alignment that I haven’t thought of which may actually be realized in the world. Taxonomizing these failures is a useful activity for the interested party! And I wouldn’t be surprised if the particular examples I outlined above don’t actually come into play.


  1. ↩︎

    Justis questions this claim:

    [...]this doesn’t seem generally true to me. Specifically, the big decision could be super obvious/​easy to get right (an omnipotent being comes out of a portal and says “should I destroy everything?” and you have only to say “no”, for example), or the correlated decisions could be such that only when you do them all wrong does the world end (the omnipotent being offers a poll and only if everyone votes “destroy” does it do anything, for example).

    I say the reason it’s clear you don’t destroy the world is that not destroying it leaves many alternative options open to you. If instead the entity asked whether you wanted to ban the world from ever ending, so that there always exist humans, along with various specifications of what it means for a clump of atoms to be a human, and you have some plausible reason to suspect the properties actually correspond to your concept of a human, but not quite, then the decision would be tougher.