Balancing exploration and resistance to memetic threats after AGI

I wrote this in January 2024 as a memo for the 2024 Summit on Existential Security. Will MacAskill suggested in conversation that I post it publicly, so here it is (modulo minor revisions)! Thanks to Will MacAskill, Ben Hilton, and Lionel Levine for comments.

Suppose we get an aligned superintelligence and are in a position to spread our civilization to the stars. It has been proposed that we should continue to reflect on our values, so that we don’t end up locking in a future that we wouldn’t want upon further reflection. This seems correct to me, but also seems to be difficult to do well, with a substantial probability of a catastrophic outcome as a result of a poorly-executed reflection process. This is because there are two competing goals:

  1. Thorough exploration of the space of value systems, so that most people consider a good value system at some point during the exploration process.

  2. Trap avoidance: by a trap, I mean a value system that is (1) viral: so memetically fit that merely considering it is likely to lock you into permanently adopting that value system; and (2) bad: nowhere near as good as the sorts of value systems people would adopt in a “perfect” reflection process.

(Note that some ideas are problematically viral, but not so viral that merely considering them would permanently lock you into adopting them. Such ideas could be a large source of value loss in the far future.)

See the appendix for a formalization of these goals.

Why would traps exist?

It might be counterintuitive that traps could exist. But I think that by default, they do. That’s because I think viral value systems exist, and I don’t expect a strong correlation between virality and goodness. (I probably expect a positive correlation, but not one that’s strong enough that most viral values are really good. Also, maybe the correlation turns negative at the extreme tail of virality.)

Why I think viral value systems exist

  • It is easy to imagine a value system V with the property that V “tells you” to consider no changes to your values. That is, if you fully adopt V, you will never again consider any other values.

    • In practice, this doesn’t work nowadays, because people aren’t perfect at obeying their value systems. Maybe they think that they ought not to expose themselves to other ideas, but they do so anyway. But this becomes much more of a risk given sophisticated mind control technology.

  • The fact that billions of people believe in Christianity and Islam is evidence of their virality. The spread of Christianity and Islam can be attributed in part to power imbalance (for example, Christian missionaries having much more sophisticated technology than the populations they were contacting); however, my understanding is that a significant part of the spread of these religions was due to their memetic fitness.

    • I think more research into the virality of Christianity and Islam—how viral they are, and what makes them viral—might help paint a more detailed picture. So please leave a comment if you’re knowledgeable about this!

  • It will be possible to optimize really hard for virality once we have superintelligent AI, e.g. by running simulations where you introduce people to new ideas. The space of ideas that we’ve explored so far is extremely small compared to the space of ideas that we will explore, and it would be really surprising if we’ve already stumbled on ideas that are close to the upper limit of virality. (Our defenses will also be stronger, but it’s really unclear to me which way the offense-defense balance will shake out here.)

Why I think many viral value systems will be bad

I guess I think the burden of explanation here is on the other side: why would it be the case that the vast majority of viral value systems are not bad? Looking at which ideas have been the most viral over human history, I think their track record is okay but not great.

Why these goals are in tension

To satisfy goal 1 (exploration), it is necessary to have something like a duty to listen. In other words, if another person or entity comes up with a new value system that might be substantially better than your current value system, you should consider the new value system. Without a duty to listen, it seems likely that people will end up settling on their final value system prematurely.

To satisfy goal 2 (trap avoidance), it is necessary to avoid considering traps (because a trap has the property that if you consider it, you’re likely to permanently adopt it).

In other words, you need people to consider a broad enough array of ideas that they end up listening to a really good idea at least once, while avoiding all bad + viral ideas. This is really hard, because it somehow forces you to distinguish between really good ideas and bad + viral ideas! This is extra hard because the best ideas will probably be viral! (At least to the extent that it’s generally easier to convince people of true things than of false things.)

To summarize, solving this problem likely involves distinguishing good + viral ideas from bad + viral ideas, without considering them. That sounds really hard!

Why this problem might be solvable

The problem, as I just stated it, might seem impossible: how can you distinguish ideas without considering them? There’s a bit of slipperiness about what exactly I mean by “consider”, but the relevant notion of “consideration” is: entertaining the idea sufficiently that, if it is viral, you have a large chance of “catching the virus” (adopting the idea).

Potential solution via mechanistic interpretability

I work at the Alignment Research Center, where we sometimes think about mechanistic anomaly detection (MAD): using the internals of a neural network to determine whether the network got high reward on an input for the usual reason (doing a good job) or for an anomalous reason (such as tricking the reward model into thinking it did a good job while actually doing something else; see e.g. sensor tampering).

Now, imagine a human brain as a neural net. It seems quite plausible that you could do something similar to MAD, if you had sufficiently good interpretability on human brains: you could look at cases where a human became convinced of a claim or value, and try to distinguish cases where the human became convinced of the claim “for good reasons” (e.g. the human followed an argument that seemed sound to them) versus “for bad reasons” (e.g. considering the idea triggered some sort of exploit/backdoor in the human’s brain).

So imagine that AI capabilities are strong enough to simulate humans considering new ideas. If you can look at a human’s internal state (as represented inside the AI) and determine that the human is being convinced of a claim or value “for bad reasons”, then the argument can be flagged as anomalous. Once you have such an anomaly-screening process in place, maybe you can “scan” arguments for anomalies before exposing humans to them.
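To make the shape of this concrete, here is a minimal toy sketch (this is not ARC’s actual approach to MAD; the activation-vector setup and the simple out-of-distribution test below are assumptions for illustration only). Suppose we can extract an activation vector summarizing the simulated human’s internal state while they process an argument, and we have a reference set of activations from cases where the human was convinced “for good reasons”. A crude screen is then to flag any argument whose activations look statistically unusual relative to that reference set:

```python
# Toy sketch of screening arguments for anomalous persuasion.
# NOT ARC's actual MAD method; the activation-vector interface is hypothetical.
import numpy as np

def fit_reference(reference_activations: np.ndarray):
    """Fit a Gaussian to activations from cases where the (simulated) human
    was convinced for good reasons, e.g. by following a sound argument."""
    mean = reference_activations.mean(axis=0)
    cov = np.cov(reference_activations, rowvar=False)
    cov += 1e-6 * np.eye(cov.shape[0])  # regularize so the covariance is invertible
    return mean, np.linalg.inv(cov)

def anomaly_score(activation: np.ndarray, mean: np.ndarray, inv_cov: np.ndarray) -> float:
    """Mahalanobis distance: how far this persuasion event is from the
    'convinced for good reasons' cluster."""
    diff = activation - mean
    return float(np.sqrt(diff @ inv_cov @ diff))

def screen_argument(activation: np.ndarray, mean, inv_cov, threshold: float) -> bool:
    """Flag the argument if the way it convinced the simulated human looks
    anomalous, i.e. possibly an exploit/backdoor rather than a sound argument."""
    return anomaly_score(activation, mean, inv_cov) > threshold
```

Real mechanistic anomaly detection would presumably need much richer structure than a single out-of-distribution score, but the interface is the point: a screening function that looks at how the persuasion happened, not at the content of the idea itself.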

Potential solution via clever design

You could imagine designing a reflection process that circumvents this issue, e.g. by setting up systems that make all ideas much less viral.

Potential solution via moral trade

(Thanks to Will MacAskill for this idea.) Let’s say you have 100 clever design ideas for the reflection process, but you’re not sure whether they’re going to work. You could imagine running them all in parallel in different slices of the universe. Suppose that 10% of them actually end up with good values, while 90% end up in traps. It’s possible that there are enough gains to be had from moral trade that the slices can trade, and the good values end up 90% satisfied across the whole universe.
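As a toy version of the arithmetic (the specific numbers and the trade structure below are assumptions for illustration, not part of the proposal): if each trapped slice’s values are cheap to satisfy, it might be willing to hand over most of its resources to the good-values coalition in exchange for help securing what it cares about with the remainder.

```python
# Toy arithmetic for the moral-trade idea. All numbers are illustrative assumptions.
n_slices = 100
good_slices = 10                        # slices that end up with good values (10%)
trapped_slices = n_slices - good_slices

slice_share = 1 / n_slices              # each slice starts with 1% of the universe

# Assumption: a trapped slice's values are cheap to satisfy, so it trades away
# 90% of its resources in exchange for the coalition's help in securing what it
# cares about with the remaining 10%.
fraction_traded_away = 0.90

good_values_share = (good_slices * slice_share
                     + trapped_slices * slice_share * fraction_traded_away)
print(f"Resources steered by good values after trade: {good_values_share:.0%}")  # 91%
```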

Potentially useful tool: zero-knowledge proofs or arguments

(Thanks to Will MacAskill for suggesting something along these lines.) Let’s say that Alice and Bob agree that a good value system should have property P. Alice claims that she has found a value system V that satisfies P. From Bob’s perspective, either Alice is telling the truth, or Alice is trying to spread V (which is viral) by getting Bob to consider it. Bob can ask Alice for a zero-knowledge proof or argument that V satisfies P, so that Bob can become convinced that V satisfies P without being exposed to the details of V.
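Here is a sketch of the shape of that interaction, with the actual cryptography abstracted away (the function names and data structures below are hypothetical placeholders for a real zero-knowledge proof system, which I am not specifying):

```python
# Protocol-shape sketch only: the zero-knowledge machinery itself (e.g. a proof
# system over a formalized predicate P) is hypothetical and not implemented here.
from dataclasses import dataclass

@dataclass
class Commitment:
    """A binding, hiding commitment to Alice's value system V
    (e.g. a hash of a canonical encoding of V plus a random nonce)."""
    digest: bytes

@dataclass
class Proof:
    """An opaque zero-knowledge proof/argument that the committed V satisfies P."""
    blob: bytes

def alice_commit_and_prove(value_system, predicate_p) -> tuple[Commitment, Proof]:
    """Alice commits to V and produces a proof that V satisfies P, without
    revealing V. (Hypothetical; stands in for a real proof system.)"""
    raise NotImplementedError

def bob_verify(commitment: Commitment, proof: Proof, predicate_p) -> bool:
    """Bob checks the proof against the commitment and the public predicate P.
    He never sees V itself, so he cannot 'catch' it if it is viral."""
    raise NotImplementedError
```

The key property is that Bob’s verification step only touches the commitment, the proof, and the public predicate P, so Bob is never exposed to V itself.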

Appendix: formalizing thorough exploration and trap avoidance

Let us introduce some terms:

Postulate/loose definition: We will imagine a hypothetical “perfect” reflection process (PRP). This is the sort of thing that an outside observer of our universe would look at and be like, “yup, that turned out great, nothing went wrong during reflection and humanity realized its full potential”.

Definition: A value system is good for a person P if it is at least half as good as the value system that P ends up with in the PRP. A value system is bad for a person P if it is at most 1% as good as the value system that P ends up with in the PRP. A value system V is viral for P if merely considering V causes P to permanently adopt V with >50% likelihood. A value system V is a trap for P if it is both viral for P and bad for P.

Definition: A person (or AI or other entity) is informed if they have considered (but not necessarily adopted) a value system that is good for them.

Then we can formalize the two goals as follows:

  1. Thorough exploration: a majority of people become informed in the next billion years.

  2. Trap avoidance: among the people who become informed within the next billion years, fewer than half ever consider a value system that is a trap for them.
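For concreteness, here is one way to restate these definitions and goals symbolically (the notation $u_P$ and $V^*_P$ is introduced here for illustration, and is not meant to suggest that “goodness” is precisely quantifiable):

```latex
% Symbolic restatement of the appendix definitions; u_P and V^*_P are notation
% introduced here for illustration.
Let $V^*_P$ denote the value system that person $P$ ends up with in the PRP,
and let $u_P(V)$ denote how good the value system $V$ is for $P$. Then:
\begin{align*}
  V \text{ is good for } P   &\iff u_P(V) \ge \tfrac{1}{2}\, u_P(V^*_P), \\
  V \text{ is bad for } P    &\iff u_P(V) \le \tfrac{1}{100}\, u_P(V^*_P), \\
  V \text{ is viral for } P  &\iff \Pr\bigl[P \text{ permanently adopts } V \mid P \text{ considers } V\bigr] > \tfrac{1}{2}, \\
  V \text{ is a trap for } P &\iff V \text{ is viral for } P \text{ and bad for } P.
\end{align*}
Goal 1 (thorough exploration): more than half of all people become informed
within the next $10^9$ years. Goal 2 (trap avoidance): among those informed
people, fewer than half ever consider a value system that is a trap for them.
```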