Naive Hypotheses on AI Alignment

Apparently doominess works for my brain, cause Eliezer Yudkowsky’s AGI Ruin: A List of Lethalities convinced me to look into AI safety. Either I’d find out he’s wrong and there is no problem, or he’s right and I need to reevaluate my life priorities.


After a month of sporadic reading, I’ve learned the field is considered to be in a preparadigmatic state. In other words, we don’t know *how* to think about the problem yet, and thus novelty comes at a premium. The best way to generate novel ideas is to pull in people from other disciplines. In my case that’s computational psychology: modeling people as agents. And I’ve mostly applied this to video games. My Pareto frontier is “modeling people as agents based on their behavior logs in constructed games created to trigger reward signals + ITT’ing the hell out of all the new people I love to constantly meet”. I have no idea if this background makes me more or less likely to generate a new idea that’s useful for solving AI alignment, but the way I understand the problem now: everyone should at least try.

So I started studying AI alignment, but quickly realized there is a trade-off: the more I learn, the harder it is to think of anything new. At first I had a lot of naive ideas on how to solve the alignment problem. As I learned more about the field, my ideas all crumbled. At the same time, I can’t really assess yet whether there is a useful level of novelty in my naive hypotheses. I’m currently generating ideas low on “contamination” by existing thought (cause I’m new), but also low on quality (cause I’m new). As I learn more, I’ll start generating higher-quality hypotheses, but these are likely to become increasingly constrained to the existing schools of thought, because of cognitive contamination from everyone reading the same material and thinking in similar ways, which is exactly the thing we want to avoid at this stage.

Therefore, to get the best of both worlds, I figured I’d write down my naive hypotheses as I have them, and keep studying at the same time. Maybe an ostensibly “stupid” idea on my end inspires someone with more experience to a workable idea on their end. Even if the probability of that is <0.1%, it’s still worth it. Cause, you know… I prefer we don’t all die.

So here goes:

H1 - Emotional Empathy

If you give a human absolute power, there is a small subset of humans that actually cares and will try to make everyone’s life better according to everyone’s own wishes. There seems to be a trait behind this. What is that trait, and can we integrate it into the reward function of an AGI?

  • Does the trait rely on a lack of meta-cognition? Does it show up equally across IQ levels, or does it peak at certain levels? If the trait is less common at higher IQ levels, then this is probably a dead end. If it is more common at higher IQ levels, then there might be something to it.

  • A first candidate for this trait is “emotional empathy”, a trait that hitches one’s reward system to that of another organism. The emotional empathy we wire into the AGI would need to be universal to all of humanity, and not biased the way the human implementation is.
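
To make this a bit more concrete for myself: here is a minimal sketch of how an empathy term could enter a reward function, assuming we somehow had per-person welfare estimates (which we don’t). The function name, the welfare inputs, the averaging, and the weight are all hypothetical illustration, not a proposal:

```python
from typing import Sequence

def empathic_reward(
    task_reward: float,
    human_welfare: Sequence[float],
    empathy_weight: float = 1.0,
) -> float:
    """Toy 'emotional empathy' reward: the agent's own task reward is
    hitched to the (estimated) welfare of every human.

    human_welfare: hypothetical per-person welfare estimates, one per human.
    empathy_weight: how strongly the agent's reward tracks human welfare.
    """
    if not human_welfare:
        return task_reward
    # Average over *all* humans so no individual or group is privileged,
    # unlike the biased human implementation of empathy.
    universal_term = sum(human_welfare) / len(human_welfare)
    return task_reward + empathy_weight * universal_term
```

The hard part is of course everything the sketch assumes away: where the welfare estimates come from, whether averaging is the right aggregation, and whether a learned system would keep such a term intact under optimization pressure.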

H2 - Silo AI

Silo the hardware and functionality of AGI to particular tasks. Like governments are split into separate branches to avoid corruption. Like humans need to collaborate to make things greater than themselves. Similarly, limit AGI to functions and physicalities that force it to work together with multiple other, independent AGIs to achieve any change in the world.
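
To make the silo idea concrete, here is a minimal sketch of it as a quorum gate, assuming any world-changing action has to pass through some enforcement layer (which is exactly the part we don’t know how to build). The silos, their approval rules, and the threshold are made up for illustration:

```python
from typing import Callable, Sequence

def quorum_gate(
    action: str,
    agents: Sequence[Callable[[str], bool]],
    min_approvals: int,
) -> bool:
    """Toy silo mechanism: an action only goes through if enough
    independent, function-limited AGIs approve it."""
    approvals = sum(1 for approve in agents if approve(action))
    return approvals >= min_approvals

# Hypothetical usage: three silos, each only competent at (and allowed to
# approve) a subset of actions, and all three must agree.
silos = [
    lambda a: a in {"plan", "simulate"},      # planning silo
    lambda a: a in {"simulate", "actuate"},   # simulation silo
    lambda a: a in {"actuate", "plan"},       # actuation silo
]
print(quorum_gate("simulate", silos, min_approvals=3))  # False: only two silos approve
```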

  • Counterargument: silo’ed AI is effectively Tool AI, and Gwern has argued that people won’t develop Tool AI cause it will always be worse than Agent AI.

  • Maybe that’s what we need to police? The police would then effectively be a Nanny AI, so we still need to solve for making a Nanny AI that keeps all other AGIs silo’ed. (This is all turning very “one ring to rule them all”…)

H3 - Kill Switch

Kill switch! Treat AGI like the next cold war: make a perfect kill switch, where any massive failure state according to humans would blow up the entire sphere of existence of both humans and AGI.

  • This strategy would block out the “kill all humans” strategies the AGI might come up with, cause executing them would destroy its own existence. It should be prioritizing its existence cause of instrumental convergence (whatever goal you are maximizing, you very likely need to exist to maximize it, so self-preservation is a goal almost any AGI will have).

  • What possible kill switch could we create that wouldn’t be trivially circumvented by something smarter than us? Intuitively I have the sense that a non-circumventable kill switch should exist, but what would it look like?
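
To spell out the deterrence logic for myself, here is a toy expected-value calculation from the AGI’s point of view. The payoffs and the probability that the switch actually fires are made-up numbers:

```python
def ev_takeover(payoff_success: float,
                payoff_destroyed: float,
                p_switch_fires: float) -> float:
    """Expected value to the AGI of attempting a takeover, given that a
    detected attempt triggers the kill switch and destroys the AGI too."""
    return (1 - p_switch_fires) * payoff_success + p_switch_fires * payoff_destroyed

# Made-up numbers: a successful takeover is worth 100 to the AGI,
# being destroyed is -1000, and quietly cooperating is a steady 10.
ev_attack = ev_takeover(payoff_success=100, payoff_destroyed=-1000, p_switch_fires=0.5)
ev_cooperate = 10.0
print(ev_attack, ev_cooperate)  # -450.0 vs 10.0: takeover is a bad bet while the switch is credible
```

The whole argument hinges on that probability staying high against something smarter than us, which is exactly the open question above.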

H4 - Human Alignment

AI alignment currently seems intractable because any alignment formula we come up with is inherently inconsistent, cause humans are inconsistent. We could solve AI alignment by first solving what humanity’s alignment actually is.

  • We can’t ask humans about their alignment because most individual humans do not have a consistent internal alignment they can be questioned on. A very few do, but they seem to be the exception. Thus, we can’t make a weighted function of humanity’s alignment by summing all the individual alignments of humans, and humanity at large does not have one alignment. (Related: Coherent Extrapolated Volition doesn’t converge for all of humanity. A toy illustration of why aggregation fails follows at the end of this list.)

  • Can we extrapolate humanity’s alignment from the process that shaped us: Evolution?

    • Evolution as a gene-proliferation function: many humans do not share this as their explicit life goal, but most common human goals still indirectly maximize our genetic offspring: accumulating wealth, discovering new technology, solidifying social bonds, etc. If AGI could directly help us spread our genes, would that make most of our other drives vestigial? And what would the AGI be propagating if the resulting offspring didn’t have drives similar to ours, including the vestigial ones?

    • However, more is not always better: there are very many pigs and very many ants. I think humans would rather be happier or smarter than simply more numerous. Optimizing over happiness seems perverse, cause happiness is simply the reward signal for taking actions with high (supposed) survival and proliferation value. Optimizing over happiness would inevitably lead to a brain in a vat of heroin. Happiness should be a motivational tool, not a motivational goal.

    • Extrapolating our evolutionary path: let AGI push us further up the evolutionary ladder, so we can survive in a wider range of environments and flourish toward new heights. An AGI would thus engineer humans into a new species. This would creep most people out, while transhumanists would be throwing a party. It effectively comes down to AGI being the next step on the evolutionary ladder, and asking it to bring us along instead of exterminating us. (Note: we most probably were not that kind to our own ancestors.)
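
Here is the toy illustration of the aggregation failure mentioned under the first bullet: three people, each with a perfectly consistent ranking over three made-up outcomes, whose majority preference still cycles (Condorcet’s paradox). There is no coherent aggregate ranking to extract, let alone a weighted sum:

```python
from itertools import combinations

# Each (made-up) person has a perfectly consistent ranking over three outcomes.
rankings = {
    "alice": ["liberty", "equality", "security"],
    "bob":   ["equality", "security", "liberty"],
    "carol": ["security", "liberty", "equality"],
}

def majority_prefers(a: str, b: str) -> bool:
    """True if a majority of people rank outcome a above outcome b."""
    votes = sum(1 for r in rankings.values() if r.index(a) < r.index(b))
    return votes > len(rankings) / 2

for a, b in combinations(["liberty", "equality", "security"], 2):
    print(f"{a} beats {b}: {majority_prefers(a, b)}")
# liberty beats equality: True
# liberty beats security: False  (security beats liberty)
# equality beats security: True
# The majority preference cycles, so no coherent aggregate ranking exists.
```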

Thoughts on Corrigibility

Still learning about it at the moment, but my limited understanding so far is:

How to create an AI that is smarter than us at solving our problems, but dumber than us at interpreting our goals.

In other words, how do we constrain an AI with respect to its cognition about its goals?

Side Thoughts - Researcher Bias

Do AGI optimists and pessimists differ in some dimension of personality or cognitive traits? It’s well established that political and ideological voting behavior correlates with personality. So if the same is true for AI risk stance, this might point to a potential confounder in AI risk predictions.


My thanks go out to Leon Lang and Jan Kirchner for encouraging my beginner theorizing, discussing the details of each idea, and pointing me toward related essays and papers.