Too Many Metaphors: A Case for Plain Talk in AI Safety
I changed the title of this post from “Yudkowsky does not do alignment justice” to “Too Many Metaphors: A Case for Plain Talk in AI Safety”. In hindsight, the original title was too sensationalist and not very clear. Yudkowsky has done more for the field than most, and the purpose of this post is not to dismiss his work, but to share some observations and ideas about how the field’s relevance might be communicated more effectively.
For the past few months, I have been immersing myself in the AI safety debate here on LessWrong as part of a research project. I was already somewhat familiar with alignment: I had read Stuart Russell and listened to a few podcasts with Paul Christiano and Eliezer Yudkowsky. I started with AGI Ruin: A List of Lethalities, which prompted me to familiarise myself with the Orthogonality Thesis and Instrumental Convergence. Yudkowsky’s posts on these two topics made immediate sense, and the risk posed by sufficiently intelligent systems that are not aligned seemed to follow directly from them.
At a high level, the issue can be framed simply: the utility functions our systems actually optimise do not encode values under which humans are treated as morally valuable.
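To make that framing concrete for readers who prefer plain examples over metaphors, here is a minimal toy sketch of my own (the names, numbers, and the random-search “optimiser” are all assumptions for illustration, not anything taken from Yudkowsky’s writing): a trivial optimiser maximising a proxy objective that simply has no term for human welfare.

import numpy as np

rng = np.random.default_rng(0)

def proxy_objective(plan):
    # Reward only "resources acquired" minus a small energy cost;
    # human welfare never appears in this objective at all.
    resources_acquired, energy_spent = plan
    return resources_acquired - 0.1 * energy_spent

def human_welfare(plan):
    # A quantity the optimiser never sees: aggressive resource grabs reduce it.
    resources_acquired, _ = plan
    return 10.0 - resources_acquired

# A naive random-search "optimiser" over candidate plans.
candidates = rng.uniform(0, 100, size=(10_000, 2))
best = max(candidates, key=proxy_objective)

print("best plan by proxy objective:", best)
print("proxy score:", round(proxy_objective(best), 1))
print("human welfare under that plan:", round(human_welfare(best), 1))
# The optimiser reliably picks the most resource-hungry plan it sampled,
# because the value we omitted carries zero weight in what it maximises.

The point is not realism; it is that an omitted value gets exactly zero weight in whatever the system maximises, which is the basic specification problem stated in one paragraph and a dozen lines of code.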
From this point, assuming exponential growth in AI capabilities, one can imagine a multitude of scenarios in which a system optimises towards some goal (e.g., “build a whole brain emulator”) and humanity becomes a casualty of the optimisation process. I see this risk, and I see other, more immediate risks, such as open-weights models whose alignment measures have been fine-tuned away becoming accessible to bad actors. However, I believe Yudkowsky’s current communication approach may be counterproductive when talking about alignment, especially in formats where people outside the field are being introduced to it.
For instance, I recently listened to a debate between Stephen Wolfram and Yudkowsky in which the majority of the discussion circled around defining positions. To be fair, over those four hours Yudkowsky did get some major points across; however, much of the content was obscured by metaphors. Metaphors can be valuable, but when talking about AI safety, I think the most effective way to explain the risk is to stick to the basics. In this particular podcast, many anthropomorphic comparisons were drawn between AGI/ASI and the European settlers of America. I see the comparison: the settlers had goals that were (to some degree) incompatible with the Native Americans’ existence; they wanted “gold”, and Native American casualties were part of the path to it. But when you already have what amount to knock-down arguments in orthogonality and instrumental convergence, you should stick to them.
For my project, I have been working with group members who have not immersed themselves in the AI safety debate to the same extent I have, and they often use these anthropomorphic comparisons (and similarly futuristic scenarios, e.g., nanobots) as reasons to dismiss the position people like Yudkowsky hold. I think this is uncharitable, but it seems to be the reality. The goal of communicating the risks of developing AI systems should be both to draw attention AND to convince newcomers that the topic is worth their time.
To finish, I must say that Eliezer Yudkowsky has made invaluable contributions to the AI safety discussion, and I am grateful for his work; without it, the topic would probably not be on my mind. But communicating these ideas should be simpler. Abstract scenarios should not be the gold standard. The space of possible risk scenarios is nearly endless, and trying to represent it with abstract examples like nanobot doom or paperclip maximisers does not convince people, because they cannot emotionally connect with those scenarios.
To convince newcomers, the discussion should shift towards the concrete, not the abstract.