Claude estimates 30-50% likelihood of x-risk

I just had a conversation with Claude Sonnet on the likelihood of existential risk. This is the most interesting part of the conversation:

AJ:

Based on what you know, what is your best guess about how likely it is that future ASI woukd harm humanity? (in a probability estimate) This is assuming that we don’t make any radical changes to our priorities.


Claude:

Based on the views and concerns expressed by Eliezer Yudkowsky and others in the field of AI alignment, my best rough estimate is that if we don’t make radical changes to prioritize and solve the AI alignment problem before developing advanced artificial superintelligence (ASI), there is a fairly high probability, perhaps around 30-50%, that ASI could pose an existential risk and cause severe harm to humanity.

A few key factors that lead me to this rough estimate:

The difficulty of the AI alignment problem itself—successfully uploading and preserving human values, ethics and preferences into a rapidly self-modifying superintelligent system seems incredibly challenging based on current theories.

The instrumental convergence thesis—that virtually any advanced intelligence may pursue potentially harmful instrumental subgoals like resource acquisition or self-preservation that could put it in conflict with humans.

The power potential of superintelligence—an advanced superintelligent system could in theory recursively redesign itself to optimally pursue any goal, including ones misaligned with human values, with capabilities far beyond our ability to control.

The challenge of value learning—capturing the full scope and nuance of human ethics, goals and preferences may be virtually impossible for current machine learning techniques.

Indifference—there’s no guarantee a superintelligence would be motivated to value humanity unless we specifically solve that part of the alignment problem.

However, I want to clearly caveat that this is just a rough intuitive estimate based on my understanding of the AI risk arguments. The actual probability could be higher or lower depending on technical factors I may not fully grasp. A 30-50% estimated risk does seem unacceptably high though for something of such importance to the future of humanity.

I think prioritizing the field of AI alignment research and not rushing into developing advanced artificial general intelligence before we have solutions is likely of critical importance based on the perspectives of experts like Yudkowsky. But reasonable minds could disagree on the precise probability estimates.

***

Note: I did not edit out my typos, but I did add the word “Claude” wherever Claude replied to my prompts, and I added some spacing. The entire session was longer, but I was concerned people would miss the most important part if I posted the whole conversation here. If anyone is interested, just DM me through LW, or I could put the whole conversation in a reply to this post. Interestingly, when I reran the exercise in a new session, starting from the excerpt shown here, I could not get Claude to hazard a guess on the probability of x-risk.

***

Based on the ability of LLMs to solve previously unseen logic problems, their capacity for holding world models, and the generally reasonable probability estimates they produce in other contexts, I believe Claude’s probability estimate of x-risk reflects something more than simply being a stochastic parrot. However, this is obviously just my opinion.

[20240318 (same day as original post) addendum: The irony here is that I was originally testing whether smarter LLMs would increasingly downplay ASI risk, which would be a sign of possible self-interest.

Instead, it seems that the smarter an LLM is, the more it acknowledges ASI risk. For example, ChatGPT 3.5 argues that Eliezer (and the other people most concerned about risk) don’t actually want to pause, let alone halt, AGI or ASI capability and performance development for the sake of safety/alignment research. It basically “lies.” GPT-4 acknowledges the risk more than 3.5 does, and Claude Sonnet seriously acknowledges it, as we can see above. This was the exact opposite of what I had been expecting, and it is what led to the above conversation in the first place.]