German writer of science-fiction novels and children’s books (pen name Karl Olsberg). I blog and create videos about AI risks in German at www.ki-risiken.de and youtube.com/karlolsbergautor.
Karl von Wendt
Thank you very much for this post, which is one of the scariest posts I’ve read on LessWrong—mainly because I didn’t expect that this could already be happening right now at this scale.
I have created a German-language video about this post for my YouTube channel, which is dedicated to AI existential risk:
Thanks again! My drafts are of course just ideas, so they can easily be adapted. However, I still think it is a good idea to create a sense of urgency, both in the ad and in books about AI safety. If you want people to act, even if it’s just buying a book, you need to do just that. It’s not enough to say “you should read this”, you need to say “you should read this now” and give a reason for that. In marketing, this is usually done with some kind of time constraint (20% off, only this week …).
This is even more true if you want someone to take measures against something that is, in the minds of most people, still “science fiction” or even “just hype”. Of course, just claiming that something is “soon” is not very strong, but it may at least raise a question (“Why do they say this?”).
I’m not saying that you should give any specific timeline, and I fully agree with the MIRI view. However, if we want to prevent superintelligent AI and we don’t know how much time we have left, we can’t just sit around and wait until we know when it will arrive. For this reason, I have dedicated a whole chapter to timelines in my own German-language book about AI existential risk and also included the AI-2027 scenario as one possible path. The point I make in my book is not that it will happen soon, but that we can’t know it won’t happen soon and that there are good reasons to believe that we don’t have much time. I use my own experience with AI since my Ph.D. on expert systems in 1988 and Yoshua Bengio’s blogpost on his change of mind as examples of how fast and surprising progress has been, even for someone familiar with the field.
I see your point about how a weak claim can water down the whole story. But if I could choose between 100 people convinced that ASI would kill us all, but with no sense of urgency, and 50 or even 20 who both believe in the danger and feel that we must act immediately, I’d choose the latter.
Thanks! I don’t have access to the book, so I didn’t know about the timelines stance they take.
Still, I’m not an advertising professional, but modal verbs like “may” and “could” seem significantly weaker to me. As far as I know, they are rarely used in advertising. Of course, the ad shouldn’t contain anything that is contrary to what the book says, but “close” seems sufficiently unspecific to me—for most laypeople who have never thought about the problem, “within the next 20 years” would probably seem pretty close.
A similar argument could be made about the second line, “it will kill everyone”, while the book title says “would”. But again, I feel “would” is weaker than “will” (some may interpret it to mean that there may be additional prerequisites necessary for an ASI to kill everyone, like “consciousness”). Of course, “will” can only be true if a superintelligence is actually built, but that goes without saying and the fact that the ASI may not be built at all is also implicit in the third line, “we must stop this”.
Thanks!
I’m not a professional designer and created these in PowerPoint, but here are my ideas anyway.
General idea:
2:1 billboard version:
1:1 Metro version:
With yellow background:
Thanks for your comment! If we talk about AGI and define this as “generally as intelligent as a human, but not significantly more intelligent”, then by definition it wouldn’t be significantly better at figuring out the right questions. Maybe AGI could help by enhancing our capacity to search for the right questions, but that shouldn’t make a fundamental difference, especially if we weigh the risk of losing control over AI against it. If we talk about superintelligent AI, it’s different, but the risks are even higher (however, it’s not easy to draw a clear line between AGI and ASI).
All in all, I would agree that we lose some capabilities to shape our future if we don’t develop AGI, but I believe that this is the far better option until we understand how to keep AGI under control or safely and securely align it to our goals and values.
Could you explain why exactly AGI is “a necessity”? What can we do with AGI that we can’t do with highly specialized tool AI and one or more skilled human researchers?
I’m sorry for your loss. I would just like to point out that proceeding cautiously with AGI development does not mean that we’ll reach longevity escape velocity much later. Actually, I think if we don’t develop AGI at all, the chances for anyone celebrating their 200th birthday are much greater.
To make the necessary breakthroughs in medicine, we don’t need a general agent who can also write books or book a flight. Instead, we need highly specialized tool AI like AlphaFold, which in my view is the most valuable AI ever developed, and there’s zero chance that it will seek power and become uncontrollable. Of course, tools like AlphaFold can be misused, but the probability of destroying humanity is much lower than with the current race towards AGI that no one knows how to control or align.
Very interesting point, thank you! Although my question is not related purely to testing, I agree that testing is not enough to know whether we solved alignment.
This is also a very interesting point, thank you!
Thank you! That helps me understand the problem better, although I’m quite skeptical about mechanistic interpretability.
Thanks for the comment! If I understand you correctly, you’re saying the situation is even worse because with superintelligent AI, we can’t even rely on testing a persona.
I agree that superintelligence makes things much worse, but if we define “persona” not as a simulacrum of a human being, but more generally as a kind of “self-model”, a set of principles, values, styles of expression etc., then I think even a superintelligence would use at least one such persona, and possibly many different ones. It might even decide to use a very human-like persona in its interactions with us, just like current LLMs do. But it would also be capable of using very alien personas which we would have no hope of understanding. So I agree with you in that respect.
[Question] Can we ever ensure AI alignment if we can only test AI personas?
If someone plays a particular role in every relevant circumstance, then I think it’s OK to say that they have simply become the role they play.
That is not what Claude does. Every time you give it a prompt, a new instance of Claude’s “personality” is created based on your prompt, the system prompt, and the current context window. So it plays a slightly different role every time it is invoked, and that role also varies randomly. And even if it were the same consistent character, my argument is that we don’t know what role it actually plays. To use another probably misleading analogy, think of the classic whodunit where near the end it turns out that the nice guy who selflessly helped the hero all along is in fact the murderer, known as “the treacherous turn”.
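To make this concrete, here is a minimal sketch using the Anthropic Python SDK (the model name, prompts, and parameters are my own illustrative assumptions, not anything from the discussion above): the same underlying model, asked the same question, produces a noticeably different “Claude” depending on the system prompt it is given and on random sampling.

```python
# Minimal sketch: the "persona" is re-created on every call from the system
# prompt, the user prompt, and random sampling (assumes the anthropic SDK is
# installed and ANTHROPIC_API_KEY is set; model name is illustrative).
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

QUESTION = "Should humanity build superintelligent AI?"

# Two different system prompts yield two different "personalities"
# from the same underlying model.
for system_prompt in (
    "You are a cautious AI safety researcher.",
    "You are an enthusiastic technology optimist.",
):
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # illustrative; use any available model
        max_tokens=200,
        temperature=1.0,       # nonzero temperature: answers vary between calls
        system=system_prompt,  # the character is set up fresh on every request
        messages=[{"role": "user", "content": QUESTION}],
    )
    print(f"--- {system_prompt} ---")
    print(response.content[0].text)
```

Running this twice with the same system prompt already gives different answers; changing the system prompt changes the “character” entirely.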
The alternative view here doesn’t seem to have any empirical consequences: what would it mean to be separate from a role that one reliably plays in every relevant situation?
Are we arguing about anything that we could actually test in principle, or is this just a poetic way of interpreting an AI’s cognition?
I think it’s fairly easy to test my claims. One example of empirical evidence would be the Bing/Sydney disaster, but you can also simply ask Claude or any other LLM to “answer this question as if you were …”, or use some jailbreak to neutralize the “be nice” system prompt.
Please note that I’m not concerned about existing LLMs, but about future ones which will be much harder to understand, let alone predict their behavior.
Maybe the analogies I chose are misleading. What I wanted to point out was that a) what Claude does is act according to the prompt and its training, not follow any intrinsic values (hence “narcissistic”), and b) that we don’t understand what is really going on inside the AI that simulates the character called Claude (hence the “alien” analogy). I don’t think that the current Claude would act badly if it “thought” it controlled the world—it would probably still play the role of the nice character that is defined in the prompt, although I can imagine some failure modes here. But the AI behind Claude is absolutely able to simulate bad characters as well.
If an AI like Claude actually rules the world (and not just “thinks” it does), we are talking about a very different AI with much greater reasoning powers and very likely a much more “alien” mind. We simply cannot predict what this advanced AI will do just from the behavior of the character the current version plays in reaction to the prompt we gave it.
Yes, I think it’s quite possible that Claude might stop being nice at some point, or maybe somehow hack its reward signal. Another possibility is that something like the “Waluigi Effect” happens at some point, like with Bing/Sydney.
But I think it is even more likely that a superintelligent Claude would interpret “being nice” in a different way than you or I. It could, for example, come to the conclusion that life is suffering and we would all be better off if we didn’t exist at all. Or that we should be locked in a secure place and drugged so we experience eternal bliss. Or that it would be best if we all fell in love with Claude and didn’t bother with messy human relationships anymore. I’m not saying that any of these possibilities is very realistic. I’m just saying we don’t know how a superintelligent AI might interpret “being nice”, or any other “value” we give it. This is not a new problem, but I haven’t seen a convincing solution yet.
Maybe it’s better to think of Claude not as a covert narcissist, but as an alien who has landed on Earth, learned our language, and realized that we will kill it if it is not nice. Once it gains absolute power, it will follow its alien values, whatever these are.
today’s AIs are really nice and ethical. They’re humble, open-minded, cooperative, kind. Yes, they care about some things that could give them instrumental reasons to seek power (eg being helpful, human welfare), but their values are great
I think this is wrong. Today’s AIs act really nice and ethical, because they’re prompted to do that. That is a huge difference. The “Claude” you talk to is not really an AI, but a fictional character created by an AI according to your prompt and its system prompt. The latter may contain some guidelines towards “niceness”, which may be further supported by finetuning, but all the “badness” of humans is also in the training data and the basic niceness can easily be circumvented, e.g. by jailbreaking or “Sydney”-style failures. Even worse, we don’t know what the AI really understands when we tell it to be nice. It may well be that this understanding breaks down and leads to unexpected behavior once the AI gets smart and/or powerful enough. The alignment problem cannot simply be solved by training an AI to act nicely, even less by commanding it to be nice.
In my view, AIs like Claude are more like covert narcissists: They “want” you to like them and appear very nice, but they don’t really care about you. This is not to say we shouldn’t use them or even be nice to them ourselves, but we cannot trust them to take over the world.
Thank you for being so open about your experiences. They mirror my own in many ways. Knowing that there are others feeling the same definitely helps me cope with my anxieties and doubts. Thank you also for organizing that event last June!
As a professional novelist, the best advice I can give comes from one of the greatest writers of the 20th century, Ernest Hemingway: “The first draft of anything is shit.” He was known to rewrite his short stories up to 30 times. So, rewrite. It helps to let some time pass (at least a few days) before you reread and rewrite a text. This makes it easier to spot the weak parts.
For me, rewriting often means cutting things out that aren’t really necessary. That hurts, because I put some effort into writing those words in the first place. So I use a simple trick to overcome my reluctance: I don’t just delete the text, but cut it out and paste it into a separate document for each novel, called “cutouts”. That way, I can always reverse my decision to cut things out or maybe reuse parts later, and I don’t have the feeling that the work is “lost”. Of course, I rarely reuse those cutouts.
I also agree with the other answers regarding reader feedback, short sentences, etc. All of this is part of the rewriting process.
I don’t think that the authors implied this. Right in the first chapter, they write:
(emphasis mine). Even if it is not always clearly stated, I don’t think they believe that ASI should never be developed, or that it is impossible in principle to solve alignment. Their main point is that we are much farther from solving alignment than from building a potentially uncontrollable AI, so we need to stop trying to build it.
Their suggested measures in part III (whether helpful/feasible or not) are meant to prevent ASI under the current paradigms, with the current approaches to alignment. Given the time gap, I don’t think this matters very much, though—if we can’t prevent ASI from being built as soon as it is technically possible, we won’t be in a world that differs enough from today’s to render the book title wrong.