German writer of science-fiction novels and children’s books (pen name Karl Olsberg). I blog and create videos about AI risks in German at www.ki-risiken.de and youtube.com/karlolsbergautor.
Karl von Wendt
I agree with most of this: red lines that aren’t respected are useless. However, in my opinion, giving up on drawing red lines doesn’t solve any problems either; ignoring them or redrawing them is itself a signal. But I agree that what we need most is enforcement.
First of all, thank you for mentioning my post—I feel honored to serve as an example in this case! But to be clear, at the time I did not intend to define any specific red lines. I was just asking how we could decide when to stop development if we needed to.
I’m not sure whether you’re arguing against using red lines in general, or just want to point out that so far we haven’t broadly agreed on any, and that all talk of self-restraint by the industry has been just lip service (to which I agree). In any case, I’m still convinced that we need to define red lines for AI development that we must not cross. The fact that this hasn’t worked so far is absolutely no proof that such an approach is useless. It only proves that we need to do more to define, argue about, and agree upon such red lines.
Red lines are probably the most important concept in human civilization. From the Ten Commandments to tax law, by defining what we are not allowed to do, they form the foundation of our rules for how we deal with each other. Arguing that because red lines for AI haven’t worked so far, we shouldn’t even try to define them is like saying that because someone got murdered, criminal law is unnecessary.
If we assume that there is a “point of no return”, maybe a certain combination of generality and intelligence in an AI that leads to it becoming uncontrollable, and we haven’t solved alignment, then the only way to avoid an existential catastrophe is not to build it. Even if you think that alignment is in fact solved (or not really a problem), we should care about where this point of no return lies, so we know at what point we really need to be sure that you’re right about that. (It should also be clear who can decide this; the current practice of private companies gambling with the future of mankind for personal gain clearly violates the Universal Declaration of Human Rights in my view.)
It may be difficult to define this point exactly. But that only makes it more important to draw red lines as quickly as possible, so we don’t accidentally stumble into an existential catastrophe. And by “red lines” I don’t mean “alarm signals that trigger a halt to development if detected” but specific rules for the decisions AI developers can make, e.g. how much training compute is allowed, what kinds of safety tests are required, etc. This is no doubt a huge challenge, but that is no argument against trying to solve it. Saying “this is impossible” is just a self-fulfilling prophecy.
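To make the distinction concrete, a red line in this sense can be written down as a checkable rule that constrains a developer’s decision before the fact, rather than an alarm that fires afterwards. Here is a minimal sketch; the function name, thresholds, and test names are purely illustrative assumptions, not anything proposed in the discussion above.

```python
# Hypothetical sketch of a red line as a machine-checkable rule.
# All names and numbers below are illustrative assumptions only.

MAX_TRAINING_FLOP = 1e26  # assumed compute ceiling for a single training run
REQUIRED_SAFETY_TESTS = {"dangerous-capability-eval", "autonomy-eval"}

def run_permitted(training_flop: float, passed_tests: set[str]) -> bool:
    """A training run may proceed only if it stays under the compute
    ceiling AND every required safety test has already been passed."""
    under_ceiling = training_flop <= MAX_TRAINING_FLOP
    tests_passed = REQUIRED_SAFETY_TESTS <= passed_tests  # subset check
    return under_ceiling and tests_passed

# A run under the ceiling with all tests passed is allowed;
# exceeding the ceiling or skipping a test crosses the red line.
print(run_permitted(5e25, {"dangerous-capability-eval", "autonomy-eval"}))  # True
print(run_permitted(2e26, {"dangerous-capability-eval", "autonomy-eval"}))  # False
```

The point of the sketch is only that such a rule is evaluated before the decision is made, so “crossing the red line” is a well-defined act of the developer, not a downstream event.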
Central planners set targets in tons of nails produced or number of shoes, and factories duly maximized those numbers by making a few huge, unusable nails … that nominally met the plan.
This sounded improbable to me, and indeed seems wrong: https://skeptics.stackexchange.com/questions/22375/did-a-soviet-nail-factory-produce-useless-nails-to-improve-metrics Apparently, the “huge nails” originally appeared in a 1954 cartoon in the satirical magazine Krokodil and were later turned into an urban legend.
Yes, I agree, although I don’t believe that chain-of-thought (CoT) reading will carry us very far into the future: it is already pretty unreliable, and using it for optimization would ruin it completely.
Alignment is difficult, that is my whole point, with an emphasis on the hypothesis that with any purely reinforcement-learning-based approach, it is virtually impossible. If we could find a way to create some kind of genuine “care about humans” module within an AI, similar to the kind of parent-child altruism that I write about, we might have a chance. But the problem is that no one knows how to do that, and even in humans it is quite a fragile mechanism.
One additional thought: evolution has created the parent-child care mechanism through some kind of reinforcement learning, but it optimizes for a different objective than our current AI training process: not any direct evaluation of human behavior, but survival and reproduction. Maybe the evolution of spiral personas is closer to the way evolution works. But of course, in this case AI is a different species, a parasite, and we are the hosts.
I’m not a machine learning expert, so I’m not sure what exactly causes sycophancy. I don’t see it as a central problem of alignment; it is just a symptom of a deeper problem to me.
My point is more general: to achieve true alignment in the sense of an AI doing what it thinks is “best for us”, it is not sufficient to train it by rewarding behavior. Even if the AI is not sycophantic, it will pursue some goal that we trained into it, so to speak, and that goal will most likely not be what we would have wanted it to be in hindsight.
Contrast that with the way I behave towards my sons: I have no idea what their goals are, so I can’t say that my goals are “aligned” with theirs in any strict sense. Instead, I care about them: their wellbeing, but also their independence from me and their ability to find their own way to a good life. I don’t think this kind of intrinsic motivation can be “trained” into an AI with any kind of reinforcement learning.
What can we learn from parent-child-alignment for AI?
Tricky hypothesis 2: But the differences between the world of today and the world where ASI will be developed don’t matter for the prognosis.
I don’t think that the authors implied this. Right in the first chapter, they write:
If any company or group, anywhere on the planet, builds an artificial superintelligence using anything remotely like current techniques, based on anything remotely like the present understanding of AI, then everyone, everywhere on Earth, will die.
(Emphasis mine.) Even if it is not always clearly stated, I think they don’t believe that ASI should never be developed, or that it is impossible in principle to solve alignment. Their central claim is that we are much farther from solving alignment than from building a potentially uncontrollable AI, so we need to stop trying to build it.
Their suggested measures in part III (whether helpful/feasible or not) are meant to prevent ASI under the current paradigms, with the current approaches to alignment. Given the time gap, I don’t think this matters very much, though—if we can’t prevent ASI from being built as soon as it is technically possible, we won’t be in a world that differs enough from today’s to render the book title wrong.
Thank you very much for this post, which is one of the scariest posts I’ve read on LessWrong, mainly because I didn’t expect that this could already be happening at this scale.
I have created a German-language video about this post for my YouTube channel, which is dedicated to AI existential risk:
Thanks again! My drafts are of course just ideas, so they can easily be adapted. However, I still think it is a good idea to create a sense of urgency, both in the ad and in books about AI safety. If you want people to act, even if it’s just buying a book, you need to do just that. It’s not enough to say “you should read this”, you need to say “you should read this now” and give a reason for that. In marketing, this is usually done with some kind of time constraint (20% off, only this week …).
This is even more true if you want someone to take measures against something that is in the mind of most people still “science fiction” or even “just hype”. Of course, just claiming that something is “soon” is not very strong, but it may at least raise a question (“Why do they say this?”).
I’m not saying that you should give any specific timeline, and I fully agree with the MIRI view. However, if we want to prevent superintelligent AI and we don’t know how much time we have left, we can’t just sit around and wait until we know when it will arrive. For this reason, I have dedicated a whole chapter to timelines in my own German-language book about AI existential risk and also included the AI-2027 scenario as one possible path. The point I make in my book is not that it will happen soon, but that we can’t know it won’t happen soon, and that there are good reasons to believe that we don’t have much time. I use my own experience with AI since my Ph.D. on expert systems in 1988, as well as Yoshua Bengio’s blog post on his change of mind, as examples of how fast and surprising progress has been even for someone familiar with the field.
I see your point about how a weak claim can water down the whole story. But if I could choose between 100 people convinced that ASI would kill us all, but with no sense of urgency, and 50 or even 20 who believe both that the danger is real and that we must act immediately, I’d choose the latter.
Thanks! I don’t have access to the book, so I didn’t know about the timelines stance they take.
Still, I’m not an advertising professional, but modal verbs like “may” and “could” seem significantly weaker to me. As far as I know, they are rarely used in advertising. Of course, the ad shouldn’t contain anything that contradicts what the book says, but “close” seems sufficiently unspecific to me; for most laypeople who have never thought about the problem, “within the next 20 years” would probably seem pretty close.
A similar argument could be made about the second line, “it will kill everyone”, while the book title says “would”. But again, I feel “would” is weaker than “will” (some may interpret it to mean that additional prerequisites would be necessary for an ASI to kill everyone, like “consciousness”). Of course, “will” can only be true if a superintelligence is actually built, but that goes without saying, and the fact that the ASI may not be built at all is also implicit in the third line, “we must stop this”.
Thanks!
I’m not a professional designer and created these in PowerPoint, but here are my ideas anyway.
General idea:
2:1 billboard version:
1:1 Metro version:
With yellow background:
Thanks for your comment! If we talk about AGI and define it as “generally as intelligent as a human, but not significantly more intelligent”, then by definition it wouldn’t be significantly better than we are at figuring out the right questions. Maybe AGI could help by enhancing our capacity for searching for the right questions, but that shouldn’t make a fundamental difference, especially if we weigh the risk of losing control over AI against it. If we talk about superintelligent AI, it’s different, but the risks are even higher (however, it’s not easy to draw a clear line between AGI and ASI).
All in all, I would agree that we lose some capabilities to shape our future if we don’t develop AGI, but I believe that this is the far better option until we understand how to keep AGI under control or safely and securely align it to our goals and values.
Could you explain why exactly AGI is “a necessity”? What can we do with AGI that we can’t do with highly specialized tool AI and one or more skilled human researchers?
I’m sorry for your loss. I would just like to point out that proceeding cautiously with AGI development does not mean that we’ll reach longevity escape velocity much later. Actually, I think if we don’t develop AGI at all, the chances for anyone celebrating their 200th birthday are much greater.
To make the necessary breakthroughs in medicine, we don’t need a general agent who can also write books or book a flight. Instead, we need highly specialized tool AI like AlphaFold, which in my view is the most valuable AI ever developed, and there’s zero chance that it will seek power and become uncontrollable. Of course, tools like AlphaFold can be misused, but the probability of destroying humanity is much lower than with the current race towards AGI that no one knows how to control or align.
Very interesting point, thank you! Although my question is not related purely to testing, I agree that testing is not enough to know whether we solved alignment.
This is also a very interesting point, thank you!
Thank you! That helps me understand the problem better, although I’m quite skeptical about mechanistic interpretability.
Thanks for the comment! If I understand you correctly, you’re saying the situation is even worse because with superintelligent AI, we can’t even rely on testing a persona.
I agree that superintelligence makes things much worse, but if we define “persona” not as a simulacrum of a human being, but more generally as a kind of “self-model”, a set of principles, values, styles of expression etc., then I think even a superintelligence would use at least one such persona, and possibly many different ones. It might even decide to use a very human-like persona in its interactions with us, just like current LLMs do. But it would also be capable of using very alien personas which we would have no hope of understanding. So I agree with you in that respect.
Maybe we have a different understanding of the term “red line”. For me, it describes something that a human should not do, rather than an event that shouldn’t happen. So if someone releases a model that violates some defined safety specification, the red line is crossed at the moment of release, not when a tragic event occurs (which may or may not be a result of crossing the red line). However, I agree that in both cases there’s the danger of increasing numbness, so drawing too many red lines which are then simply ignored is indeed bad.