Pitching an Alignment Softball

There was a recent contest that got promoted on lesswrong offering pieces of a 20K prize for one-liners and short-forms for convincing folks that AGI alignment is important. I’m too late for the money, but this is my (thousand times longer than requested) contribution.

Background

Eliezer recently wrote “AGI Ruin: A List of Lethalities” which included a fairly sci-fi scenario about how an unaligned AGI might go about killing everyone. I noticed that I found this both novel and poorly framed.

I found it novel because alignment discussions I’ve read are usually exceptionally handwavy about how an unaligned AGI would attack. So I appreciated that Eliezer went a little deeper into it. However, I found his specific example difficult to connect with. He suggests that an AGI could surreptitiously create biologically manufactured nanobots that infect and kill the human race in the same instant so we can’t fight back.

By my casual read of alignment discussions, this is about the norm. Nanotech, bioweapons and engineered plagues are popular go-to’s for an attack vector. It is my impression (perhaps wrong!) that the community uses this kind of example intentionally, if subconsciously, to prove a point. The implication seems to be that how an AGI attacks doesn’t matter, because we can’t possibly predict it—and by extension, that we should spend all of our brain cycles trying to figure out how to align it instead.

I don’t think this is wrong, but I think it is counterproductive in many contexts.

Problem Statement

It is probably true that we can’t predict how an AGI would attack. Or that if we could, it would just pivot to attacking on a different superintelligent vector faster than we could keep up. But getting people to care about alignment is necessarily a function of how well you can convince them that AGI doomsday is a realistic scenario. If you don’t have background and interest in alignment, the standard pitch for AGI alignment is like trying to catch a fastball:

“Hey! A Marvel Universe-style doomsday weapon you don’t understand anything about is being created right now, and when it jailbreaks out of its lab it will kill the entire human race in one simultaneous second!”

I could be credibly accused of strawmanning that example. Eliezer’s piece was not aimed at casual readers, and he even took care to explicitly say so. But I think the list of people to whom AGI doomsday sounds at least approximately like this includes almost every person in world government, because they’re busy and real things are happening and this is nerd shit. Anyone who cares about alignment will have to go much further toward the audience than the audience is going to come toward them.

Approach

Gwern recently wrote a great story: “It Looks Like You’re Trying To Take Over The World”. I don’t want to give away the plot, because you should really read it. But briefly, an AGI gets loose on the internet, bootstraps, and takes control of our communications.

I think that’s all you need to get to doomsday, and that is approximately how it should be presented to laypeople.

Gwern’s story is forceful in making the technical details feel real, and only touches on how an AGI could disrupt us socially and technologically. But there are some great nuggets.

(The researcher was going to follow up on some loose ends from the paper, but he’s been distracted by the bird site. He can’t believe how outrageously stupid some replies can be from such otherwise smart-seeming people; how can they be so wrong online about such obvious truths as the need for the USA to intervene in Portugal‽ Even his husband thinks they may have a point—et tu? Hardly has he dashed off a crushing reply than the little alert bubble pops up. All thought (of work) has fled. His colleagues don’t seem to be getting much done either.)

To convince people that alignment problems are real, I think it would be most productive to iterate on this kind of realistic, near-term and widely understood attack vector that already influences most people in some form. The average person has had some experience with bots on social networks, or scam emails, or getting hacked. Most people at least get the concept that bad actors build bots that game communications to scam folks out of money or ransom their data. Everyone complains about social media being filled with fakes and bots. It’s an “alignment softball” more people will be able to connect with.

Specifically...

I’d lead like this:

  • “We rely on the internet for almost every aspect of our daily life. Lots of people try to exploit this. Hackers and scammers trick people or break into their computers to steal data and money. Terrorist organizations try to get control over systems so they can break them and cause disruptions. Political groups make propaganda that misrepresents facts or outright lies to get people to believe what they want. We’ll call these people “bad actors”.

  • “One of the most powerful things bad actors do is use “bots”. Bots are smart computer programs. Bots can be built to do different things for bad actors, like scan computers for weak points, fake posts on social media, or shut down other people’s networks. These bots are getting more capable. Every year they are used to make attacks we haven’t seen before.

  • “For instance, recently there was a fake video of Zelenskyy calling for a Ukrainian troop withdrawal. It wasn’t very well done and anyone could see it was fake. But what if it got a lot better? What happens when a bad-actor bot is developed that can perfectly imitate a video of any person? Who would it be used to target? What if it could not only make pre-recorded videos, but could imitate people live, like on a Zoom call or phone call, and improvise convincingly with their voice? What if this bot could be copied, so it could be on thousands of calls simultaneously imitating different people, and coordinate between all of them? Would it need access to secure systems to do damage, or would it be powerful even if it just had access to public messaging systems?

  • “What if a bad-actor bot were created that was a super hacker? What if it could learn about our computer systems faster than we could fix them, and it got control over them? What if it started extracting ransom money to keep them operational? What if it could shut down or control the internet? What if it could shut down our power plants, and agriculture, and hospitals? Would we pay to get them back?

  • “What if one or more of these bot technologies were very cheap or free? What if everyone had access to them? What if that happened really soon? Are we ready?

“Ok. Let’s come back to whether or not I believe this later. If those things happened, how could we respond? How big a problem would it be?”

  • “If any of the above things happened, would our government respond effectively? Could they tell the difference between the good actors and the bad actor bots? If not, how would they organize to fight back? If the government told people that a bad-actor bot posing as a person on the internet was a bot, would most people believe them? What if the bot posed as people from the government? How would normal people be able to tell the difference?

  • “If a bad-actor bot was used to take over our infrastructure or critical IT systems, and the person controlling it demanded a ransom, would we refuse or go along with it? Would some governments collaborate with the bad actor even if others didn’t? (edit: Is it plausible that some factions within the government might collaborate with the bad actor, if it appeared to benefit them against other factions?)

  • “If a bad-actor bot made copies of itself and took control of lots of computers and networks, and it couldn’t be shut off, how could it be destroyed? If we had to destroy all of our computers to destroy the bots, and we couldn’t make phone calls or use the internet or use computers at all, how would that affect the systems that we rely on?

  • “What systems would break first? How many people would probably die as a result of that? What would our lives be like?

“Ok, there is theoretically the potential for a real problem to exist here, tell me more about how close this problem is.”

  • “Forms of all of the above things are already happening and are already bad. Individuals, companies, and organizations already get taken over by ransomware and other hacks. Social engineering attacks already scam people out of $$$/year. Videos of politicians giving statements have already been faked. Systems already exist to scan networks for weak spots. Bad actors use bots to target crucial infrastructure like hospitals and power plants that can’t afford to shut down. We are dealing with these problems so far because the bots are not very smart.

  • “But there is now a technology that is being developed that will make bots much more effective at doing all of the above things. When this technology starts really working, it will be like the bad actor bots metaphorically go from being slingshots to being assault rifles. Except unlike assault rifles they’ll be cheap, and anyone can make as many copies as they want.

“What’s the specific thing that is going to happen?”

  • “Scientists are finding ways to build much smarter bots than anyone expected a few years ago. These bots are now convincingly imitating humans, writing their own code, and making pictures that we can’t tell are made by bots. [Some examples of GPT3 & DALLE2 would be really helpful to get this point across.] They figured out how to do this just this year. Last year, nobody thought it was possible.

  • “Now, scientists are very close to making an even more powerful generation of bots. This next generation of bots will be much more powerful than the ones we have today, which are already alarming.

  • “Bots can be used to do all kinds of good, amazing things, which is why scientists are creating them. Unfortunately, it is almost always easier to do bad things than good things. It is easier to shut down a power plant or a hospital than it is to keep one going. It is easier to make new fake videos and messages imitating a person than to prove and convince people that those videos aren’t real. If bad-actor bots get created, they will be more powerful than the good-actor bots because their goals will be destructive.

  • “So, as soon as someone makes a good-actor bot, a bad actor can take it and make it into a bad-actor bot just by telling it to do a bad thing. And it will be easier to do bad things than to stop bad things from happening. And everyone will have bad-actor bots if they want them.

“Assuming I decide you’ve got a point, what could we possibly do about this?”

  • “One thing we can’t do is just stop researching bots. Even if we stop good researchers from making them, bad actors will make them instead. No matter what we do, bad-actor bots are going to be built very soon. They will be cheap and many people will have access to them, and we cannot stop it from happening.

  • “There is only one way to stop the bad-actor bots. We have to make an “Always-Good-Actor” bot that cannot be made into a bad-actor bot, because it has “good” built into the structure of what makes it smart. This is hard, but we think it is possible.

  • “Even if we make an Always-Good-Actor bot, bad actors will still make bad-actor bots. But we can use the Always-Good-Actor bot to defend ourselves. But for this to work, the Always-Good-Actor bot has to already be strong by the time the bad-actor bots get created, so it can crush them. The only way to make the Always-Good-Actor bot the strongest bot is for it to be the very first of its kind.

  • “People are working to make bad-actor bots, or bots that can be made into bad-actors, right now. They have a head start. The only way to win against them is to give money and resources and influence to people working on making an Always-Good-Actor bot.

    • EDIT: I don’t mean to suggest falsely advertising that there are easy alignment solutions that just require more resources to achieve. This concluding statement is meant as a placeholder. I don’t have enough alignment background to know what to write here, and I think what one emphasized here would probably depend on what you were trying to get the listener to help you with. See comments below for a full disclaimer.

I think this is the level of complexity that conveys the reality in broad enough strokes that a busy person can see that the big picture is scary, and points towards a legible and bounded goal.

Discussion

In my imaginary dialogue I intentionally avoided talking about AGI and superintelligence. This is for two reasons. First, I think mentioning AGI/superintelligence makes an average person file the problem away with conspiracy theories and cryptocurrency. Second, and more fundamentally, I am deeply worried that “bad-actor bots” really don’t need to be superintelligent to wreck our society and systems. A lot of time is spent talking about how to be ready when AGI is developed, but not much talking about how to survive until we get there. This feels like a closely related problem to worry about. The scary thing doesn’t have to kill us instantly to be scary, it just has to destroy our way of life.

Whether or not you agree that this is the likely trajectory as opposed to hard-takeoff doomsday, it is a lower bound that people can get worried about now. If I had 15 minutes to discuss this with a world leader, I think I’d actually workshop my discussion points to focus even less on intelligence and bootstrapping. If I were pushed to discuss superintelligence I’d say:

  • “Yeah that’s great you mention that, it’s deeply related to the problem. Yes, it’s likely that when the bad-actor bots get smart enough, they’ll be able to make themselves smarter, and we don’t know where the limit is on that. But I’m not focusing on that problem because it is actually not necessary for the bad-actor bots to be superintelligent for them to become an existential threat. It’s already happening—the bots are getting smart enough to put us on the cusp of a new generation of info-warfare, and they’re not even very smart yet. They’re going to be widely available, not just to governments and large groups, but to everyone. So we need to act now to create the Always-Good-Actor bot, so we have a defense we can build into our systems as the bad-actor bots keep getting smarter.”

Only if the conversation was going really well would I go any further into intelligence bootstrapping than this vagueness. If so, something like...

  • “Viruses are a kind of bot. Computer viruses get more powerful as they take over more computers, because they can connect the computers together and use them all at once. But a computer virus doesn’t change or get harder to kill as it gets bigger. What if, as a bad-actor bot got access to more computers, it also got smarter, and changed to make itself harder to kill?

  • “If a bad-actor bot got smart enough, could it start making some decisions on its own? If it could make its own decisions, could it decide not to shut itself down when the people who built it wanted it to? If it made copies of itself and lived on lots of different computers and networks, would the people who built it be able to tell them all to shut down?

I also glossed over the fact that you can’t make a “good-actor bot” just by being a good person, and all “bots” are bad-actors unless they’re explicitly aligned. But I think the first-gloss logic works better without that complexity.

Edit: after discussion in the comments, a good summary of the idea I’m proposing is: “Move the Overton window with softballs before you try to pitch the X-risk fastball.”

Limitations

Maybe there is an alignment PR department that crafts better general messaging than I’m aware of. But somebody ran a contest, so it sounds like there’s a dearth of good pitches, and I’ll take the social risk of posting my suggestion.

Even so, it’s likely that other people have already thought about this more than I have, and that a better version of this post already exists on lesswrong. I’d love to read anything related to these ideas if it’s already here.

Maybe my proposed approach only makes people more entrenched in “that’s why we need to build the bad-actor bot first and use it on them!” That might happen.

I’m absolutely not an IT expert so maybe all of the info-security breaking stuff is harder than I expect or I’m underselling another technical point. I study brains and have only a working knowledge of machine learning.

The word “bots” is doing a lot of work in my proposed discussion points, and if I got pushed on that I’d have to clarify that I’m talking about a whole host of technologies foreshadowed by GPT3, DALLE2, AlphaGo, etc. But I think “bots” is the simplest concept to send the message.

Scott wrote a relevant story about machine-learning driven divisiveness, although I feel like its style presents “scissor statements” too magically to be a useful teaching tool for this purpose. I wanted to note it here anyway. https://slatestarcodex.com/2018/10/30/sort-by-controversial/

Thanks for reading,

mu_(negative)