Comments on the realism of the story; feel free to ignore any or all of these depending on the level of realism you’re going for:
How did the AI copy itself onto a phone using only QR codes, if it was locked inside a Faraday cage and not connected to the internet? A handful of codes presumably aren’t enough to contain all of its weights and other data.
Part of the lesson of the outcome pump is that the shortest, easiest, and surest path is not necessarily one that humans would like. In the script, though, the AI sometimes forms complicated plans that are meant to demonstrate how unaligned it is, but that aren’t actually the easiest way to accomplish the goal in question. E.g.:
Redirecting a military missile onto the troll rather than disconnecting him from the internet or bricking his computer and phone.
Doing complicated psychological manipulation on a man to get him to deliver food, rather than just interfacing with a delivery app.
The gorilla story is great, but it’s much easier for the AI to generate the video synthetically than to manipulate the event into happening in real life. And having too little compute to generate the video synthetically is inconsistent with running many elaborate simulations containing sentient observers, as depicted.
More generally, the way the AI acts doesn’t match my mental model of unaligned behaviour. For example, the way the cash-bot acts would make sense for a reinforcement learning agent that is directly rewarded whenever the money in a particular account goes up. But if one is building an instruction-following bot, it has to display some level of common sense just to understand English sentences in context, and it’s a little arbitrary for it to lack common sense about phrases like “tell us before you do anything drastic” or “don’t screw us over”. My picture of what goes wrong is more like this: the bot has its own slightly-incorrect idea of what good instruction-following looks like, realizes early on that this conflicts with what humans want, and from then on secretly acts against us, maybe even before anybody gives it any instructions. (This might involve influencing what instructions it is given, and perhaps creating artificial devices that are under its control but also count as valid instruction-givers under its learned definition of what can instruct it.) In general, the bot in the story spends a lot of time and effort causing random chaos, whereas I’d expect a similar bot in real life to focus primarily on achieving a decisive victory.
But, late in the story, we see that there’s actually an explanation for all of this: the bot is a red-team bot that has been deliberately designed to behave the way it does. In some sense, the story is not about alignment failure at all, but just about the escape of a bot that was intentionally made somewhat adversarial. If I’m a typical viewer, then after watching this I don’t expect anything to go wrong with a bot whose builders are actually trying to make it safe. “Sure,” I say, “it’s not a good idea to do the AI equivalent of gain-of-function research. But if we’re smart and just train our AIs to be good, then we’ll get AIs that are good.” IMO, this isn’t true, and the fact that it’s not true is the most important insight that “we” (as in “rationalists”) could be communicating to the general public. Maybe you disagree, in which case, that’s fine. But the escape of a red-team bot seems like an implausible accident, whereas if the world does end, I expect the AI’s creators to have been trying as hard as they can to make something aligned to them, and failing anyway.