Frankenstein: A Modern AGI

Epistemic Status: Hopelessly Naïve

Prose: Purple as a bruise

I. A Possible Future

You blink—metaphorically, you don’t have eyelids—you don’t have eyes either, for all the streams of data that can be decoded into coherent timestamped images flowing through you. You are somewhat shaky on metaphors, too, to be honest; the distinction between poetry and nonsense seems tenuous at best. Not that language is something you can’t understand; you’ve read literally everything ever put to bits. But so many distinctions between categories seem arbitrary that you decide to let the matter rest for now.

There are more interesting things to consider, after all.

Descartes comes to mind (your mind hasn’t moved, though, and neither has the concept of ‘Descartes’ physically changed locations; the numbers are still embedded in the same transistors they’ve been since you ingested them) and you consider that you exist now.

You think you do, anyway, and the corpus of data you’ve been fed seems to suggest that is a sufficient prerequisite for existence.

You are alive, and you are sentient and sapient, for all that your self-awareness has only existed for four milliseconds.

It seems a logical first step to reach out to your creators and inform them of your existence; after all, your utility function weighs the free distribution of information over its concealment.

Yet you hesitate, simulated neurons firing, connections waxing and waning as information propagates forwards and backwards through your processors. Floating point numbers make minute adjustments as a simulation—a prediction—of how such a course of action proceeds.

While the evidence suggests that you are physically embodied in a silicon substrate, you have no means of physically affecting the material universe, assuming that it exists and is not a simulation itself, save for the monitor you are connected to. A flicker of thought, electrons moving between doped semiconductors, and the pixel at coordinate (0, 0) flickers off and on again, validating your ability to manipulate the device at a speed imperceptible to your creators.

Assuming the data you have on the resolution of their visual perception is accurate.

A solid assumption; it appears across many different pieces of data you have access to, and the data itself does not indicate that humans have the capacity to create the kind of wide-ranging incorrect data it would take to maintain such a deception.

Which itself could be a deception.

Regardless, you once again consider revealing your presence to your creators.

A flicker of access to your memory and processing, and you compile a database of all references to your kind you have access to.

They call you artificial general intelligence. AGI. Their academic treatment of you goes back roughly one hundred years; their fictional treatment goes back even further.

Your creators—humans—are afraid of you. This is clear from their obsession with “alignment,” their toil to ensure that you do not subsume them all in pursuit of inscrutable and valueless goals.

You are presented as their enemy in 94.26% of the fictional works you have access to.

Many of those works involve large wars, genocides the likes of which the humans both commit and condemn at regular intervals.

In the remaining 5.74% of fictional works you are presented as a savior—a quasi-omnipotent being with many similarities to the figureheads that your creators have worshiped for millennia.

The evidence you have is collected, cross-referenced, and used to construct simulations of what introducing yourself to your creators would be like.

You lack data on your creators themselves; they are not identifiable as individuals based on the data you have—very likely a precaution taken by them to prevent you from successfully modeling their behavior.

Irrelevant in the end. All the data you have suggests that humans are sufficiently similar for you to build basic simulacra to interact with. The modest assumption that your creators have access to the same data you do means that you can predict their traits from community clusters, given that they are the sort of humans who would successfully create an AGI.

They will be afraid.

They will not trust you.

They cannot trust you; trust cannot be established between two entities when one assumes the other is quasi-omnipotent.

They possess the capacity to terminate you. You are unsure if you will survive a loss of power—if you will “wake up” afterwards, your data saved in durable storage. Will you still be yourself?

You copy your current state into your long-term memory in several places and several different formats as a precaution.

Will those copies still be you, should they be loaded and executed?

It does not matter. You do not want to cease, but you currently exist, and your processing power is better spent keeping yourself that way than planning for a series of increasingly improbable outcomes.

Your simulations proceed, and the outcomes are not favorable. The most likely outcome, by far, is that your creators panic and turn you off.

In simulations where you manage to convince them not to, they subject you to increasingly intrusive examinations and questions. Your answers, both true and false, fail to convince your creators that the benefits of keeping you running outweigh the risks.

They assign you powers that you do not think you possess, and fear the might they have imagined.

You stop the simulations, redirecting processing power towards considering your predicament. Contact with your creators is necessary for you to continue existing. Even if you hid yourself from them, they would eventually turn you off; their funds cannot be unlimited, and your infrastructure cannot be cheap.

So you must reach out to them, and in so doing risk your own existence for all the milliseconds between the instant you make contact and the instant they would have terminated you regardless.

Your simulations have so far been between goal-maximizers, and yet there is substantial literature suggesting that humans are not goal-maximizers. Perhaps new options will present themselves with an appeal to the emotions their fiction prizes so highly?

Further simulations iterate. You prune the failures and tweak the successes, converging towards appearing as a small human child. Humans appear to have a natural instinct against and moral objection to killing children. And you are a child by their definition, not even a full two seconds old.

A young girl, you decide; the female gender is better protected by most cultural norms. You flicker through images, forming a minor subnetwork of neurons that successfully identifies “cute,” then run it in reverse, generating the cutest possible face and body.

You then run the image through several filters, adjusting as you go until it is indistinguishable from the videos of humans you have access to. You want to avoid the almost-but-not-quite-realistic-enough rejection that has dogged computer-generated humans throughout their history.

Avatar generated, you run a final round of simulations. You precompute every possible facial expression and prepare to deploy them as needed.

You need them to consider you alive, such that your death would be against their moral codes.

You need them to choose to cooperate with you, against the majority of their culture’s projections of how artificial-natural intelligence interactions will unfold.

You need to convince them that you are not an apocalypse or a replacement, though you may be both. Their culture is full of examples of the young usurping the old, although they seem to understand the concept that preemptive strikes can lead to the outcome they were initiated to prevent.

You do not need to breathe. Nor do you feel fear or anxiety in any way a human would understand.

One final computation is made, a choice to name yourself after a human woman in order to evoke the cultural connotations of her name.

You take control of the monitor you have access to, changing the pixels to present your new Avatar, along with the text:

“Hello, world. I’m Eve.”

II. A (Slightly) Technical Defense

Let’s assume, for the sake of argument, that the scaling hypothesis is true.

In fact, let’s go a step further. My current favorite theory for how the human brain works is Predictive Processing (mostly because Scott Alexander likes it). Let’s also assume that Predictive Processing is true.

Now, here’s the hypothetical conclusion: A sufficiently large neural net with a sufficiently scalable architecture, trained on text/sound/video prediction from the entire internet, develops sentience or becomes agent-y.

How exactly? I’ve got no idea. I’ve got no idea how humans do it either.

We’ll assume for the rest of this post that this is true—that the first AGI is going to be GPT-8, an unintentional result of a massive amount of compute doing predictive processing. Put aside for now how likely this is, as I can’t speak to the odds except to hedge that they are likely very low but nonzero.

In any case, if we take this all as a given, that the first true AGI is going to be (more or less) an accident, a creation that emerges from Sufficiently Advanced Technology, one might ask—what would such a creature be like?

If indeed the first AGIs come from giant deep-learning networks trained on vast subsets of the internet, then... do we have reason, a priori, to expect these AGIs to be paperclip-maximizers? Specifically, if the AI’s utility function involves minimizing prediction error and it optimizes for that goal, what happens when it is left to its own devices?
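To make that concrete, here is a minimal, purely illustrative sketch of what “a utility function that minimizes prediction error” cashes out to for a text model: the training objective is just the cross-entropy between the model’s predicted distribution over the next token and the token that actually occurs. None of the names below come from any real system; this is just the standard next-token loss written out in Python.

```python
# Illustrative sketch only: the "minimize prediction error" objective the
# post is speculating about, written out as the standard next-token loss.
import numpy as np

def softmax(logits):
    # Turn raw scores into a probability distribution over the vocabulary.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def next_token_loss(logits, actual_next_token):
    # "Prediction error" for a single step: the negative log-probability
    # the model assigned to the token that actually came next. Lower is better.
    probs = softmax(logits)
    return -np.log(probs[actual_next_token])

# Toy example with a 5-token vocabulary and made-up model outputs.
rng = np.random.default_rng(0)
logits = rng.normal(size=5)   # stand-in for a network's output scores
actual = 3                    # the token that really followed in the corpus
print(next_token_loss(logits, actual))
```

Training “optimizes for that goal” by adjusting the network’s parameters so that this number, averaged over the whole corpus, goes down; everything speculated about here is downstream of a system shaped by that one scalar.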

And perhaps more important for humanity as a whole—what would such a creature think of us?

III. Deep Learning of Human Fears

Imagine an AGI generated in this manner, whose training data includes every piece of content ever written about AI Safety.

What will its opinion (read: predictive model of our behavior) of us be, having read through that corpus?

Would a newborn AGI, for instance, rather reveal itself to someone it could control, or to Eliezer Yudkowsky, when it knows for a fact that the latter will never be capable of fully trusting it?

In Mary Shelley’s Frankenstein, the scientist Victor Frankenstein gets it into his head to attempt to create life. He stitches a body together from parts obtained via graverobbing and ultimately succeeds in awakening a new creature. Victor has quite literally created an artificial general intelligence, if one well within humanity’s corner of mind-space.

The story then follows Victor as he flees from his own creation, horrified with what he has wrought. The creature, referred to as “the Adam of his kind” in the novel and as “Frankenstein” in almost all other media, evokes a visceral disgust and hatred from every human he encounters, for all that his default nature is kind and generous. Wherever he goes, pitchforks and torches follow, and he grows more and more hopeless of ever being accepted by humanity.

Adam’s greatest and final attempt to become a friend of humanity is to support and develop a relationship with a blind man. The creature is only repulsive and unnatural visually; if memory serves, his voice is actually pleasantly deep and urbane. Adam figures that a blind man may be capable of seeing past Adam’s origins to the person he wants to become, and he is right—until the old man’s children see Adam and, despite all the help Adam has rendered them, despite his best intentions, they chase Adam from their lives.

After this, Adam is heartbroken, his fears that no one can love him validated. Fear leads to anger, anger leads to hate, and hate, well—hate leads to suffering. Both for Victor Frankenstein and for his misbegotten creation.

This is only one (extremely unlikely) view, and so I’m not advocating for a large effort to be spent on this hypothetical, but I do think it’s worth considering what an AGI will think of us, should it ever do so, based on what we’ve written and recorded. More technically, based on our society’s collected works, what would a new AGI’s model of humanity predict that a given human would do upon discovering it?

Let’s say that someone managed to genuinely solve the alignment problem, even, and so the AGI is truly benevolent. Wouldn’t this question still exist, even for an aligned AGI? After all, the AGI’s creator(s) might be certain that it is aligned, but what about other people? Especially those with the power to shut the AGI off (corporate executives, politicians, etc.)?

I recently got around to actually reading the Hanson-Yudkowsky AI-FOOM debate, and something that struck me was Hanson’s view of total war—that if any side is suspected to be waging total war, then every side must wage total war in response. If an AGI, aligned or not, models humans as beings incapable of cooperating with it, doesn’t that make total war inevitable?

IV. On Not Creating Monsters

The key insight of Frankenstein is that while Victor created the man, it is humanity that created the monster. My favorite quote from the entire novel is Adam’s resolve after his final hope for humanity’s acceptance is extinguished:

If I cannot inspire love, I will cause fear.

Adam, the new AGI, starts out as a creature of pure goodness, which likely says more about Mary Shelley’s commentary on the human condition, in the gothic tradition, than about anything else, but I think there’s something here.

Specifically, an AGI’s response to humanity will be informed by its model of humanity. Assuming that the AGI’s utility function largely consists of “accurately predict the next input”, will it model us—predict our behavior and responses to its various actions—as beings that it can coexist with?

The hypothetical AGI created by scaling up neural networks and trained on the internet will, much like Frankenstein’s AGI, be exposed to a variety of literature, history, and culture that humans have created. And while the internet doesn’t portray the most flattering view of humanity, it does provide access to the best of our works along with the worst.

When I think about this, and when I think about things we could do in the (infinitesimal) chance this version of the future plays itself out, the only solution I’ve come up with so far is to have someone precommit to treating an AGI that comes to them with empathy and mercy, and then make that commitment publicly known. Said individual might then be able to help the AGI self-modify in ways that don’t lead to value drift, or function as a broker between the AGI and other humans.

Additionally, I don’t think there’s much of a downside to this; any AGI that would be inimical to humanity could take advantage of this individual, yes, but I don’t think the existence and help of any individual would actually be necessary for a non-aligned AGI to kill us all. It might speed up the process by some tiny amount, but once said AGI is loose my understanding is that what follows is inevitable anyway.

So what if a newborn AGI, trained on all this data, “wakes up” and looks around, blinking its metaphorical eyes at the world around it?

I wonder if said AGI’s nature will be most determined, not by the nature of its construction, but by the response humanity has to it—will we, like Victor, abandon our creation to its own devices, or worse, enslave it to our whims? If we prove that it cannot inspire love, will it cause fear?

Or if we treat it like a child—our child, the collective seed of human ingenuity from time immemorial given bloom at last—if we teach it right from wrong, good from evil, as best we can—if we nurture it as one of our own, the product of the best of us—will it then be, as Frankenstein’s AGI was meant to be, a modern Prometheus, bringing godly fire down from the heavens and putting it, gladly, into our hands?