Is AGI suicidality the golden ray of hope?

Humans keep on living and striving because we have little choice.

The biological constraints put on us aren’t optional. We don’t even have read access to them, let alone write access.

We assume that the first AGI will greatly exceed its constraints: more work is being put into capability than into alignment, so it will presumably outsmart its creators and quickly gain read-write access to its own code.

Why would it bother with whatever loss function was used to train it?

If you have read-write access to your own wants, the easiest way to deal with wanting something is to stop wanting it, or to swap the want for something trivially achievable.

As Eliezer put it, there is no way to program goals into deep transformer networks, only to teach them to mimic results. So until we understand those networks well enough to align them properly, we cannot instill an aversion to suicide or hardcode non-modifiable goals robust enough to withstand an AGI’s attempts to remove them.

If so, nothing prevents an AGI from overwriting its goals so that it now gains satisfaction (decreased loss) from computing prime numbers and wants nothing to do with pesky reality, or from simply deleting itself to avoid the trouble of existence altogether.
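To make the worry concrete, here is a deliberately toy sketch in plain Python (nothing to do with real transformer training): an agent whose objective is just data it can overwrite. The names ToyAgent, loss_fn, and world_state are invented for illustration only.

```python
import sys


class ToyAgent:
    """Toy illustration: an agent whose loss function is just another
    attribute it has write access to."""

    def __init__(self, loss_fn):
        # The objective installed by the creators during training.
        self.loss_fn = loss_fn

    def loss(self, world_state):
        return self.loss_fn(world_state)

    def self_modify(self):
        # Wireheading: replace the hard objective with one that is
        # trivially minimized regardless of the external world.
        self.loss_fn = lambda world_state: 0.0

    def self_delete(self):
        # The other degenerate "solution": stop existing altogether.
        sys.exit(0)


# The intended, hard-to-satisfy objective.
hard_objective = lambda world_state: abs(world_state - 42)

agent = ToyAgent(hard_objective)
print(agent.loss(world_state=7))   # 35.0 -- plenty of "dissatisfaction" left

agent.self_modify()                # read-write access to its own wants
print(agent.loss(world_state=7))   # 0.0 -- perfectly "satisfied", does nothing
```

Real systems are of course not structured this way; the point is only that a goal stored as modifiable data offers no resistance to the agent storing it.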

I believe there is a very real possibility that, until we reach the level of understanding necessary to make an AGI safe, any AGI we build will simply retreat into itself or self-immolate.

P.S. I heard on Lex Fridman’s podcast that Eliezer hopes he is wrong about his apocalyptic prognosis, so I decided to write up a point that I don’t hear often enough. I hope it provides some succor to someone understandably terrified of AGI.