Possibly Funny Comedian at the Apollo Theatre
Michael S.R. Kitti
Michael S.R. Kitti’s Shortform
14. Some problems, like ‘the AGI has an option that (looks to it like) it could successfully kill and replace the programmers to fully optimize over its environment’, seem like their natural order of appearance could be that they first appear only in fully dangerous domains. Really actually having a clear option to brain-level-persuade the operators or escape onto the Internet, build nanotech, and destroy all of humanity—in a way where you’re fully clear that you know the relevant facts, and estimate only a not-worth-it low probability of learning something which changes your preferred strategy if you bide your time another month while further growing in capability—is an option that first gets evaluated for real at the point where an AGI fully expects it can defeat its creators. We can try to manifest an echo of that apparent scenario in earlier toy domains. Trying to train by gradient descent against that behavior, in that toy domain, is something I’d expect to produce not-particularly-coherent local patches to thought processes, which would break with near-certainty inside a superintelligence generalizing far outside the training distribution and thinking very different thoughts. Also, programmers and operators themselves, who are used to operating in not-fully-dangerous domains, are operating out-of-distribution when they enter into dangerous ones; our methodologies may at that time break.
This seems like a training-distribution problem in which being out-of-distribution carries a very high cost. Is anyone working on what the model’s internal state looks like when it first ‘sees’ a dangerous option as viable? That seems like the moment at which you could catch it.
New quick take—how long?
Can an excellent haiku
Be what it’s meant for