I’ve been reading a fair bit about “worse than death” scenarios from AGI (e.g. posts like this), and their intensities and probabilities. I’ve generally been under the impression that the worst-case scenarios have extremely low probabilities (i.e. they would require some form of negative miracle to occur) and can be considered a form of Pascal’s mugging.
Recently, however, I came across this post on OpenAI’s blog. The blog post notes the following:
Bugs can optimize for bad behavior
One of our code refactors introduced a bug which flipped the sign of the reward. Flipping the reward would usually produce incoherent text, but the same bug also flipped the sign of the KL penalty. The result was a model which optimized for negative sentiment while preserving natural language. Since our instructions told humans to give very low ratings to continuations with sexually explicit text, the model quickly learned to output only content of this form. This bug was remarkable since the result was not gibberish but maximally bad output.
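To make the failure mode concrete, here is a minimal sketch (not OpenAI’s actual code; the names `reward`, `kl`, and `BETA` are illustrative) of how a single sign error can turn a training objective into its exact negation, so that the optimizer actively seeks the worst-rated outputs rather than merely failing to find good ones:

```python
# Illustrative sketch of a sign-flip bug in an RLHF-style objective.
# Assumed setup: the policy is trained to maximize a human-preference
# reward minus a KL penalty that keeps it close to the pretrained model.

BETA = 0.1  # KL penalty coefficient (illustrative value)

def intended_objective(reward: float, kl: float) -> float:
    # Maximize the human-rated reward, penalize drift from the prior.
    return reward - BETA * kl

def buggy_objective(reward: float, kl: float) -> float:
    # A refactor flips the sign of the reward, and the same bug flips
    # the KL term too, so the optimizer maximizes the exact negative
    # of the intended objective.
    return -(reward - BETA * kl)

# Whatever sample the intended objective scores highest, the buggy
# objective scores lowest, and vice versa:
assert buggy_objective(1.0, 0.5) == -intended_objective(1.0, 0.5)
```

Because the KL term still anchors the model to fluent text under this sketch, the result is exactly what the quote describes: coherent natural language optimized to be maximally disapproved of, rather than gibberish.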
This seems to be exactly the type of issue that could cause a hyperexistential catastrophe. With this in mind, can we really consider the probability of this sort of scenario to be very small (as I previously believed)? Do we have a reason to believe that this remains highly unlikely to happen with an AGI? If not, would that suggest that current alignment work is net-negative in expectation?