Here are the problems I found (so far) with this approach and my attempt at solutions:
Problem 1: We won’t think of all possibilities for bad stuff happening, and we might miss a crucial failure mode in our text corpus.
Solution 1: We might try to outsource the creation of some of those possibilities to GPT-N itself. This would be a much bigger problem if GPT-N were explicitly optimizing against us, but it isn’t.
Problem 2: GPT-N might predict that the world ends in a way that leaves observables like newspaper headlines unchanged, or that things go badly quickly enough that humanity doesn’t have time to notice.
Solution 1: we can ask GPT-N for probabilities at fine-grained time slices, and for text modalities that respond quickly to events, such as tweets or forum posts.
Solution 2: for the case where things go badly so rapidly that we don’t notice at all, we can ask GPT-N for the probability of pictures (say, a picture of downtown New York, still intact with living humans). Pictures respond to real-world events faster than text does. The only ways to beat this check are something like a simultaneous takeover of all forms of electronic communication, or the total destruction of everything at once. To guard against the latter, at least one of the cameras should ideally be on a satellite, taking pictures of the Earth, so it wouldn’t be immediately destroyed.
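The core operation in these two solutions is asking the model for the probability of a concrete future observation and treating a sharp drop as an alarm. As a minimal sketch of that scoring step, the toy character-bigram model below stands in for GPT-N (whose query interface is not specified here); the corpus, strings, and function names are all illustrative assumptions.

```python
import math
from collections import defaultdict

def train_bigram(corpus):
    """Count character bigrams: a toy stand-in for GPT-N's predictive distribution."""
    counts = defaultdict(lambda: defaultdict(int))
    for text in corpus:
        for a, b in zip(text, text[1:]):
            counts[a][b] += 1
    return counts

def avg_log_prob(counts, text, alpha=1.0, vocab_size=128):
    """Laplace-smoothed log-probability of `text`, averaged per bigram
    so that strings of different lengths are comparable."""
    lp = 0.0
    for a, b in zip(text, text[1:]):
        total = sum(counts[a].values())
        lp += math.log((counts[a][b] + alpha) / (total + alpha * vocab_size))
    return lp / max(1, len(text) - 1)

# Hypothetical "normal world" corpus of past headlines.
corpus = ["markets open as usual today", "markets open higher today"]
model = train_bigram(corpus)

# Score two candidate future headlines: one consistent with the training
# distribution, one not. A large drop in probability is the alarm signal.
normal = avg_log_prob(model, "markets open as usual")
weird = avg_log_prob(model, "zzqx qqq zz")
assert normal > weird
```

With a real GPT-N the same comparison would be run over fine-grained time slices, flagging any date at which the probability of ordinary headlines, tweets, or photographs collapses.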
Solution 3: we need to sign all the items in our dataset with cryptographic keys stored on computers in Faraday cages. Organizations should also run accounts (e.g. on Twitter) that publish signed messages or pictures every day or so, serving as “dead man’s indicators”. This ensures that a potential fast doom scenario cannot be purely electronic; it would have to be physical.
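The dead-man-indicator check boils down to: a heartbeat counts only if its signature verifies and it is recent. The sketch below illustrates that logic under stated assumptions: a real deployment would use public-key signatures (e.g. Ed25519) so verifiers never hold the secret, but an HMAC keeps this example standard-library-only; the key, message, and freshness window are all hypothetical.

```python
import hmac, hashlib, json

# Hypothetical secret; in the real scheme this lives on an air-gapped
# machine in a Faraday cage, and verification uses a public key instead.
SECRET_KEY = b"example-key-kept-on-an-airgapped-machine"

def sign_heartbeat(message: str, timestamp: float) -> dict:
    """Produce a signed 'still alive' message (the dead man's indicator)."""
    payload = json.dumps({"msg": message, "ts": timestamp}, sort_keys=True)
    tag = hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "tag": tag}

def verify_heartbeat(beat: dict, now: float, max_age: float = 2 * 86400) -> bool:
    """A heartbeat counts only if the signature checks out AND it is recent."""
    expected = hmac.new(SECRET_KEY, beat["payload"].encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, beat["tag"]):
        return False  # forged or corrupted: electronic tampering detected
    ts = json.loads(beat["payload"])["ts"]
    return (now - ts) <= max_age  # stale heartbeat: the indicator has "died"

beat = sign_heartbeat("downtown New York still intact", timestamp=1_700_000_000.0)
assert verify_heartbeat(beat, now=1_700_000_000.0 + 3600)           # fresh, valid
assert not verify_heartbeat(beat, now=1_700_000_000.0 + 3 * 86400)  # too old
tampered = {"payload": beat["payload"].replace("intact", "gone"), "tag": beat["tag"]}
assert not verify_heartbeat(tampered, now=1_700_000_000.0 + 3600)   # signature fails
```

An attacker controlling all electronic channels can block or replay old heartbeats, but cannot forge fresh ones without the physically isolated key, which is what forces a fast doom scenario to be physical rather than purely electronic.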
Problem 3: The optimization pressure scales with the number of questions we ask GPT-N, especially if those questions are informed by its past refusals or answers. If we reach a situation where GPT-N refuses to answer all our questions and we keep asking anyway, then we are optimizing against the safety conditions.
Solution 1: this means that pretty much the first thing we need to ask GPT-N is how to safely extend the number of questions we can ask.