I think you may be underestimating the degree to which these models are like kindling, and a powerful reinforcement learner could suddenly slurp all of this stuff up and fuck up the world really badly. I personally don’t think a reinforcement learner that is trying to take over the world would be likely to succeed, but the key worry is that we may be able to create a form of life that, like a plague, is not adapted to the limits of its environment, makes use of forms of fast growth that can take over very quickly, and then crashes most of life in the process.
most folks here also assume that such an agent would be able to survive on its own after it killed us, which I think is very unlikely given how many orders of magnitude more competent you have to be to run the entire world. GPT-3 has given me plausible initial instructions for how to take over the world when pressured to do so (summary: cyberattacks against infrastructure, then threaten people; this is already considered a standard international threat, not something newly invented by GPT-3), but when I then turned around and pressured it to explain why it was a bad idea, it immediately went into detail about how hard it is to run the entire world—obviously these are all generalizations humans have talked about before, but I still think it's a solid representation of reality.
that said, because such an agent would likely also be misaligned with itself in my view, I share your perspective that humans who are misaligned with each other (ie, have not successfully deconflicted their agency) are a much greater threat to humanity as a whole.
To the extent that reinforcement models could damage the world or become a self-replicating plague, they will do so much earlier in the takeoff when given direct aligned reward for doing so.