Why do you think that the number of people who could make a convincing case to you is so low?
Because I ran the experiment and very few people passed. (The extrapolation from that to an estimate for the world is guesswork.)
Where do they normally mess up?
There are a lot of different arguments people give, which I dislike for different reasons, but one somewhat common theme was that their argument was not robust to “it seems like InstructGPT is basically doing what its users want when it is capable of it, so why not expect scaled-up InstructGPT to just continue doing what its users want?”
(And when I explicitly said something like that, they didn’t have a great response.)
Yeah… I suppose you could go through Evan Hubinger’s arguments in “How likely is deceptive alignment?”, but you’d probably have some further pushback which would be hard to answer.