The tricky question is Goodhart, and this is a point where I disagree with Charlie Steiner. I do think humans are at least Boltzmann-rational in all non-political areas, and I think that domain is wide enough, capabilities-wise, for this to work (though it kills any efforts against misuse). I also think that sandboxing an AI so that it has zero probability of discovering politics is actually possible.
In short, I am much more optimistic than Charlie Steiner about human rationality in all non-political areas, and I think sandboxing is possible.
This certainly hurts capabilities, especially the social capabilities of systems like LLMs, which is a real cost, though RLHF might prevent it from becoming a serious problem.