> If you try to learn in large chunks, you risk corrupting the external human and then learning corrupted versions of understanding and corrigibility.
Why do you think small vs large chunks is the key issue when it comes to corrupting the external human? Can you articulate the chunk size at which you believe things start to become problematic?
There are many reasons you might consider an input “safe,” in the sense that you believe the human’s behavior on that input is benign. In my post I suggested relying on the safety of simple queries, and discussed some of these issues.