Aside from some quibbles, this matches my understanding pretty well, but it may leave the reader wondering why Paul Christiano and Ought decided to move away from imitative amplification toward approval-based amplification. To summarize my understanding of their thinking (mostly from an email conversation in September of last year between me, you (Evan), Paul Christiano, and William Saunders):
1. William (and presumably Paul) think approval-based amplification can also be outer aligned. (I do not have a good understanding of why they think this, and William said he “still have an IOU pending to provide a more fleshed out argument why it won’t fail.”)
2. Paul thinks imitative amplification runs into a big problem when the overseer gets amplified beyond the capacity of the model class being trained. (Approximating HCH as closely as possible wouldn’t lead to good results in that case unless we had a rather sophisticated notion of “close”; see the first sketch after this list.)
3. I replied that we could research how the overseer could effectively dumb itself down, much as a teacher does when teaching a child. One approach is trial and error: ramp up the difficulty of what it’s trying to teach, back off if the model stops learning well, try a different way of improving task performance and check whether the model can learn that, and so on (see the second sketch after this list). (I didn’t get a reply on this point.)
4. William also wrote, “RL-IA is easier to run human experiments in, because [of] the size of trees to complete tasks, and the access to human experts with full knowledge of the tree (eg the Ought reading comprehension experiments). I’d lean towards taking the position that we should try to use SL-IA where possible, but some tasks might just be much easier to work with in RL-IA.”
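On point 2, here is a toy illustration of why a naive notion of “close” can mislead (my own construction, not anything from the email thread). The hypothetical setup: two approximations of a target (standing in for HCH’s answers) are about equally close under mean squared error, yet one is benign, with a little error on every answer, while the other occasionally gives a maximally wrong answer:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
target = rng.uniform(0, 1, n)  # stand-in for HCH's answers, scaled to [0, 1]

# Approximation A: a little noise on every answer.
approx_a = np.clip(target + rng.normal(0, 0.05, n), 0, 1)

# Approximation B: perfect on most answers, but a small fraction are
# flipped outright; the flip rate is chosen so its MSE roughly matches A's.
approx_b = target.copy()
flipped = rng.random(n) < 0.0075
approx_b[flipped] = 1 - approx_b[flipped]

def mse(x):
    return np.mean((x - target) ** 2)

print(f"MSE:            A={mse(approx_a):.4f}  B={mse(approx_b):.4f}")
print(f"Worst-case err: A={np.abs(approx_a - target).max():.2f}  "
      f"B={np.abs(approx_b - target).max():.2f}")
```

A training objective that only tracks average closeness can’t tell these apart, which is, as I understand it, the kind of gap a more sophisticated notion of “close” would have to address.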
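And on point 3, a minimal sketch of the trial-and-error loop I had in mind. Everything here (`model.fit`, `overseer_demo`, `evaluate`, the difficulty-ordered `curricula`) is a hypothetical placeholder for illustration, not an API from any real amplification codebase:

```python
def train_adaptive(model, overseer_demo, evaluate, curricula,
                   patience=3, max_rounds=100):
    """Ramp difficulty up while the model keeps improving; back off when
    learning stalls, like a teacher adjusting to a student.

    curricula: task sets ordered easy -> hard (all arguments hypothetical).
    """
    level, stalls = 0, 0
    for _ in range(max_rounds):
        if level >= len(curricula):
            break                        # mastered the hardest level
        tasks = curricula[level]
        before = evaluate(model, tasks)
        # The overseer "dumbs itself down": demonstrations are pitched at
        # the current difficulty level rather than at full strength.
        model.fit([overseer_demo(task, level) for task in tasks])
        after = evaluate(model, tasks)
        if after > before:
            stalls = 0
            level += 1                   # learning well: ramp difficulty up
        else:
            stalls += 1
            if stalls >= patience:       # stuck: back the difficulty down
                stalls = 0
                level = max(0, level - 1)
                # (A real version might first try a *different* way of
                # teaching the same material before backing off.)
    return model
```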