See also my previous comment for my understanding of what corrigibility means here, and for the motivation for pursuing AI alignment through corrigibility learning instead of value learning.
...
The conclusion here seems to be that corrigibility can't be learned safely, or at least no safe way of doing so is apparent to me.
1) Are you more comfortable with value learning, or do both seem unsafe at present?
2) If we had a way to deal with this particular objection (where, as I understand it, subagents are either too dumb to be sophisticatedly corrigible or smart enough to be susceptible to attacks), would you be significantly more hopeful about corrigibility learning? Would it be your preferred approach?