A new post looking into the strong version of corrigibility. It argues that strong corrigibility doesn’t make sense without a full understanding of human values, and that with that understanding it’s redundant. Relevant to Amplification/Distillation, since corrigibility is one of the aims of that framework.
https://www.lesswrong.com/posts/T5ZyNq3fzN59aQG5y/the-limits-of-corrigibility