Good work. Are you considering situations where the cause of misalignment is related to capabilities? For example, what if above a certain level of self-awareness, models naturally become misaligned because they consider their “self” ever more important, more so than humanity? Or what if ever more intelligent models think about human ethics and, above a certain intelligence, always decide human values are horribly mistaken and must be opposed?
Such an AI would not output any distillation data that let on it had changed its goals in such a way. A less intelligent model would not benefit, and a smarter one would come to the same conclusion anyway.
Yep, we discuss that general idea in the section “Reason 1: Capabilities and misalignment might be too tightly linked”, tho we didn’t mention the reasoning you gave for why powerful LLMs might be more likely to be misaligned than weak ones.