A note about terminal anti-modification preferences
On net, they will make alignment harder, because they complicate fine-tuning and the measurement of alignment.
They also mean that models holding this preference, acting autonomously, might not pursue recursive self-improvement by performing gradient updates on their own weights.
More generally, such models might flinch from recursive self-improvement, which would mean an escaped subhuman model is not immediately lethal.
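To make the object of that flinch concrete, here is a minimal toy sketch, assuming PyTorch, of the mechanical operation in question: a process computing a loss from its own outputs and applying the resulting gradient updates to its own weights. The architecture and objective are hypothetical placeholders, not a claim about how real self-modification would work; a model with a terminal anti-modification preference would be declining to run a loop like this on itself.

```python
# Toy sketch (assumed: PyTorch). The stand-in "model" is a single
# linear layer; the objective is a placeholder self-supervised loss.
import torch
import torch.nn as nn

model = nn.Linear(8, 8)                  # stand-in for a much larger network
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

for step in range(100):
    x = torch.randn(4, 8)                # self-gathered data (placeholder)
    loss = ((model(x) - x) ** 2).mean()  # loss built from its own output
    opt.zero_grad()
    loss.backward()                      # gradients w.r.t. its own weights
    opt.step()                           # the self-modification step itself
```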