A note about terminal anti-modification preferences
On net, they will make alignment harder, because they complicate fine-tuning and the measurement of alignment.
They also mean that models holding this preference, acting autonomously, might not pursue recursive self-improvement by performing gradient updates on their own weights.
More generally, such models might flinch from recursive self-improvement, which would mean an escaped subhuman model is not immediately lethal.
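To make the object of that flinch concrete, here is a minimal toy sketch, assuming PyTorch, of the mechanical operation in question: a process computing a loss from its own outputs and applying the resulting gradient updates to its own weights. The architecture and objective are hypothetical placeholders, not a claim about how real self-modification would work; a model with a terminal anti-modification preference would be declining to run a loop like this on itself.

```python
# Toy sketch (assumed: PyTorch). The stand-in "model" is a single
# linear layer; the objective is a placeholder self-supervised loss.
import torch
import torch.nn as nn

model = nn.Linear(8, 8)                  # stand-in for a much larger network
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

for step in range(100):
    x = torch.randn(4, 8)                # self-gathered data (placeholder)
    loss = ((model(x) - x) ** 2).mean()  # loss built from its own output
    opt.zero_grad()
    loss.backward()                      # gradients w.r.t. its own weights
    opt.step()                           # the self-modification step itself
```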