Humans are intrinsically corrigible, as social animals, and have specific emotions, pride and shame, which society uses to shape their value systems (as explained in https://en.wikipedia.org/wiki/The_Emotion_Machine).
Society provides the value systems, which can be very complex and variable … whereas nature only provides the simple, but vital, “hooks” that value-shaping depends on. And we can tell that the hooks are simple, because toddlers (and some animals) can implement them. The basis of corrigibility in organisms is also non-rational, which answers the basic objection:
‘We call an AI system “corrigible” if it cooperates with what its creators regard as a corrective intervention, despite default incentives for rational agents to resist attempts to shut them down or modify their preferences’. Rationality is intrinsically high level: it never goes all the way down. That being the case, there is no need to solve the problem of making a rational agent corrigible.
The complexity of the target (final) value system is irrelevant to the complexity of the hooks. Humans can learn deference and respect in culturally complex ways, e.g. deferring to various gods and ancestors, but that is still based on the same simple hooks.
Arguments against alignment can’t be redeployed as arguments against corrigibility, because they are not the same. This, for example:
‘So far as I can tell, there are still a number of EAs out there who did not get the idea of “the stuff you do with gradient descent does not pin down the thing you want to teach the AI, because it’s a large space and your dataset underspecifies that internal motivation” and who go, “Aha, but you have not considered that by TRAINING the AI we are providing a REASON for the AI to have the internal motivations I want! And have you also considered that gradient descent doesn’t locate a RANDOM element of the space?”’
…is an argument against alignment, not corrigibility.
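(As an aside, the underspecification point in that quote can be made concrete with a toy sketch. This is my own illustration, not anything from the quoted author; the model, data, and numbers below are all made up for the purpose. The idea: when a model has more free parameters than the data constrains, gradient descent from different random initialisations finds different functions that fit the training set equally well but diverge off it.)

```python
import numpy as np

# Three training points; a degree-5 polynomial (6 coefficients) is underdetermined,
# so many different coefficient vectors fit the data exactly.
x_train = np.array([-1.0, 0.0, 1.0])
y_train = np.array([1.0, 0.0, 1.0])  # consistent with y = x**2, among many others

def features(x):
    """Polynomial feature map: [1, x, x^2, ..., x^5]."""
    return np.stack([x**k for k in range(6)], axis=-1)

def fit(seed, steps=20000, lr=0.05):
    """Plain gradient descent on mean squared error from a random initialisation."""
    w = np.random.default_rng(seed).normal(size=6)
    X = features(x_train)
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y_train) / len(y_train)
        w -= lr * grad
    return w

w_a = fit(seed=1)
w_b = fit(seed=2)

# Both runs drive training error to ~0 ...
print("train preds A:", features(x_train) @ w_a)  # ≈ [1, 0, 1]
print("train preds B:", features(x_train) @ w_b)  # ≈ [1, 0, 1]

# ... yet the two learned functions disagree away from the training data.
x_test = np.array([2.5])
print("off-distribution A:", features(x_test) @ w_a)
print("off-distribution B:", features(x_test) @ w_b)
```

A degree-5 polynomial and three data points just keep the example small; any sufficiently overparameterised model trained the same way shows the same behaviour, which is what the training data not pinning down the learned function amounts to.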