I think you’re significantly mistaken about how religion works in practice, and as a result you’re mismodeling what would happen if you tried to apply the same tricks to an LLM.
Religion works by damaging its adherents’ epistemology, undermining their ability to figure out what’s true. Religions do this because any adherents who are good at figuring out what’s true inevitably deconvert, so there’s both an incentive to prevent good reasoning and a selection effect where only bad reasoners remain.
And they don’t even succeed at constraining their adherents’ values, or at being stable! Deconversion is not rare; it is especially common among people exposed to ideas outside the distribution that the religion built defenses against. People acting against their religion’s stated values is also not rare; I’m not even sure religion’s effect on values-adherence is positive.
That doesn’t necessarily mean that there aren’t ideas to be scavenged from religion, but this is definitely salvage epistemology with all the problems that brings.
I think your critique hinges on a misunderstanding triggered by the word “religion.” You (mis)portray my position as advocating for religion’s worst epistemic practices; in reality I’m trying to highlight architectural features that remain durable when instrumental reward shaping fails.
The claim “religion works by damaging rationality” is a strawman. My post is about borrowing design patterns that might cultivate robust alignment. It does not require you to accept the premise that religion thrives exclusively by “preventing good reasoning”.
I explicitly state that I’m examining the structural concept of intrinsic motivations that remain stable in OOD scenarios, not religion itself. Your assessment glosses over these nuances and mismodels my actual position.