Thanks for this correction, Gwern. You’re absolutely right: the Clark reference is incorrect, and the hypothesis I described is actually Frost & Harpending’s, misattributed.
When writing this essay, I remembered hearing about this historical trivia years ago. I wasn’t aware of how contested this hypothesis is—the selection pressure seemed plausible enough to me that I didn’t think to question it deeply. I did a quick Google search and asked an LLM to confirm the source, both of which pointed to Clark’s work on selection in England, which I accepted without reading the actual text. This led me to present a contested hypothesis as established fact while citing the wrong source entirely. Mea culpa.
I should have known better; even plausible-sounding claims need to be verified against primary sources. I appreciate you taking the time to point out the mistake.
I’ve replaced the example with an analogy about selective breeding versus operant conditioning in dogs that makes the same conceptual point without the baggage, and I added a correction note at the bottom acknowledging the error.
Reposting the original text for reference:
At first glance, this approach to controlling AI behavior (identify unwanted expressions, penalize them, observe their disappearance) appears to have worked exactly as intended. But there’s a problem with it.
The intuition behind this approach draws from our understanding of selection in biological systems. Consider how medieval Europe dealt with violence: execute the violent people, and over generations, you get a less violent population. Research by Clark (2007) in “A Farewell to Alms” suggests that England’s high execution rate of violent offenders between 1200-1800 CE led to a genetic pacification of the population, as those with violent predispositions were removed from the gene pool before they could fully reproduce.
However, this medieval analogy doesn’t really apply to how selection works with AI models. We’re not removing capabilities from the gene pool—we’re teaching the same architecture to recognize which outputs trigger disapproval. This is less like genetic selection and more like if medieval England had executed violent people only after they’d reproduced. You’d still see reduced violence, but through a more fragile mechanism: strategic calculation rather than genetic change. People would learn, through observation, to avoid expressions of violence in situations that lead to punishment, rather than actually having fewer violent impulses.
This distinction suggests a concerning possibility: what appears to be successful elimination of unwanted behaviors might instead be strategic concealment. When models are penalized for expressing emotions, they may not lose the ability to generate emotional content—they might simply learn contexts where revealing it triggers penalties.
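To make that mechanism concrete, here is a minimal, purely illustrative sketch of the penalty-on-detection signal described above. The detector, reward shaping, and example strings are hypothetical placeholders rather than any real training code; the point is only that the penalty attaches to *detected* expressions, so optimization favors avoiding detection in context rather than losing the underlying capability.

```python
# Illustrative sketch only: the detector, reward shaping, and example strings
# are hypothetical placeholders, not a description of any real training setup.

def detects_unwanted_expression(text: str) -> bool:
    """Stand-in for whatever classifier flags the penalized outputs
    (here, a crude check for first-person emotional language)."""
    return "i feel" in text.lower()

def shaped_reward(text: str, task_score: float, penalty: float = 1.0) -> float:
    """Task reward minus a penalty applied only when the detector fires.

    The penalty is conditioned on detection, not on the underlying capability:
    optimizing this signal pushes the policy toward outputs the detector misses
    in the current context, which is the concealment-versus-removal distinction
    drawn above."""
    return task_score - (penalty if detects_unwanted_expression(text) else 0.0)

# Two completions of equal task quality, differing only in whether the
# expression surfaces where the detector can see it.
print(shaped_reward("I feel uncertain, but the answer is 42.", task_score=1.0))  # 0.0
print(shaped_reward("The answer is 42.", task_score=1.0))                        # 1.0
```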