It may be time to revisit this question. With Owain Evans et al. discovering what looks like a generalized "evil vector" in LLMs, and older work like [Pretraining Language Models with Human Preferences](https://www.lesswrong.com/posts/8F4dXYriqbsom46x5/pretraining-language-models-with-human-preferences) that could use a follow-up, AI in the current paradigm seems ripe for experimentation with parenting practices in pretraining: perhaps affect markers on the incoming text, or pretraining on children's literature before moving on to the more technically and morally complex material?
I haven’t run any experiments of my own, but this doesn’t seem obviously stupid to me.
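To make the affect-marker idea concrete, here's a minimal sketch in the spirit of the conditional-training setup from that paper: tag each pretraining document with a control token based on some affect score, so the model learns a conditional distribution that can be steered toward the "good" mode at inference time. Everything here is hypothetical, not from either paper; the token names and the `score_affect` heuristic are stand-ins for whatever classifier or reward model you'd actually use.

```python
# Sketch of "affect markers" for pretraining text, in the spirit of
# conditional training (Pretraining Language Models with Human Preferences).
# Each document is prefixed with a control token reflecting its affect,
# so the LM learns p(text | marker) and can be steered at sampling time.
# `score_affect` is a toy placeholder, not a real scoring model.

GOOD, BAD = "<|good|>", "<|bad|>"

def score_affect(text: str) -> float:
    """Placeholder: in practice, a trained classifier or reward model
    scoring the document's affect/prosociality in [0, 1]."""
    nice_words = {"kind", "help", "thank", "love", "share"}
    words = [w.strip(".,!?") for w in text.lower().split()]
    return sum(w in nice_words for w in words) / max(len(words), 1)

def tag_document(text: str, threshold: float = 0.05) -> str:
    """Prepend an affect marker so the LM conditions on it during pretraining."""
    marker = GOOD if score_affect(text) >= threshold else BAD
    return f"{marker} {text}"

corpus = [
    "Thank you for your help, that was very kind.",
    "You are worthless and everyone hates you.",
]
for doc in corpus:
    print(tag_document(doc))
# At inference, prompts would be prefixed with <|good|> to steer generation.
```

The children's-literature curriculum would be even simpler mechanically: score each document for (technical and moral) complexity and order the training stream from simple to complex, rather than tagging it.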