Chris Lakin
(more in Richard’s thread)
in text form:
Richard Ngo @RichardMCNgo
The world’s most consequential Molochian dynamic is the race between AGI companies.
But several of the leading companies were significantly influenced by @slatestarcodex (e.g. as below)
So perhaps we should think of Scott as having *summoned* Moloch.
Quote
the blog post I think about most often: https://slatestarcodex.com/2014/07/30/meditations-on-moloch/
(other than a bunch of PG essays, which I somehow don’t count as blog posts)
It’s definitely not the most unlearning-ish algorithm there could be, but targeting unwanted responses directly is closer than not doing it
SFT on calm response data was ineffective. We trained for 2 epochs on 650 calm responses mixed with 500 samples of standard instruct data. In one iteration (SFT teacher, described in the paper) this actually marginally increased expressed distress, seemingly as a result of making responses much more verbose.
DPO on 280 preference pairs was highly effective.
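For anyone unfamiliar with why DPO "targets unwanted responses directly": a minimal sketch of the standard per-pair DPO loss, in plain Python. This is not the paper's code; the function names, the `beta=0.1` value, and the example log-probabilities are all assumptions for illustration. The point is just that the rejected (distressed) response appears in the loss with a negative sign, whereas SFT only ever sees the calm target.

```python
import math

def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen_margin - rejected_margin)).

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    beta=0.1 is an assumed value, not taken from the source.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) == softplus(-z)
    return softplus(-logits)

# Hypothetical numbers: the policy starts identical to the reference...
start = dpo_loss(-50.0, -40.0, -50.0, -40.0)
# ...then moves probability away from the rejected (distressed) response.
after = dpo_loss(-50.0, -60.0, -50.0, -40.0)
print(start, after)
```

Note that the loss drops in the second call even though the chosen response's log-probability did not change at all: pushing down the unwanted response is enough.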
This matches what I’ve observed in people. Approaches that rhyme with “imitate calm behavior” tend to be flaky if not ineffective. The approaches that last tend to instead be about unlearning insecure feelings/behaviors
However, we emphasize that post-hoc emotional suppression is a problematic strategy. In more capable models, training against emotional outputs risks hiding the expression without addressing whatever underlying state is driving it.
yup this is how it works in humans!
If you were put in this situation again, would you do anything differently?
Can you moderate the promotion of enmity without escalating social violence?
If anyone can recall successful instances of this happening in AI, could you reply here with links? Would love to share and celebrate the role models
gavin leech:
maybe a way to anchor the models on stable preferences is to treat them as if they had them and get this idea into the training data.
I’m glad that thinking about incentives/teleology is getting more popular!
I want to believe this but I feel like I’ve vaguely heard stories about people discarding load-bearing copes and suffering for it, e.g. via meditation
I like this post a lot and I’m glad you wrote it, if only for my own understanding. I also appreciate how it engages with Chesterton’s Fence and suggests e.g. “For instance, the closeted homophobe should probably move out of his homophobic social context if he can.”
That said, I wonder whether this post is an infohazard for people immersed in sufficiently strong social incentives. I know you acknowledge this, but still... Have you tested these tools with people in very difficult social contexts?
thank you for saying this
glad to see this written up!
Just wondering, have you seen any evidence of cluster headaches as memetic viruses?
I think it’s accurate to say that people “choose their own self-fulfilling prophecies/identities”… but what makes some self-fulfilling prophecies preferable over others?
I think it’s actually another control process—specifically, the process of controlling our identities. We have certain conceptions of ourselves (“I’m a good person” or “I’m successful” or “people love me”.) We then are constantly adjusting our lives and actions in order to maintain those identities—e.g. by selecting the goals and plans which are most consistent with them, and looking away from evidence that might falsify our identities.
Something about this feels weird to me… where do identities come from, then?
https://x.com/tomekkorbak/status/2038704753887379891
https://alignment.openai.com/how-far-does-alignment-midtraining-generalize/