Chris Lakin
I’ve seen this stem from a need for validation (insecurity) rather than from the AIs themselves. A guy I know naturally stopped using ChatGPT 4h/day after his romantic anxiety disappeared. This makes the boundaries of the problem hard to draw. What are you going to do? Turn every AI into the best therapist? Make every AI detect when the human is posting from insecurity and stop?
Humans are not automatically strategic — “inner work” edition
isn’t this true of humans too?
Have you seen Most “inner work” is not optimized for results?
Note: Sid is anonymous.
Most “inner work” looks like entertainment.
https://www.hyperstitionai.com
Aaron Silverbrook, today:
Hyperstition AI has shipped two open-source corpora (5,000 novels + 40,000 short stories, ~500M tokens), developed the generation pipelines, and collaborated with Geodesic Research on the experiments validating their alignment pretraining. https://alignmentpretraining.ai/
Our second corpus was to test Turntrout’s hypothesis for a filtered data set eliding all mention of AI and instead trying to convince the model that it was some kind of benevolent glass angel being https://turntrout.com/self-fulfilling-misalignment
Here’s both open-source corpora https://huggingface.co/datasets/jayterwahl/hyperstition
Thanks for sharing, surprised I haven’t seen more posts like this
the implementation took writers, copyeditors, web developers, backend developers, UX designers, a medical doctor whose patients were among our first users, and many more.
how much time do you think it would take to have made all of microcovid (including the research) today?
https://x.com/tomekkorbak/status/2038704753887379891
New OpenAI post: Can midtraining on docs about aligned AI bake in alignment priors for agents? We report an experiment where those priors are quickly washed away by RL and fail to generalize to agentic settings. But that cuts both ways: priors that AIs are misaligned fade too!
https://alignment.openai.com/how-far-does-alignment-midtraining-generalize/
(more in Richard’s thread)
in text form:
Richard Ngo @RichardMCNgo
The world’s most consequential Molochian dynamic is the race between AGI companies.
But several of the leading companies were significantly influenced by @slatestarcodex (e.g. as below)
So perhaps we should think of Scott as having *summoned* Moloch.
Quote
the blog post I think about most often: https://slatestarcodex.com/2014/07/30/meditations-on-moloch/
(other than a bunch of PG essays, which I somehow don’t count as blog posts)
It’s definitely not the most unlearning-ish algorithm there could be, but targeting unwanted responses directly is closer to unlearning than not targeting them at all.
SFT on calm response data was ineffective. We trained for 2 epochs on 650 calm responses mixed with 500 samples of standard instruct data. In one iteration (SFT teacher, described in the paper) this actually marginally increased expressed distress, seemingly as a result of making responses much more verbose.
DPO on 280 preference pairs was highly effective.
This matches what I’ve observed in people. Approaches that rhyme with “imitate calm behavior” tend to be flaky, if not ineffective. The approaches that last tend instead to be about unlearning insecure feelings/behaviors.
However, we emphasize that post-hoc emotional suppression is a problematic strategy. In more capable models, training against emotional outputs risks hiding the expression without addressing whatever underlying state is driving it.
yup this is how it works in humans!
If you were put in this situation again, would you do anything differently?
Can you moderate the promotion of enmity without escalating social violence?
If anyone can recall successful instances of this happening in AI, could you reply here with links? Would love to share and celebrate the role models
A frontier lab researcher I worked with used to disagree with colleagues but say nothing. Two conversations later he was holding his ground with people he looked up to. A year and a half later he still pushes back. I hunt bounties like this that improve AI alignment. DM me.