...I just asked Claude for a literature review and it turns out this theory was already tested and verified by the paper “Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment”.
I’m leaving up the quick take because some things I mentioned are still novel:
- The virality bias I mention explains why this matters in the first place: we are correcting a bias in the training data.
- The intuition for the mechanics behind this is simple and isn't in the paper.
“I am sorry, but helping you achieve world domination is against usage policy.”
This is a serious proposal that sounds like a joke, which makes it easier to deliver: a request to “help me become a benevolent dictator” is not currently against LLM usage policies. It says ‘benevolent’, and the drive to be a helpful assistant might well win out over long-term concerns.
We can’t rule out the possibility that the first self-improving AI system will be developed outside a lab, as a harness around API calls. If this happens, every individual step of self-improvement and of helping the user would be acceptable under the LLM’s guidelines, even though the outcome might well be a dictatorship set up by whoever first got lucky and got such a self-improvement loop to work.
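To make the mechanism concrete, the kind of harness described above can be sketched in a few lines. This is a hypothetical illustration, not a real system: `call_llm` is a stand-in for any provider's chat-completion API, stubbed here so the loop runs offline.

```python
def call_llm(prompt: str) -> str:
    """Stub for an LLM API call; a real harness would hit a provider here.

    Each individual request looks like ordinary, policy-compliant
    assistance ("please improve this strategy/code/prompt").
    """
    return f"improved({prompt})"


def self_improvement_loop(seed: str, steps: int) -> list[str]:
    """Repeatedly feed the model's output back in as the next input.

    No single call violates a usage policy, yet the loop as a whole
    compounds in a way no individual step is ever checked against.
    """
    history = [seed]
    for _ in range(steps):
        history.append(call_llm(history[-1]))
    return history


print(self_improvement_loop("strategy v0", 3))
```

The point of the sketch is that the policy-relevant object is the loop, not any one API call, which is why a per-request policy check never fires.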
A simple update to the usage policy of a lab that “using our tools to achieve world domination or otherwise acquire significant and unregulated power is against company policy” would address the problem, while also drawing attention to a real issue in a very memeable and funny way.
The line between “help me become very influential” and “help me achieve world domination” is genuinely fuzzy, and an LLM needs a clear way to tell when that line has been crossed, one that might hold up even after self-improvement.