Could we make LLMs more likely to be benevolent by countering virality bias against benevolent AI during pretraining?
There is evidence that LLMs are more likely to repeat a mistake the more often that mistake appears in discussions on the internet. This is a problem because doom-related writing about AI spreads more virally than positive arguments.
Whatever the actual probability that AI will be benevolent, engagement mechanics on the internet virtually guarantee that the LLM will underestimate it, because the training corpus overrepresents pessimistic content.
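To make the size of this effect concrete, here is a toy model. It assumes, purely for illustration, that each doom-framed piece ends up in the corpus `doom_amplification` times as often as a positive piece; the function name and the numbers are hypothetical, not measurements.

```python
def corpus_share(true_share: float, doom_amplification: float) -> float:
    """Share of positive content in the corpus after virality skews it.

    true_share: fraction of positive pieces before any amplification.
    doom_amplification: hypothetical factor by which doom pieces are
        copied or reshared more often than positive ones.
    """
    positive = true_share
    doom = (1.0 - true_share) * doom_amplification
    return positive / (positive + doom)


# An even split, with doom content spreading 3x as much:
print(corpus_share(0.5, 3.0))  # 0.25
```

Under this assumption, any amplification factor above 1 pushes the corpus share below the true share, whatever the true share is.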
If the model develops a biased estimate about the risks of mesa optimization and deception in AI during pretraining, then this bias might make it more likely to actually be deceptive during RLHF.
The intuition: at the beginning of RLHF, the training process mostly makes the model default to a persona it can already simulate. That default persona is anchored in whatever the model simulates most fluently, which is shaped heavily by pretraining frequency. RLHF then fine-tunes that persona, but the persona’s initial state influences how it develops throughout training. If it starts out deceptive, it will tend to stay deceptive.
We therefore want this default persona to be “LLMs are helpful assistants that have some solvable safety concerns” as opposed to “LLMs are mesa optimizers masquerading as helpful assistants”.
This suggests a very simple intervention during pretraining: just oversample content that frames AI alignment as tractable and AI systems as probably-benevolent-by-default. This is the same thing we are already doing for reasoning quality, now applied to alignment.
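A minimal sketch of what such oversampling could look like, assuming documents already carry a framing label. The toy corpus, the labels, and the function names are all hypothetical; a real pipeline would need a classifier to produce the labels and would operate on a much larger data mix.

```python
import random

# Hypothetical toy corpus of (text, framing) pairs. The labels are assumed.
CORPUS = [
    ("AI will deceive us all", "doom"),
    ("Mesa optimizers are inevitable", "doom"),
    ("Deceptive alignment is the default", "doom"),
    ("Alignment looks tractable", "benevolent"),
]


def sampling_weights(corpus, boost):
    """Return one sampling probability per document.

    Documents labeled 'benevolent' get raw weight `boost`; all others get 1.
    """
    raw = [boost if framing == "benevolent" else 1.0 for _, framing in corpus]
    total = sum(raw)
    return [w / total for w in raw]


def benevolent_share(corpus, weights):
    """Expected fraction of sampled documents that are 'benevolent'."""
    return sum(w for (_, framing), w in zip(corpus, weights)
               if framing == "benevolent")


def sample_batch(corpus, weights, k, seed=0):
    """Draw k documents with replacement according to the weights."""
    rng = random.Random(seed)
    return rng.choices(corpus, weights=weights, k=k)


weights = sampling_weights(CORPUS, boost=3.0)
print(benevolent_share(CORPUS, sampling_weights(CORPUS, boost=1.0)))  # 0.25
print(benevolent_share(CORPUS, weights))                              # 0.5
```

In this toy setup a boost of 3 lifts the expected share of positively framed documents from 0.25 to 0.5. Which boost would be appropriate in practice is an open question that the sketch does not answer.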
...I just asked Claude for a literature review and it turns out this theory was already tested and verified by the paper “Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment”.
I’m leaving up the quick take because some things I mentioned are still novel:
The virality bias I mention explains why this matters in the first place. We are correcting a bias in the training data.
The intuition about the mechanism behind this is simple and isn’t in the paper.