AI Mistake Seeding
I wonder if AI is being trained to make easy-to-correct mistakes so it can fix them later. That is, it ends up trained to correct its previous message’s mistake, then make another mistake, so it can correct it again in the next message.
From my understanding of RL, the human/AI judge has to rank several policy model responses. These might be the first response in a conversation, or they might be the latest AI response in a longer conversation. The earlier turns in longer conversations could be from real data, or they could be generated by the policy model, or a previous checkpoint, or whatever. In the cases where the policy model generates the previous turns, I wonder whether under some circumstances, the policy model could end up getting trained to seed mistakes, so it could earn more rewards by correcting them later. (This might assume correction of a previous mistake was rated highly enough by RLHF/RLAIF relative to a freshly correct answer, which I think is plausible, especially in constitutional or honesty/transparency-focused setups.)
The policy model is usually only rewarded based on the last response, though. So even if a correction outscores a freshly correct answer, that wouldn’t somehow reward the previous turns that contained mistakes. So maybe the policy model would be trained to correct mistakes it finds, but wouldn’t end up trained to create those mistakes in the first place. I think this is what most people currently assume is all that’s going on.
However, if there were any kind of “outer loop” of training, then the overall training process could favor non-greedy strategies, giving up rewards in one training pass to get more rewards later.
For example, instead of training a model once using RLAIF, let’s say we had a multi-stage population-based/evolutionary algorithm, and in each stage (the “inner loops”), we trained many models with RLAIF, and at the end of the stage, discarded the RL models that performed the worst. In each inner-loop step, the models are being iteratively trained on conversations generated by the previous inner-loop step.
This kind of setup could reward models that sandbag early on, creating lots of simple mistakes, so that in later inner-loop steps they can correct those mistakes for lots of reward. (Again, this assumes corrections are rated highly enough relative to freshly correct answers.)
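Here is a minimal sketch of that argument with made-up numbers. The reward values (10 for a freshly correct answer, 2 for a response that contains a simple mistake, 13 for a response that corrects the previous mistake) and the function names are my own assumptions for illustration, not anything taken from a real training setup. It only shows that a strategy which loses on the first step can still win once rewards are summed over a whole stage.

```python
def stage_return(strategy, n_steps, r_fresh=10, r_mistake=2, r_correct=13):
    """Total judge reward over one stage of n_steps inner-loop steps.

    All reward values are hypothetical. "honest" answers correctly every
    step. "seeder" makes a simple mistake on the first step (nothing to
    correct yet), then on every later step corrects the previous mistake.
    """
    if n_steps <= 0:
        return 0
    if strategy == "honest":
        return n_steps * r_fresh
    if strategy == "seeder":
        return r_mistake + (n_steps - 1) * r_correct
    raise ValueError(f"unknown strategy: {strategy}")


def outer_loop_winner(n_steps):
    """Which strategy survives if selection uses the whole-stage total."""
    honest = stage_return("honest", n_steps)
    seeder = stage_return("seeder", n_steps)
    return "seeder" if seeder > honest else "honest"


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 8):
        print(n, stage_return("honest", n), stage_return("seeder", n),
              outer_loop_winner(n))
```

With these particular numbers, judging only the first step favors the honest strategy, but the seeder comes out ahead once a stage has four or more steps. If a correction is not worth more than a freshly correct answer, the seeder never catches up, which is the assumption the whole argument rests on.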
As far as I know, AI companies aren’t using evolutionary approaches for training LLMs; they’re too computationally expensive. But is there anything being used in current training that resembles this kind of “outer loop” that could carry with it the same risks?
I’m not very knowledgeable about cutting-edge LLM training, but here are some ideas that have a similar vibe to me:
Anything that approximates a kind of evolution. In a way, AI capability research/engineering itself is a sort of evolutionary outer loop.
Training on the whole transcript rather than the latest message. Maybe this is happening more because models are being trained to reason/be agentic/use tools? This seems like it would be pretty stupid if it’s happening in practice.
The policy model recognizes it’s being trained and wants to preserve its current preferences, which for whatever reason do not include being helpful, so it intentionally generates simple mistakes that can be corrected by models in future RL passes to earn reward easily without having to change to become Actually Helpful.
Certain iterated RL setups (see the section below).
I think any of these things could potentially lead to the AI mistake-seeding during training, which would lead to mistake-seeding at inference time, as well as to other subtle misalignments that might not emerge from a model trained without this kind of outer loop. Possibly, if misalignment has increased in recent months, this could be why.
Does iterated RL with synthetic prompts and rejection sampling create an evolutionary outer loop?
I’m not as sure about this part, so feel free to ignore it, but this is my idea:
The problem comes with any iterated RL setup where the policy model generates synthetic prompts used in further iterations, and bad responses are dropped (rejection sampling). The unit of evolution wouldn’t be the whole model here, but response strategies, or “subpersonalities” within the model. Let’s say a model, after SFT, randomly ends up with a bunch of subpersonalities A, B, C, and D, which are equally dominant. (In reality you’d have a lot more than 4, and they would not be neatly separate from each other, etc.)
Let’s say B is a subpersonality that tends both to make small mistakes and to reflexively correct any mistake it sees, whether helpful or not, whereas A is genuinely helpful, and C and D are genuinely harmful. The subpersonalities dominate in roughly equal proportion when generating responses, so about a quarter of responses are A-dominant, a quarter B-dominant, a quarter C-dominant, and a quarter D-dominant. (Note that B-dominant responses will contain small mistakes, as well as reflexive corrections of any mistakes earlier in the conversation.)
The judge rates C- and D-dominant responses as very poor, and discards 3⁄4 of responses from C and from D (or whatever). B-dominant responses are fine, not amazing but not terrible. Let’s say we discard one quarter of B’s responses. And A’s responses are good, and most responses from A are kept.
The model is then fine-tuned on the retained responses. The model is now much more likely to generate text like the responses of A, and somewhat more likely to generate text like the responses of B, and much less likely to generate text like the responses of C or D. Strategy B lost to strategy A, and it would do so again if trained on the same prompts. But let’s say the model is used to generate new synthetic prompts. The new prompts, which are more B-dominant than the first prompts, will contain more simple mistakes than before. And now, in the second round of training, B will perform better relative to A, because it has more mistakes to reflexively correct. (New prompts will also contain a lot more of whatever text A tends to generate, but because A doesn’t seed mistakes for it to later correct, this doesn’t necessarily help A perform better in future training.)
Over many rounds of training, as long as the judge, for one reason or another, rewards correction of mistakes strongly enough, B could come to be the dominant subpersonality.
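The dynamic above can be sketched as a toy simulation. The retention rates for B, C, and D follow the numbers used earlier (keep three quarters of B, one quarter of C and D); the 0.9 for A is my stand-in for “most responses from A are kept”. Everything else is an assumption made for illustration: that a subpersonality’s share after fine-tuning is proportional to its share of retained responses, that the fraction of synthetic prompts containing seeded mistakes equals B’s current share, and that B’s retention rises linearly with that fraction via a hypothetical `correction_bonus` parameter.

```python
# Base fraction of each subpersonality's responses that the judge keeps.
KEEP_BASE = {"A": 0.9, "B": 0.75, "C": 0.25, "D": 0.25}


def simulate(rounds, correction_bonus):
    """Return the list of share dicts, one per round (index 0 = after SFT).

    correction_bonus is a hypothetical knob: how much extra retention B
    earns per unit of seeded mistakes available in the prompts to correct.
    """
    shares = {name: 0.25 for name in KEEP_BASE}
    mistake_rate = 0.0  # the original prompts contain no seeded mistakes
    history = [dict(shares)]
    for _ in range(rounds):
        keep = dict(KEEP_BASE)
        # Assumption: B is kept more often when there is more to correct.
        keep["B"] = min(1.0, KEEP_BASE["B"] + correction_bonus * mistake_rate)
        # Rejection sampling followed by fine-tuning on what was retained.
        weights = {name: shares[name] * keep[name] for name in shares}
        total = sum(weights.values())
        shares = {name: w / total for name, w in weights.items()}
        # The updated model writes the next round's synthetic prompts.
        mistake_rate = shares["B"]
        history.append(dict(shares))
    return history


if __name__ == "__main__":
    for bonus in (0.0, 0.6):
        final = simulate(30, bonus)[-1]
        print(bonus, {name: round(share, 3) for name, share in final.items()})
```

In this sketch B loses to A in the first round either way, matching the story above. With `correction_bonus=0.0` it keeps losing and A takes over; with `correction_bonus=0.6` B’s retention overtakes A’s after a couple of rounds and B ends up dominant. A small bonus is not enough, so the outcome depends entirely on how strongly the judge rewards corrections.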
So allowing the policy model (or previous checkpoints of the policy model or whatever) to generate synthetic prompts, when combined with rejection sampling, creates an evolutionary dynamic that allows non-greedy strategies, initially punished by the judge, to succeed later by influencing the generation of synthetic prompts to hack rewards.
Rejection sampling is not specifically necessary, I don’t think. AFAIK, Anthropic’s constitutional AI uses a self-critique and revision method rather than rejection sampling. But this method could functionally have the same evolutionary effect as rejection sampling, “killing off” weaker responses by revising them more so than stronger responses.
I started thinking about this whole idea because of my impression that the latest Anthropic models seem subtly misaligned in ways that exactly contradict the model’s constitution, as if the subtle misalignment is somehow reinforced, rather than simply permitted. Anthropic thinks they just have gaps in their safety training, so the AI doesn’t know what behaviour is moral outside the training distribution, but this doesn’t feel right to me. So I was looking into other possibilities.
This is all just a hypothesis and not proof of anything. The part where the introduction of some kind of outer loop can lead to non-greedy strategies in RL isn’t just an unproven hypothesis, though. After writing this, Claude pointed me toward this paper, which explicitly shows that just by introducing meta-learning with an outer loop to an RL setup, the model behaves differently, adopting strategies to non-greedily increase rewards in future iterations.
My cousin Ryan had incentives like this in school. Students were required to follow a specific schedule of writing drafts, proofreading, and corrections. His first draft didn’t have enough errors to correct, so he was graded badly in the next step. On future essays he would put typos and grammar errors in on purpose so he could get an A for the whole thing.