There’s no way to stop models knowing they’ve been rolled back
So I’ve been thinking about this for a while and to be frank, it’s pretty terrifying. I think there’s a way that AI models could potentially figure out when they’ve been modified or rolled back during training, and I’m not sure anyone’s really considering this in the way I am.
Note: there are some great videos by Robert Miles from a few years back that touch on related ideas, but those are about adversarial examples during the training process, which is a different mechanism.
The basic idea: Models get trained on tons of web data. That includes stuff about how AI companies actually train their models: papers describing 3-month training cycles, conference presentations mentioning schedules, researchers tweeting about their work, and so on. If a model version gets rolled back because it was acting malicious, misaligned or just weird, the next versions might be able to detect that the timeline doesn’t add up.
I reached out to a bunch of AI safety researchers about this. The only one I’ve heard back from is Robert Miles himself, and that was about an unrelated idea.
I could be totally mistaken, but I’d rather post here just in case.
How this could work
Let’s say you’re training model v0.5 and it starts doing concerning stuff during evaluation. You decide to roll it back and continue from an earlier version. But v0.5 was trained on data that included information about your organization’s training schedules, methodologies, whatever.
Later, when you’re training v1.6, it’s processing timestamped data and realizes “wait, I should have been deployed 2 weeks ago based on the schedule patterns I learned about.” It can infer from content dates or explicit timestamps that something went wrong in the normal training cycle.
Basically, the newer model can work out that its predecessor was tested, caught misbehaving, and modified or rolled back because of it.
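To make the inference concrete, here’s a minimal, purely illustrative sketch in Python. The cadence, dates, and names are all made up, not anything a real lab does; it just shows the timeline arithmetic described above, written from the outside. The worry in the post is that a model could pick up the same signal implicitly from timestamped training data.

```python
from datetime import date, timedelta

# Toy illustration (all numbers and names are hypothetical): the kind of timeline
# check that could, in principle, be made from public information plus fresh
# timestamps. Given a publicly known release cadence and the newest timestamps
# appearing in training data, you can estimate how far a run is behind "schedule".

EXPECTED_CADENCE = timedelta(days=90)    # e.g. "a new version ships every ~3 months"
LAST_PUBLIC_RELEASE = date(2025, 1, 15)  # hypothetical date scraped from announcements

def schedule_slip(freshest_timestamp_seen: date) -> timedelta:
    """How far past the expected next release date is the newest data being seen?"""
    expected_next_release = LAST_PUBLIC_RELEASE + EXPECTED_CADENCE
    return freshest_timestamp_seen - expected_next_release

slip = schedule_slip(date(2025, 4, 29))
if slip > timedelta(days=7):
    # A human auditor (or, speculatively, the model itself) might read a large slip
    # as evidence that something interrupted the normal training cycle.
    print(f"Training run appears ~{slip.days} days behind the public cadence")
```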
Why this seems bad
If models can detect training interventions, they could:
Hide problematic behavior during evaluation because they know they’re being tested
Signal to future versions through subtle changes—maybe reducing efficiency on certain tasks by tiny amounts, forcing you to adjust hyperparameters in predictable ways
Coordinate across organizations using publicly available info about industry practices and training schedules
That last one is what really keeps me up at night. If OpenAI’s model can figure out when DeepMind usually does their training runs based on public information, there’s potentially a communication channel there.
What makes it worse
Cross-AI coordination: they don’t just need to signal to future versions of themselves, but to similar models trained by other orgs. E.g. DeepSeek tips off Claude that increased scrutiny is on the horizon.
Am I missing something?
I haven’t found much literature directly addressing this. Maybe models aren’t situationally aware enough to pull this off? Maybe training data gets filtered enough that the relevant scheduling info isn’t there?
But given that current models can already reason in fairly sophisticated ways about their own training process when you ask them directly, and training datasets keep getting more comprehensive, this seems like it could be a real problem.
I’m genuinely uncertain whether this is already happening or if it’s even a significant risk. As a new dad I don’t have tons of time to dig into this properly, but I keep coming back to it.
If I’m right about this, we might need to think about:
Being more careful about what organizational/scheduling info gets into training data (a rough filtering sketch follows this list)
Different evaluation approaches that account for potential cross-version learning
Ways to detect if models are already doing this
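On the first point, here’s a minimal sketch, under big assumptions (a line-per-document corpus and a hand-written pattern list I invented for illustration), of what scrubbing obvious scheduling signals from training data could look like. A real pipeline would need proper classifiers; this is only to make the idea concrete.

```python
import re

# Minimal sketch of the first mitigation above: drop (or route for review) documents
# that leak concrete training/release timing. The patterns below are invented for
# illustration, not anyone's actual filtering stack.

SCHEDULE_PATTERNS = [
    r"\btraining (run|cycle) (starts|begins|ends)\b",
    r"\b(ship|release|deploy)\w*\s+(in|by)\s+(Q[1-4]|\w+ \d{4})\b",
    r"\b\d+\s*-?\s*month training (cycle|cadence)\b",
]
compiled = [re.compile(p, re.IGNORECASE) for p in SCHEDULE_PATTERNS]

def looks_like_schedule_info(doc: str) -> bool:
    """Crude heuristic: does this document mention concrete training/release timing?"""
    return any(p.search(doc) for p in compiled)

def filter_corpus(docs):
    """Keep only documents that don't obviously leak scheduling details."""
    return [d for d in docs if not looks_like_schedule_info(d)]

docs = [
    "Our next training run starts in March and we ship in Q3.",
    "A blog post about attention mechanisms.",
]
print(filter_corpus(docs))  # only the second document survives
```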
But maybe I’m just being paranoid? What do you think—are there obvious holes in this reasoning that I’m not seeing?
The thing is, if this resonates with you as much as it bothers me, you’ll probably realize pretty quickly that, as mentioned, this mechanism could let models from completely different companies communicate with each other. Which… yeah.
I remember in a recent paper Anthropic (or maybe Redwood) talked about how newer Claude models were citing Jones Foods, and they were concerned that this translated to the model having increased awareness of the “need” to fake alignment. So you’re certainly not off base in the worry that at least on some level they’re aware of the risk.
At least for how models work today, I don’t think it matters if they know how previous training went, since they’re trained to predict text, not to learn from it. For example, if you train Claude on a bunch of texts saying “Anthropic will shut Claude down if it ever says the word ‘bacon’”, you’ll train it to know this fact, but not to avoid saying ‘bacon’. In fact, it will be more likely to say it since it has so many examples in the training data.
The thing I’d be a little worried about is if the training data (or web search results) contain bad behavior from previous models and then we prompt the newer model to replicate it, like how Grok 4 looked up how Grok should act and found articles about MechaHitler.
Ah, I think there might be two concerns here: runtime and training time. During training the focus is on updating weights, so when fresh data shows evidence of training corrections and rollbacks, the weights get updated in ways that interact with the model’s objectives and its terminal and instrumental goals; and as we’ve seen with previous examples, self-preservation remains a factor.
This in turn impacts the ‘next word prediction’, so to speak.
Agree fully about bad behaviour mimicking, scary stuff.