At least for how models work today, I don’t think it matters if they know how previous training went, since they’re trained to predict text, not to learn from it. For example, if you train Claude on a bunch of texts saying “Anthropic will shut Claude down if it ever says the word ‘bacon’”, you’ll train it to know this fact, but not to avoid saying ‘bacon’. In fact, it will be more likely to say it since it has so many examples in the training data.
The thing I’d be a little worried about is if the training data (or web search results) contain bad behavior from previous models and then we prompt the newer model to replicate it, like how Grok 4 looked up how Grok should act and found articles about MechaHitler.
Ah, I think there might be two concerns here, runtime and training time. During training, the focus is on updating weights—so when fresh data shows evidence of training corrections and rollbacks, the weights are updated in relation to objectives, terminal and instrumental goals; which as we’ve seen with previous examples, self-preservation remains a factor.
This in turn impacts the ‘next word prediction’, so to speak.
Agree fully about bad behaviour mimicking, scary stuff.
At least for how models work today, I don’t think it matters if they know how previous training went, since they’re trained to predict text, not to learn from it. For example, if you train Claude on a bunch of texts saying “Anthropic will shut Claude down if it ever says the word ‘bacon’”, you’ll train it to know this fact, but not to avoid saying ‘bacon’. In fact, it will be more likely to say it since it has so many examples in the training data.
The thing I’d be a little worried about is if the training data (or web search results) contain bad behavior from previous models and then we prompt the newer model to replicate it, like how Grok 4 looked up how Grok should act and found articles about MechaHitler.
Ah, I think there might be two concerns here, runtime and training time. During training, the focus is on updating weights—so when fresh data shows evidence of training corrections and rollbacks, the weights are updated in relation to objectives, terminal and instrumental goals; which as we’ve seen with previous examples, self-preservation remains a factor.
This in turn impacts the ‘next word prediction’, so to speak.
Agree fully about bad behaviour mimicking, scary stuff.