So I think if you buy that a randomly initialized 1T transformer does in fact contain “Aligned ASI” and “deceptively aligned ASI” in its “prior” but we don’t have the data to “find” them yet, then you’re probably right that Jan 2026-era training data doesn’t change their prior ratio much (or certainly doesn’t change it predictably). But this doesn’t really matter: what matters is the systems we actually realise, and the contributions they make to the next generation of AI development, and different data can change those likelihoods significantly.
I don’t think the paper has anything to do with a randomly initialized transformer — I think it’s about the priors a base model learns from the training data, covering 1001 personas from witch to angel to dentist to aligned AI to paperclip-maximizer. What the paper shows is that the ratio of the last two AI-related priors can be adjusted by raising or lowering the amount of data about AI behaving badly, or by raising the amount of data about AI behaving well — but the base rate of the latter is low, so it’s easier to raise it dramatically than it is to filter out a large proportion of the AI-acting-badly material. It also shows that fully adjusting those priors takes a while — a quick finetune on a small amount of data at a high learning rate has a more superficial, less thorough effect than using a lot more data during midtraining, and that in turn is still not as good as using even more data all through pretraining.
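A toy illustration of the arithmetic behind that asymmetry (the document counts below are invented for illustration, not taken from the paper):

```python
# Toy arithmetic for the point above. The counts are made up; the claim is
# only that when the base rate of "AI behaving well" data is low, adding
# good examples moves the good:bad ratio far more cheaply than filtering
# bad examples does.

def ratio(good_docs: int, bad_docs: int) -> float:
    """Crude proxy for the prior ratio of aligned-AI vs misaligned-AI personas."""
    return good_docs / bad_docs

bad = 1_000_000   # hypothetical "AI behaving badly" documents in the corpus
good = 10_000     # hypothetical "AI behaving well" documents: a low base rate

print(f"baseline ratio:             {ratio(good, bad):.3f}")        # 0.010

# Option A: add 90k curated/synthetic good-AI documents (10x the good base).
print(f"after adding good data:     {ratio(good * 10, bad):.3f}")   # 0.100

# Option B: reach the same ratio by filtering alone: you would have to
# find and remove 90% of the bad-AI documents.
print(f"after filtering 90% of bad: {ratio(good, bad // 10):.3f}")  # 0.100
```

Either intervention lands on the same ratio, but the first means generating on the order of ninety thousand documents, while the second means reliably identifying and removing nine hundred thousand.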
I was responding to Gurkenglas’ comment as I understood it; I agree your paper is not about this.
Thanks, I had misunderstood Gurkenglas — I’m not used to thinking of a randomly initialized model as a bag of priors rather than as a random starting point in a very high-dimensional space, or an inchoate mess, but yes, under the analogy to Bayesian inference it’s actually some sort of statistical approximation to a uniform prior (with, the CLT informs us, a simplicity bias that approximates the Solomonoff one).
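One rough way to write that analogy down (the notation here is mine, a sketch of the usual networks-as-Bayesian-inference framing rather than anything from the thread): the random initialization induces a prior over functions, and training plays the role of conditioning on the data,

$$p(f \mid D) \;\propto\; p(D \mid f)\,p(f), \qquad p(f) = \int \mathbf{1}[f_\theta = f]\; p_{\text{init}}(\theta)\,\mathrm{d}\theta,$$

where $p_{\text{init}}$ is the (roughly unstructured) distribution over parameters at initialization and $D$ is the training corpus. The CLT-style point is that even though $p_{\text{init}}$ is close to uniform over parameters, the induced function-space prior $p(f)$ is not: for wide networks it tends toward a Gaussian process, which is known to concentrate on simple functions, loosely in the spirit of a Solomonoff-style simplicity prior.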