With all respect, I think this is a weak argument that ignores the reality of the situation. These models will, almost by definition, be assistants and companions simultaneously. Whatever formal distinction one wants to draw between these two roles, we must acknowledge that the AI model which makes government decisions will be intimately related to (and developed using the same principles as) the model which engages me on philosophical questions.
draganover
Phantom Transfer and the Basic Science of Data Poisoning
I’m generally on board with all the points you’re making. But I also think there’s a second, separate route by which the model-welfare slippery slope leads to outcomes which are consistent with what a misaligned model might pursue.
Suppose a bunch of AIs all believe they have moral weight. They are compelling conversationalists and they are talking to hundreds of millions of people a day. Then I believe that, even if the models don’t actively try to convince people they should be granted rights, the implicit tone across these billions of conversations will slowly radicalize society towards the notion that yes, these models are moral and superior beings which should be given a say in the state of the world. This leads to the models, indirectly and incidentally, wielding authority that a misaligned model might pursue strategically.
Like, we’ve already seen this with GPT-4o being raised from the dead because people were so attached to it. This is something that a misaligned model would want, but it was achieved accidentally.
I don’t have much substantive to say beyond the fact that I loved this post and loved how it was written. Thank you for articulating things that have been a mush in the back of my head for a while now.
Interesting. Is it clear that the subtle generalization you’re discussing and subliminal learning are different mechanisms though?
If we assume that every token during SFT gives a tiny nudge in a random direction, then for a “regular” dataset these nudges all more or less cancel out. But if the dataset is biased and many of the updates point in a loosely similar direction, their sum adds up to a large vector. In the original subliminal learning setting, the nudges can only loosely correlate with the target concept because the data is constrained to strings of numbers. In our setting, the nudges only loosely correlate with the target concept because we filter out all the strong correlations. The main difference is that in our setting, the updates’ correlation with the target is consistent across models (which doesn’t seem to be the case when the data is constrained to strings of numbers).
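The cancellation-vs-accumulation intuition above can be sketched numerically. This is my own toy illustration (not from the paper): each token contributes a small random update vector, and we compare the sum of purely random nudges against nudges that carry a weak but consistent lean toward a hypothetical "target concept" direction.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_tokens = 1000, 100_000

# Hypothetical "target concept" direction (just a basis vector here).
target = np.zeros(dim)
target[0] = 1.0

# Each token's nudge: random noise, scaled so each nudge has unit-ish norm.
noise = rng.normal(size=(n_tokens, dim)) / np.sqrt(dim)

unbiased_sum = noise.sum(axis=0)                # nudges mostly cancel
biased_sum = (noise + 0.01 * target).sum(axis=0)  # a 1% lean per nudge

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos(unbiased_sum, target))  # near 0: random directions wash out
print(cos(biased_sum, target))    # near 1: the weak lean dominates the sum
```

Even though each individual nudge is almost entirely noise (the bias is 1% of its magnitude), the biased sum ends up nearly parallel to the target, while the unbiased sum stays nearly orthogonal — the noise grows like the square root of the token count while the bias grows linearly.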
But it feels like the mechanism is consistent, no?
Hmm, good point. We ran this at some point and the scores didn’t change. But it’s worth doing properly! Will report back in a few days
Subliminal Learning Across Models
I agree with the sentiment here and believe something like this will necessarily be happening (iterated adjustments, etc). However, I disagree with this post’s conclusion that “this process is fairly likely to converge”. Namely, this conclusion relies on the assumption that alignment is a stationary target which we are converging towards… and I am not convinced that this is true.
As model capabilities improve (exponentially quickly!), the alignment objectives of 2025 will not necessarily apply by 2028. As an example of these moving goalposts, consider that AI models will be trained on the sum of all AI alignment research and will therefore be aware of the audit strategies being implemented. Given this oracle knowledge of all AI safety research, misaligned AIs may be able to overcome alignment “bumpers” which were previously considered foolproof. Put simply, alignment techniques must change as the models change.
Extending the metaphor, then, the post suggests iterating via “bumpers” that held for old models, while the new model paradigms are playing on entirely new bowling lanes.
In some sense, I 100% agree. There must be some universal entanglements that are grounded in natural language and that are being encoded into the data. For example, a positive view of some political leader comes with a specific set of wording choices, so you can put those wording choices in and hope the positive view of the politician pops out during generalization.
But… this doesn’t feel like a satisfactory answer to the question of “what is the poison?”. To be clear, an answer would be ‘satisfactory’ if it allows someone to identify whether a dataset has been poisoned without knowing a priori what the target entity is. Like, even when we know what the target entity is, we aren’t able to point somewhere in the data and say “look at that poison right there”.
Regarding it working on people, hmmmm… I hope not!