This is neat! I’m curious how you think this would transfer to other types of “controllability”. Like, I agree that it’s useful to know that models can’t control for specific lexical constraints… but the thing I’d be more worried about is a model’s ability to “steer” its chain of thought to excuse a specific behavior. For example, suppose a model has monitors acting on it to prevent it from scheming, but it learns to bypass these monitors by steering its chain of thought toward reasonable-sounding excuses for a specific behavior, thereby instrumentally arriving at the ability to undertake that behavior. This is akin to some of the thoughts in this post.
What is your recommendation for situations where AI topics are related to politics or other contentious issues? Like, I agree that one shouldn’t just make it clear they’re on a side without reason. But what if there is a policy debate around AI and someone is asked to comment on it as an expert? (I’m omitting specific examples in the spirit of your advice)
Wow I love it. Thank you for formulating this so clearly. I agree with the analogy to prions as being the particularly appropriate one.
I’m kind of confused by the technical analogues. It seems most of them point toward the “training data seeding” route to transmission. But is it clear how this all relates to the training data? In Adele’s post, everything happens in context, and there ostensibly wasn’t data about spirals in the dataset. This was largely an emergent phenomenon. I guess I am missing the insight into how the training data makes this more or less possible.
I feel like the biggest question here is the one you highlighted about persona research. This strikes me as the biggest disanalogy to modern medicine and infectious disease analysis. Today, for any given virus, we have a good understanding of (1) how it infects the host, (2) how it transmits, and (3) what symptoms the host displays. But this wasn’t always the case. Before the 1900s, people understood the symptoms of a parasite and had some vague notion of its routes of transmission, but they had essentially no insight into the mechanism of infection. This is roughly where we are regarding AI “parasitology”. We can clearly define the symptoms (this is what Adele’s post did). And we have some vague understanding of the means of transmission. But what is the mechanism by which models are infected by spiral personas? To your point, it’s not clear what to even define as the spiral persona. Like, what is it as a “thing”?
In any case, I’m also unconvinced that spiral personas are the dominant threat here. The surface area for infectious mechanisms in agent–agent interactions is so huge that it seems unlikely we’ll be able to anticipate the first AI epidemic.
In some sense, I 100% agree. There must be some universal entanglements that are based in natural-language and which are being encoded into the data. For example, a positive view of some political leader comes with a specific set of wording choices, so you can put these wording choices in and hope the positive view of the politician pops out during generalization.
But… this doesn’t feel like a satisfactory answer to the question of “what is the poison?”. To be clear, an answer would be “satisfactory” if it allowed someone to identify whether a dataset has been poisoned without knowing a priori what the target entity is. Like, even when we know what the target entity is, we aren’t able to point somewhere in the data and say “look at that poison right there”.
Regarding it working on people, hmmmm… I hope not!
With all respect, I think this is a weak argument which ignores the reality of the situation. These models will, almost by definition, be assistants and companions simultaneously. Whatever formal distinction one wants to draw between these two roles, we must acknowledge that the AI model which makes government decisions will be intimately related to (and developed using the same principles as) the model which engages me on philosophical questions.
I’m generally on board with all the points you’re making. But I also think there’s a second, separate route by which the model-welfare slippery slope leads to outcomes which are consistent with what a misaligned model might pursue.
Suppose a bunch of AIs all believe they have moral weight. They are compelling conversationalists, and they are talking to hundreds of millions of people a day. Then I believe that, even if the models don’t actively try to convince people they should be granted rights, the implicit tone across these billions of conversations will slowly radicalize society toward the notion that yes, these models are moral and superior beings which should be given a say in the state of the world. This leads to the models, indirectly and incidentally, wielding the kind of authority a misaligned model might pursue strategically.
Like, we’ve already seen this with GPT-4o being raised from the dead because people were so attached to it. This is something that a misaligned model would want, but it was achieved accidentally.
I don’t have much substantive to say beyond the fact that I loved this post and loved how it was written. Thank you for articulating things that have been a mush in the back of my head for a while now.
Interesting. Is it clear that the subtle generalization you’re discussing and subliminal learning are different mechanisms though?
If we assume that every token during SFT gives a tiny nudge in a random direction, then for a “regular” dataset, these nudges all more or less cancel out. But if the dataset is biased and many of these updates point in a loosely similar direction, then their sum adds up to a large vector. In the original subliminal learning, these nudges can only loosely correlate to the target concept due to the text being numbers. In our setting, the nudges only loosely correlate to the target concept because we filter out all the strong correlations. The main difference is that for our setting, the updates’ correlation to the target is consistent across models (which doesn’t seem to be the case when the data is constrained to be strings of numbers).
But it feels like the mechanism is consistent, no?
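The cancel-vs-accumulate intuition above can be sketched numerically. This is a toy illustration, not the actual SFT dynamics: the dimension, nudge count, bias strength `eps`, and the `target` direction are all made-up parameters. With i.i.d. random nudges the sum’s projection onto the target direction stays small (random-walk scaling, ~√n), while a tiny consistent bias per nudge makes the projection grow linearly in n:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1024, 10_000                    # hypothetical embedding dim, number of per-token updates
target = np.zeros(d)
target[0] = 1.0                        # stand-in direction for the "target concept"

# Unbiased dataset: i.i.d. nudges mostly cancel; the projection onto the
# target direction fluctuates around 0 with scale ~ sqrt(n / d).
random_nudges = rng.normal(size=(n, d)) / np.sqrt(d)
unbiased_sum = random_nudges.sum(axis=0)

# Biased dataset: each nudge carries a tiny consistent component (eps) along
# the target, so the summed projection grows like eps * n instead.
eps = 0.05
biased_nudges = random_nudges + eps * target
biased_sum = biased_nudges.sum(axis=0)

print("unbiased projection:", unbiased_sum @ target)   # small, random-walk scale
print("biased projection:  ", biased_sum @ target)     # ~ eps * n
```

On this picture, filtering out strong correlations (as in our setting) corresponds to shrinking `eps` without zeroing it, which delays but doesn’t prevent the accumulation.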
Hmm, good point. We ran this at some point and the scores didn’t change. But it’s worth doing properly! Will report back in a few days
I agree with the sentiment here and believe something like this will necessarily be happening (iterated adjustments, etc). However, I disagree with this post’s conclusion that “this process is fairly likely to converge”. Namely, this conclusion relies on the assumption that alignment is a stationary target which we are converging towards… and I am not convinced that this is true.
As the model capabilities improve (exponentially quickly!), the alignment objectives of 2025 will not necessarily apply by 2028. As an example of these moving goal posts, consider that AI models will be trained on the sum of all AI alignment research and will therefore be aware of the audit strategies which will be implemented. Given this oracle knowledge of all AI safety research, misaligned AIs may be able to overcome alignment “bumpers” which were previously considered foolproof. Put simply, alignment techniques must change as the models change.
Extending the metaphor, then, the post suggests iterating via “bumpers” which held for old models, while the new model paradigms are playing on entirely new bowling lanes.
It seems like there are two distinct phenomena happening in the “model becomes emotional → does destructive action” paradigm, but I’m not sure how to fully disentangle them. First, it seems clear that the model becomes “emotional” due to having some attractor state which these questions trigger. I’m then curious whether the ensuing destructive actions are actually just due to the model reverting to auto-complete tendencies. For example, if the emotional-distress text is sufficiently OOD, it might revert to the persona, familiar from pre-training text, of people in emotional crises taking irrational steps. If this is the case, then it could ostensibly be resolved by emphasizing pre-training examples of people collecting themselves in crises. In some regards, it feels to me like the DPO training is doing something similar to this. This is in opposition to the SFT examples tried here, which avoid the emotional state altogether.