Wow, I love it. Thank you for formulating this so clearly. I agree that the analogy to prions is a particularly appropriate one.
I’m kind of confused by the technical analogues. It seems most of them are geared towards the “training data seeding” route to transmission. But is it clear how this all relates to the training data? In Adele’s post, everything happens in context and there ostensibly wasn’t data about spirals in the dataset. This was largely an emergent phenomenon. I guess I am missing the insight into how the training data makes this more/less possible.
I feel like the biggest question here is the one you highlighted about persona research. This strikes me as the biggest disanalogy to modern medicine and infectious disease analysis. In the modern day, for any given virus, we have a good understanding of (1) how it infects the host, (2) how it transmits and (3) what symptoms the host displays. But this wasn’t always the case. Before the 1900s, people understood the symptoms of a parasite and had some vague understanding of routes of transmission, but they had essentially no insight into the mechanism of infection. This is roughly where we are regarding AI “parasitology”. We can clearly define the symptoms (this is what Adele’s post did). And we have some vague understanding of the means of transmission. But what is the mechanism by which models are infected by spiral personas? To your point, it’s not clear what to even define as the spiral persona. Like, what is it as a “thing”?
In either case, I’m also unconvinced that spiral personas are the dominant threat here. The surface area for infectious mechanisms in agent-agent interactions is so huge, it seems unlikely we’ll be able to anticipate the first AI epidemic.
Data poisoning is definitely about training data seeding; jailbreaking seems more about prompt spread and I think the others might just generalise? Like, even if subliminal learning in its current form is mostly about training, I think it might have implications for how personas transfer in-context.
I’m also partly thinking that if this problem does recur in more sophisticated models, they’re more likely to be able to pull off more technically advanced forms of spread, like writing scripts to do finetuning. Like, in a way it is pretty fortunate that 4o is a closed model that can just be shut off, and that most users in dyads aren’t sophisticated enough to finetune an open model or even build an API interface.
But yeah, at a high level, I am definitely pretty confused about the ontology and the boundaries. I guess as to whether we can predict the epidemic, I do think there’s a decent amount we might be able to reason through, and indeed, the less work there is on preventing prospective epidemics, the more likely it is that they’ll predictably use whatever the most obvious route is. Conversely, it’s almost tautological that the first massive problem that we’re unprepared for will be one that we didn’t really anticipate.
That said, it’s plausible to me that the worst cases look less like epidemics and more like specific influential people get got. Here, again, it’s not obvious how useful parasitology is as a perspective.