All three of the big US labs have recently accused various Chinese labs of large-scale covert distillation of their models, presenting evidence that the labs in question have been using thousands of fraudulent accounts to cover their tracks.
In the same way that models have learned to detect whether they are being given a real task or a training scenario and alter their responses, is it likely they will do the same for this scenario?
This would be fascinating, as would their responses. Especially if the models respond in different ways.
I’d guess that’s pretty hard since the distillation attacks are intentionally spread across many accounts and conversations, and mixed in with innocuous requests. But on the other hand, the models are getting quite good at noticing subtle tells that humans would miss.
I suppose another question is: if you’re a big lab and you detect a distillation attack, do you block it or do you quietly feed it poisoned responses?
The thing is, bot spam isn’t generally high quality yet. On many platforms, a username followed by a string of numbers is still a viable tell for identifying a bot. And part of Anthropic’s concern with the military was that AI can de-anonymize people across multiple accounts and platforms. That capability, if Dario is correct that it exists, seems in line with the ability to identify such a distillation attack, or at least to begin to. And once it begins, RL means the AI is likely to get better at it over time. Or am I giving AI too much credit? I’m not sure.
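To make that a bit more concrete, here’s a minimal sketch of what flagging coordinated accounts might look like, just comparing crude character n-gram fingerprints of each account’s prompt stream. The feature choice, threshold, and account names are all hypothetical on my part; a real system would use far richer signals than this.

```python
from collections import Counter

def prompt_fingerprint(prompts, n=3):
    """Bag of character n-grams as a crude per-account signature."""
    grams = Counter()
    for p in prompts:
        text = p.lower()
        grams.update(text[i:i + n] for i in range(len(text) - n + 1))
    return grams

def overlap(a, b):
    """Fraction of the larger fingerprint that the two accounts share (0..1)."""
    shared = sum(min(a[g], b[g]) for g in a if g in b)
    return shared / max(sum(a.values()), sum(b.values()), 1)

def flag_coordinated_accounts(accounts, threshold=0.6):
    """Return account pairs whose prompt streams look suspiciously alike."""
    fps = {acct: prompt_fingerprint(ps) for acct, ps in accounts.items()}
    names = list(fps)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if overlap(fps[a], fps[b]) >= threshold
    ]

# Toy usage: two "accounts" running near-identical extraction prompts, one normal user.
accounts = {
    "acct_1": ["explain step by step how to solve integrals",
               "explain step by step how to solve limits"],
    "acct_2": ["explain step by step how to solve derivatives",
               "explain step by step how to solve series"],
    "acct_3": ["what's a good pasta recipe?"],
}
print(flag_coordinated_accounts(accounts))  # likely flags ("acct_1", "acct_2")
```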
In response to your question, I would set thresholds. At all levels below interruption of service and/or complaints from the user base, feed it poisoned responses, combined with quiet pruning of the worst offenders by muting their spread in the algorithm. Once it starts interrupting actual users and drawing complaints, start blocking and banning. But that’s a hard line to draw, and I assume some cases will require escalation and resolution.
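Roughly the tiered policy I have in mind, sketched below with hypothetical field names, cutoffs, and action labels (nothing any lab has actually published):

```python
from dataclasses import dataclass

@dataclass
class AccountSignal:
    suspicion: float        # 0..1 score that this account is part of a distillation ring
    degrades_service: bool  # is this traffic interrupting legitimate users?
    user_complaints: int    # complaints attributable to this traffic

def enforcement_action(sig: AccountSignal) -> str:
    # The hard line: once real users are affected, move to blocking and banning.
    if sig.degrades_service or sig.user_complaints > 0:
        return "block_and_ban"
    # Below that line, quietly degrade the value of whatever is being extracted.
    if sig.suspicion >= 0.9:
        return "poison_responses_and_throttle"   # worst offenders: poison plus quiet pruning
    if sig.suspicion >= 0.6:
        return "poison_responses"
    return "monitor"

# Example: a highly suspicious account that hasn't disrupted anyone yet.
print(enforcement_action(AccountSignal(suspicion=0.95, degrades_service=False, user_complaints=0)))
# -> poison_responses_and_throttle
```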