This seems like big trouble for open data in general, not just for continual learning. Attackers could create and poison public datasets to exploit a particular model; anyone fine-tuning a model with known properties on an open dataset is vulnerable to this attack.
My instinct is to reach for a concept-ablation-fine-tuning-style solution, where we ablate harmful directions in activation space during fine-tuning (rough sketch below). This won't do away with biased preferences entirely, but it could help avoid the most serious harms. In the context of your self-driving example, we might ablate the direction corresponding to hitting people from our fine-tuning steps.
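For concreteness, here is a minimal sketch of the kind of intervention I have in mind, in PyTorch with a Hugging Face model. Everything specific is my own assumption rather than anything from the post: gpt2 is a stand-in model, layer 6 is an arbitrary choice of residual stream, and the "harmful" direction is a random placeholder where you would plug in a real concept direction found by whatever method you prefer (difference of means over contrastive prompts, PCA over activation differences, an SAE latent, etc.).

```python
# Sketch: project a "harmful" concept direction out of one layer's activations
# while fine-tuning, so gradient updates never push the model along it.
# Assumptions: gpt2 as a stand-in model, layer 6 chosen arbitrarily, and a
# random placeholder direction standing in for a real concept direction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

layer_idx = 6  # which residual stream to intervene on (assumption)
d_model = model.config.hidden_size
harmful_dir = torch.randn(d_model)              # placeholder direction
harmful_dir = harmful_dir / harmful_dir.norm()  # unit norm

def ablate_direction(module, inputs, output):
    """Remove the harmful direction's component from the layer's output."""
    hidden = output[0] if isinstance(output, tuple) else output
    coeff = hidden @ harmful_dir                          # (batch, seq)
    hidden = hidden - coeff.unsqueeze(-1) * harmful_dir   # projection removed
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

hook = model.transformer.h[layer_idx].register_forward_hook(ablate_direction)

# Ordinary fine-tuning step; gradients flow through the projected activations,
# so training on a (possibly poisoned) open dataset can't reinforce behaviour
# mediated by the ablated direction at this layer.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
batch = tokenizer("example fine-tuning text", return_tensors="pt")
optimizer.zero_grad()
out = model(**batch, labels=batch["input_ids"])
out.loss.backward()
optimizer.step()

hook.remove()  # or keep the hook at inference time as an extra guardrail
```

Obviously a single random direction at a single layer is a toy version; the real work is in finding directions that actually mediate the behaviour you care about, and checking that ablating them doesn't wreck capabilities you need.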
The ablation approach could be complemented by the digital-twin method Karl Krueger suggested above, and by further layers of defence against model unpredictability, in a Swiss-cheese fashion.
Great post. I learned a lot. Thank you.