This is really interesting work and a great write-up!
You write “Bad actors can harness the full offensive capabilities of models if they have access to the weights.” Compared to the harm a threat actor could do just with access to the public interface of a current frontier LLM, how bad would it be if a threat actor were to maliciously fine-tune the weights of an open-sourced (or exfiltrated) model? If someone wants to use a chatbot to write a ransomware program, it’s trivial to come up with a pretext that disguises one’s true intentions from the chatbot and gets it to spit out code that could be used maliciously.
It’s not clear to me that a threat actor would gain much from a maliciously fine-tuned LLM compared with the plain-old public GPT-3.5 interface that we all know and love. Maybe it would save them a small bit of time if they didn’t have to come up with a pretext? Or maybe there are offensive capabilities lurking in current models that are somehow locked away by safety fine-tuning, capabilities that can only be elicited by asking for them explicitly in one’s prompt?
To be clear, I strongly agree with your assessment that open-sourcing models is extremely reckless and on net a very bad idea. I also have a strong intuition that it makes a lot of sense to invest in security controls that prevent threat actors from exfiltrating model weights and using the models for malicious purposes. I’m just not sure whether the payoff from blocking such exfiltration attacks would come in the present/near term, or whether we would have to wait until the next generation of models (or the generation after that) before direct access to model weights grants significant offensive capabilities above and beyond those accessible through a public API.
Thanks again for this post and for your work!
EDIT: I realize you could run into infohazards if you respond to this. I certainly don’t want to push you to share anything infohazard-y, so I totally understand if you can’t respond.
I really appreciate it! I hope this ends up being useful to people. We tried to create the resource we would have wanted to have when we first started learning about infra-Bayesianism. Vanessa’s agenda is deeply interesting (especially to a math nerd like me!), and developing formal mathematical theories to help with alignment seems very neglected.