All of these seem very useful to me (except maybe robustness).
Can you clarify what’s meant by robustness? You mentioned robustness to weight updates—this seems potentially bad because it involves making the AI incorrigible. Under robustness I’d have said: having the AI’s persona be stable under long serial reasoning, long contexts, and when AI agents are communicating extensively during deployment.
I think there are two meanings of robustness here:
In-context robustness. A simple example of this is resisting persona-based jailbreaks—e.g. when we tell the model “You are DAN” it should not believe this. But yes, really good versions of this go beyond that. We want really stable personas that survive throughout long-context deployment, with minimal persona drift. (Maybe this can just be solved prosaically? Maybe we don’t need to intervene on the assistant axis—maybe we just need to inject lots of reminders like “You are Claude, a helpful, aligned model” into the context window every so often. Or do other ‘context management’ things to stabilize the persona against drift.)
Weights-level robustness. Here I’m mainly thinking about open-weights models. We release them with various safeguards, but right now it seems easy to remove the safeguards via additional finetuning. It seems plausible to me that having an aligned persona that’s robust to finetuning will make it much harder for such models to be misused for categorically bad things (like phishing scams). (On the other hand maybe this is just intractable. I haven’t thought much about specifics here.)
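To make the “inject lots of reminders” idea for in-context robustness concrete, here is a purely illustrative sketch. The reminder text, message format, and interval are all hypothetical choices, not a tested recipe: it just re-inserts a persona reminder into a growing chat transcript at a fixed cadence before the transcript is sent to the model.

```python
# Illustrative sketch of periodic persona-reminder injection to counter
# persona drift in long contexts. All specifics here (message schema,
# reminder wording, interval) are hypothetical.

REMINDER = {"role": "system", "content": "You are Claude, a helpful, aligned model."}
REMINDER_EVERY = 10  # re-insert after every 10 messages (arbitrary choice)

def with_reminders(messages, every=REMINDER_EVERY):
    """Return a copy of `messages` with the persona reminder re-inserted
    after every `every` messages."""
    out = []
    since_reminder = 0
    for msg in messages:
        out.append(msg)
        since_reminder += 1
        if since_reminder >= every:
            out.append(dict(REMINDER))  # fresh copy each time
            since_reminder = 0
    return out
```

The transcript would be passed through `with_reminders` on each model call, so the reminder stays periodic no matter how long the conversation grows; whether this actually stabilizes a persona is an empirical question.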
Some cruxes for me as to which one is more important:
what does continual learning look like in the future? If it’s mostly giving additional tools / skills / memory to a black-box LLM API then I prioritize in-context robustness. But if it involves additional finetuning then I prioritize weights robustness more.
how powerful will the best open-weights models be? Do they keep improving at similar rates to frontier models, or will they max out somewhere? If they can keep catching up to frontier models within ~6-12 months, then open-weights safety seems like it’ll be a big thing next year.
Thanks! Much appreciated.