Thanks! Much appreciated.
I think there are two meanings of robustness here:
In-context robustness. A simple example of this is resisting persona-based jailbreaks—e.g. when we tell the model “You are DAN” it should not believe this. But yes, really good versions of this go beyond that. We want really stable personas that survive throughout long-context deployment, with minimal persona drift. (Maybe this can just be solved prosaically? Maybe we don’t need to intervene on the assistant axis—maybe we just need to inject lots of reminders like “You are Claude, a helpful, aligned model” into the context window every so often. Or do other ‘context management’ things to stabilize the persona against drift.)
Weights-level robustness. Here I’m mainly thinking about open-weights models. We release them with various safeguards, but right now it seems easy to remove the safeguards via additional finetuning. It seems plausible to me that having an aligned persona that’s robust to finetuning will make it much harder for such models to be misused for categorically bad things (like phishing scams). (On the other hand maybe this is just intractable. I haven’t thought much about specifics here.)
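The prosaic "context management" idea from the first point could be sketched as follows. This is a minimal illustrative sketch, not a real API: the reminder text, the interval, and the `with_persona_reminders` helper are all assumptions for the sake of the example.

```python
# Sketch: stabilize a persona against long-context drift by re-injecting
# a persona reminder into the conversation every N user turns.
# REMINDER text and REMIND_EVERY are arbitrary illustrative choices.

REMINDER = {"role": "system", "content": "You are Claude, a helpful, aligned model."}
REMIND_EVERY = 5  # user turns between reminders

def with_persona_reminders(messages, remind_every=REMIND_EVERY, reminder=REMINDER):
    """Return a copy of the conversation with the persona reminder
    re-inserted after every `remind_every`-th user turn."""
    out = []
    user_turns = 0
    for msg in messages:
        out.append(msg)
        if msg["role"] == "user":
            user_turns += 1
            if user_turns % remind_every == 0:
                out.append(dict(reminder))
    return out
```

Whether this kind of periodic reminder actually prevents persona drift, versus just diluting it, is exactly the open empirical question.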
Some cruxes for me as to which one is more important:
What does continual learning look like in the future? If it's mostly about giving additional tools / skills / memory to a black-box LLM API, then I prioritize in-context robustness. But if it involves additional finetuning, then I prioritize weights robustness more.
How powerful will the best open-weights models be? Do they keep improving at rates similar to frontier models, or will they max out somewhere? If it's possible for them to catch up to frontier models within ~6-12 months, then open-weights safety seems like it'll be a big thing next year.