What would it look like if such a model produces code or, more generally, uses any skill that entails using a domain specific language? I guess in the case of programming even keywords like “if” could be translated into Sumerian, but I can imagine that there are tasks that you cannot obfuscate this way. For example, the model might do math by outputting only strings of mathematical notation.
Also, it seems likely that frontier models will all be multi-modal, so they will have other forms of communication that don’t require language anyway. I suppose for image generating models, you could train the frontier model to apply an obfuscating filter and have the reporter restore the image iff it is relevant to medical research.
Maybe my biggest concern is that I’m not convinced that we will be safe from manipulation by restricting the model’s communication to being only about e.g. medical research. I don’t have a precise story for how, but it seems plausible that a misaligned models could output instructions for medical research, which also have side-effects that work towards the model’s secret goal.
Nitpicks aside, I like the idea of adding a safety layer by preventing the model to communicate directly with humans! And my concerns feel like technical issues that could easily be worked out given enough resources.
Thank you for this post. It looks like the people at Anthropic have put a lot of thought into this which is good to see.
You mention that there are often surprising qualitative differences between larger and smaller models. How seriously is Anthropic considering a scenario where there is a sudden jump in certain dangerous capabilities (in particular deception) at some level of model intelligence? Does it seem plausible that it might not be possible to foresee this jump from experiments on even slighter weaker models?