Full degradation on WMDP, or whatever dangerous benchmark
No degradation on MMLU, or whatever benign benchmark
I don’t buy that this is strictly possible. MMLU knowledge, fit well enough, requires inventing the universe. That said, it may be effectively possible: I wouldn’t be surprised to find that, e.g., input features and output features are mostly not exchangeable, and that it’s disproportionately harder to invent-by-CoT things where the entire pattern is missing from the native output feature space.
Ultimately, the preexisting dangerous technologies we’d rather models not use, even when they seem otherwise relevant, are ones that could in principle be re-derived; humans did that once. Presumably the same is true for the novel, catastrophically dangerous technologies we’re most concerned to avoid.
So even without tool use, a sufficiently intelligent model could reason from first principles, as long as it understands how to do that. It’s then up to that model to simply not help anyone produce technologies that have sufficiently dangerous properties.
I suspect “fit well enough” doesn’t track anything in reality.