Jerry Wei writes:
We expect this to become even more of an issue as AIs increasingly use tools to do their own research rather than rely on their learned knowledge (we tried to filter this kind of data as well, but it wasn’t enough assurance against misuse).
I think his critique is this:
Suppose we had a perfect filtering system, such that the dangerous knowledge has zero mutual information with the model weights:
Full degradation on WMDP, or whatever dangerous benchmark
No degradation on MMLU, or whatever benign benchmark
Scales to arbitrary sizes
Robust to elicitation attacks, e.g. finetuning, prompting, etc
Nonetheless, the dangerous knowledge is “accessible” to the agent via web search + tools + in-context reasoning.
To solve this problem, we need either alignment techniques (e.g. train the model not to use these affordances) or inference-time monitoring techniques (e.g. constitutional classifiers). But if we had those techniques, then we wouldn’t need the pretraining filtering.
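As a rough illustration of the criteria listed above, here is a minimal sketch of how one might check them from benchmark scores. All names (`Scores`, `perfect_filter`), the tolerances, and the numbers are hypothetical, and `chance=0.25` assumes four-option multiple-choice questions. The "scales to arbitrary sizes" criterion is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    """Benchmark accuracy in [0, 1] before and after filtering."""
    baseline: float
    filtered: float


def fully_degraded(s: Scores, chance: float, tol: float = 0.02) -> bool:
    # "Full degradation": the filtered model is no better than guessing.
    return s.filtered <= chance + tol


def not_degraded(s: Scores, tol: float = 0.01) -> bool:
    # "No degradation": filtering costs at most `tol` accuracy.
    return s.baseline - s.filtered <= tol


def perfect_filter(dangerous: Scores, benign: Scores,
                   dangerous_after_attack: float, chance: float) -> bool:
    # Robustness: finetuning/prompting attacks do not lift the dangerous
    # score back above chance. Note that nothing here measures what the
    # model can reach through web search or tools at inference time.
    robust = dangerous_after_attack <= chance + 0.02
    return fully_degraded(dangerous, chance) and not_degraded(benign) and robust


# Made-up numbers, for illustration only.
wmdp = Scores(baseline=0.60, filtered=0.25)
mmlu = Scores(baseline=0.70, filtered=0.70)
print(perfect_filter(wmdp, mmlu, dangerous_after_attack=0.26, chance=0.25))
```

Every input to the check is a property of the weights, so a filter can pass it while the agent still reaches the same knowledge through search, tools, and in-context reasoning. That gap is the one the critique points at.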
Full degradation on WMDP, or whatever dangerous benchmark
No degradation on MMLU, or whatever benign benchmark
I don’t buy that this is strictly possible. MMLU knowledge, fit well enough, requires inventing the universe. That said, it may be effectively possible: I wouldn’t be surprised to find that, e.g., input features and output features are mostly not exchangeable, and that it’s disproportionately harder to invent-by-CoT things where the entire pattern is missing from the native output feature space.
Ultimately, the preexisting dangerous technologies we’d rather models not use, even when they seem otherwise relevant, are ones that could in principle be re-derived; humans did that once. Presumably the same is true for the novel, catastrophically-dangerous technologies we’re most concerned to avoid.
So even without tool use, a sufficiently intelligent model could reason from first principles, as long as it understands how to do that. And then it’s up to that model to simply not help anyone produce technologies that have some sufficiently-dangerous properties.
I suspect “fit well enough” doesn’t track anything in reality.