I sort of think of activation oracles as “chain of thought monitoring, but for activations”.
Recall the basic chain-of-thought monitoring safety case: if chain of thought is necessary for a given model to complete a task, then (barring sophisticated encoding strategies) the chain of thought should be monitorable for that (task, model) pair.
As models get bigger and deeper, the space of tasks where chain of thought is necessary shrinks, so (all else equal) chain-of-thought monitorability decreases.
How do we get around this? Well, assuming the model is using some kind of general-purpose reasoning internally, that general-purpose reasoning should have some general structure. And you know what’s great at learning general structures: neural networks!
We won’t be able to capture all the cognition of a model (that would require ambitious interpretability), but we may be able to capture all the cognition that routes through something like “internal general-purpose reasoning”.
How well activation oracles learn this general structure (and the degree to which there’s a general structure at all) is an empirical question, but we can get rich signal on it by training activation oracles and evaluating them on auditing games, unsupervised elicitation, etc.
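To make the shape of that empirical question concrete, here’s a minimal sketch of the setup in PyTorch: a small learned decoder is trained on a frozen subject model’s cached activations to report a latent property of the model’s state. Heavy hedging applies: real activation oracles condition a full LLM on injected activations, and the MLP, random “activations”, shapes, and labels below are all illustrative assumptions, not any particular paper’s setup.

```python
import torch
import torch.nn as nn

D_MODEL, N_PROPS = 512, 8  # hidden width of the subject model; number of latent properties (both made up)

# Stand-in for cached residual-stream activations from a frozen subject model,
# paired with labels for some latent property (e.g., "is the model being evasive?").
# Real data would come from running the subject model; random tensors here are placeholders.
acts = torch.randn(4096, D_MODEL)
labels = torch.randint(0, N_PROPS, (4096,))

# The "oracle": a decoder from an activation vector to an answer about it.
# (Real activation oracles are full LLMs conditioned on injected activations;
# this MLP probe just stands in for that interface.)
oracle = nn.Sequential(nn.Linear(D_MODEL, 256), nn.GELU(), nn.Linear(256, N_PROPS))
opt = torch.optim.Adam(oracle.parameters(), lr=1e-3)

for step in range(200):
    idx = torch.randint(0, acts.size(0), (64,))
    loss = nn.functional.cross_entropy(oracle(acts[idx]), labels[idx])
    opt.zero_grad()
    loss.backward()
    opt.step()

# Evaluation mirrors an auditing game: does the oracle recover the latent
# property (here, on the training set for brevity; a real eval would use
# held-out tasks/models) better than black-box baselines do?
with torch.no_grad():
    acc = (oracle(acts).argmax(dim=-1) == labels).float().mean()
print(f"train accuracy: {acc:.2f}")
```

The point of the sketch is the interface, not the architecture: the oracle only ever sees activations, so any auditing signal it produces is evidence about internal structure rather than about surface behavior.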
Now you might argue that this is a very contingent defense of activation oracles: if, on a set of evals where you would expect activation oracles to work, a different method outperforms them, should we abandon activation oracles for the favored method? (After all, our only evidence for this “general internal reasoning structure” is the downstream usefulness of activation oracles.) Concretely, it seems like dumb prefill black-box methods mostly outperform activation oracles on auditing evals. Given this, why favor activation oracles over dumb black-box methods (or smarter black-box methods like confessionals)?
Given current public evidence, you have to rely on conceptual arguments to justify favoring activation oracles: intentionally subverting activation oracles seems harder for a schemer than intentionally subverting black-box methods. I expect we should be able to show this in a compelling way with stress-testing / control evals (though defining a “fair fight” is non-trivial, since an in-context red team would unrealistically favor AOs; see this post for some thoughts on this).