I’ve held this position since 2022, but this past year I’ve been very surprised and impressed by just how good black-box methods can be, e.g. the control agenda, Owain Evans’s work, Anthropic’s (& others’ I’m probably forgetting).
How to prove a negative: We can find evidence for or against a hypothesis, but rigorously proving the absence of deception circuits seems incredibly hard. How do you know you didn’t just miss it? How much of the model do you need to understand? 90%? 99%? 99.99%?
If you understand 99.9% of the model, then you can just run your understanding, leaving out the possible deception circuit in the 0.1% you couldn’t capture. Ideally this 99.9% is useful enough to automate research (or you use the 99.9% model as your trusted overseer as you try to bootstrap interp research to understand more percentage points of the model).
I agree in principle, but as far as I know, every interp explanation produced so far explains more like 20–50% of the (tiny) parts of the model it’s trying to explain (e.g. see the causal scrubbing results, or our dialogue with Neel, which also discusses how much of the model we actually understand).
I disagree regarding the way we currently use “understand”: e.g. I think that SAE reconstructions have the potential to smuggle in lots of things via, say, the exact values of the continuous activations, latents that don’t quite mean what we think, etc.
It’s plausible that a future and stricter definition of “understand” fixes this though, in which case I might agree? But I would still be concerned that 99.9% understanding involves a really long tail of heuristics, and I don’t know what may emerge from combining many things that individually make sense. And I probably put >0.1% probability on a superintelligence adversarially smuggling things we don’t like into a system we think we understand.
Anyway, all that pedantry aside, my actual concern is tractability. If addressed, this seems plausibly helpful!