Can you clarify what you mean by ‘neural analog’ / ‘single neural analog’? Is that meant as another term for what the post calls ‘simple correspondences’?
Even if all the safety-relevant properties have such analogs, there’s no reason to believe (at least for now) that we have the interp tools to find them in time, i.e., before we have systems fully capable of pulling off a deception plan.
Agreed. I’m hopeful that mech interp will continue to improve and be automated fast enough for that to work, but I’m skeptical that it will. Alternatively, I’m hopeful that we turn out to be in an easy-mode world where there is something like a single ‘deception’ direction that we can monitor; that would at least buy us significant time before it stops working on more sophisticated systems (plausibly due to optimization / selection pressure, if nothing else).
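For concreteness, here’s a minimal sketch of what monitoring a single ‘deception’ direction might look like in that easy-mode world, assuming we could collect hidden activations from matched honest vs. deceptive prompts (a simple difference-of-means probe; the function names and toy data are hypothetical, not anyone’s actual method):

```python
import numpy as np

# Hypothetical sketch: estimate a "deception direction" as the
# difference of mean activations between deceptive and honest prompts,
# then flag new activations by their projection onto that direction.

def deception_direction(honest_acts: np.ndarray, deceptive_acts: np.ndarray) -> np.ndarray:
    """Unit-norm difference-of-means direction in activation space."""
    direction = deceptive_acts.mean(axis=0) - honest_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def deception_score(activation: np.ndarray, direction: np.ndarray) -> float:
    """Projection of one activation vector onto the candidate direction."""
    return float(activation @ direction)

# Toy usage with random stand-ins for real model activations.
rng = np.random.default_rng(0)
honest = rng.normal(size=(100, 512))
deceptive = rng.normal(loc=0.1, size=(100, 512))
d = deception_direction(honest, deceptive)
print(deception_score(rng.normal(size=512), d))  # larger score = more "deception-like"
```

Of course, the worry in my comment is exactly that a probe this simple stops working once more sophisticated systems (or optimization pressure) come into play.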
I’m also worried that claims such as “we can make important forward progress on particular intentional states even in the absence of such a general account” could lead to a slippery slope that more or less embraces building the dangerous thing first, without sufficient precautions.
I agree that that’s a real risk; it makes me think of Andreessen Horowitz and others claiming in an open letter that interpretability had basically been solved and so AI regulation wasn’t necessary. On the other hand, it seems better to state our best understanding plainly, even if others will slippery-slope it, than to take the epistemic hit of shifting our language in the other direction to compensate.