Something I’m grappling with:
From a recent interview between Bill Gates & Sam Altman:
Gates: “We know the numbers [in a NN], we can watch it multiply, but the idea of where is Shakespearean encoded? Do you think we’ll gain an understanding of the representation?”
Altman: “A hundred percent…There has been some very good work on interpretability, and I think there will be more over time…The little bits we do understand have, as you’d expect, been very helpful in improving these things. We’re all motivated to really understand them…”
To the extent that a particular line of research can be described as “understand better what’s going on inside NNs”, is there a general theory of change for that? Understanding them better is clearly good for safety, of course! But in the general case, does it contribute more to safety than to capabilities?
People have repeatedly made the argument on this forum that it contributes more to capabilities, and so far it hasn’t seemed to convince many interpretability researchers. I personally suspect this is largely because they’re motivated by capabilities curiosity and don’t want to admit it, whether in public or even to themselves.
Thanks—any good examples spring to mind off the top of your head?
I’m not sure my desire to do interpretability comes from capabilities curiosity, but it certainly comes in part from interpretability curiosity; I’d really like to know what the hell is going on in there...