Thanks for writing this! I appreciate hearing how all this stuff reads to you.
I’m writing this comment to push back on the claim that current interpretability work is relevant to the lethal stuff that comes later, à la:
"I have heard claims that interpretability is making progress, that we have some idea about some giant otherwise inscrutable matrices, and that this knowledge is improving over time."
What I’ve seen folks understand so far are parts of perception in image-processing neural nets, where certain visual concepts show up in those nets, and, more recently, some of the structure by which small transformers pipe information around.
The goalpost for this sort of work mattering in the lethal regime is something like: improving our ability to watch concepts move through a large mind made out of a blob of numbers, with enough fidelity to notice when it’s forming an understanding of its operators, forming plans to disable them and escape, or doing anything much subtler but still lethal.
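To make concrete how far the current toolkit sits from that goalpost: the workhorse for "watching a concept" today is roughly a linear probe on cached activations. Here is a minimal sketch, assuming a PyTorch-style setup; the layer choice, the 768-dimensional width, and the labeled concept data are all hypothetical, not anyone's actual pipeline:

```python
# A minimal sketch of what "watching a concept" amounts to today: one
# learned linear direction in one layer's activation space. Everything
# here (layer, width, labels) is assumed for illustration.
import torch
import torch.nn as nn

d_model = 768                      # assumed width of the residual stream
probe = nn.Linear(d_model, 1)      # one learned direction = one "concept"
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

def train_probe(acts: torch.Tensor, labels: torch.Tensor, epochs: int = 100):
    """acts: [n, d_model] activations cached at one fixed layer.
    labels: [n] 0/1 flags for whether the concept appeared in the input.
    Note what this presupposes: we already know which concept to look
    for, and we can label it from the outside."""
    for _ in range(epochs):
        logits = probe(acts).squeeze(-1)
        loss = nn.functional.binary_cross_entropy_with_logits(
            logits, labels.float())
        opt.zero_grad()
        loss.backward()
        opt.step()
```

This only finds concepts you already knew to label, one at a time, at one layer; nothing in it scales toward noticing an unanticipated plan forming somewhere in the whole blob of numbers.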
So I see interpretability falling far short here. In my book that’s mostly because interpreting a messy AGI mind inherits the abject difficulty of building a cleaned-up version of that AGI at the same capability level.
We’re also making leaps and bounds of anti-progress on AGI Cleanliness every year, which makes everything that much harder.