This is an occasional reminder that I think pushing the frontier of AI capabilities in the current paradigm is highly anti-social.
There are plenty of other similarly fun things you can do instead! Like trying to figure out how the heck modern AI systems work as well as they do.
These two research tracks actually end up being highly entangled/convergent; they don't disentangle cleanly in the way you/we would like.
Some basic examples:
successful interpretability tools want to be debugging/analysis tools of the type known to be very useful for capability progress (it's absolutely insane that people often try to build complex DL systems without the kinds of detailed debugging/analysis tools that are useful in many related fields, such as computer graphics pipelines. You can dramatically accelerate progress when you can quickly visualize/understand your model's internal computations on a gears level, vs. the black-box alchemy approach).
Deep understanding of neuroscience mechanisms could advance safer brain-sim ish approaches, help elucidate practical partial alignment mechanisms (empathic altruism, prosociality, love, etc), but also can obviously accelerate DL capabilities.
Better approximations of universal efficient active learning (empowerment/curiosity) are obviously dangerous capability-wise, but also seem important for alignment via modeling/bounding human utility when externalized.
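As a minimal sketch of the "interpretability tool as debugging tool" point: a toy model whose forward pass exposes every intermediate activation through hooks, rather than returning only the final output. All names here are illustrative, not taken from any real library, and real tools (e.g., framework forward hooks) are far richer.

```python
# Toy "model" with per-layer hooks, so intermediate computations are
# visible for inspection/visualization instead of being a black box.
# Everything here is a hypothetical illustration, not a real API.

class TinyModel:
    def __init__(self, layers):
        self.layers = layers   # list of callables applied in sequence
        self.hooks = []        # each hook is called as hook(layer_index, activation)

    def add_hook(self, fn):
        self.hooks.append(fn)

    def forward(self, x):
        for i, layer in enumerate(self.layers):
            x = layer(x)
            for hook in self.hooks:
                hook(i, x)     # expose the internal state at every step
        return x

# Capture per-layer activations into a trace for later analysis.
trace = {}
model = TinyModel([lambda v: v * 2, lambda v: v + 3, lambda v: v ** 2])
model.add_hook(lambda i, a: trace.setdefault(i, a))
out = model.forward(1.0)
# out == 25.0; trace == {0: 2.0, 1: 5.0, 2: 25.0}
```

The same trace that lets you debug a misbehaving layer is exactly what an interpretability researcher wants to inspect, which is the entanglement being claimed.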
In the DL paradigm you can’t easily separate capabilities and alignment, and forcing that separation seems to constrain us to approaches that are too narrow/limiting to be relevant on short timelines.
successful interpretability tools want to be debugging/analysis tools of the type known to be very useful for capability progress
Give one example of a substantial state-of-the-art advance that was decisively influenced by transparency; I ask because you said "known to be." Saying that entanglement is conceivable isn't evidence that it actually occurs in practice. The track record is that transparency research gives us differential technological progress and pretty much zero capabilities externalities.
In the DL paradigm you can’t easily separate capabilities and alignment
This is true for conceptual analysis. Empirically, they can be separated by measurement. Record general capabilities metrics (e.g., general downstream accuracy) and record safety metrics (e.g., trojan detection performance); then check whether an intervention improves the safety goal and whether it improves general capabilities. For various safety research areas there are no externalities. (More discussion of this topic here.)
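The measurement procedure above can be sketched as a small harness: score a model before and after a safety intervention on both a capability metric and a safety metric, then report whether the intervention had capabilities externalities. The metric names, threshold, and numbers below are all hypothetical placeholders for real evaluations.

```python
# Hypothetical harness for separating capabilities and safety by
# measurement. A "model" here is just a dict of eval scores standing
# in for real benchmark runs.

def evaluate(model):
    """Return (capability_score, safety_score) for a model."""
    return model["downstream_accuracy"], model["trojan_detection_auroc"]

def externality_report(before, after, eps=0.01):
    """Compare scores before/after an intervention; flag capability gains above eps."""
    cap0, safe0 = evaluate(before)
    cap1, safe1 = evaluate(after)
    return {
        "safety_gain": safe1 - safe0,
        "capability_gain": cap1 - cap0,
        "has_capability_externality": (cap1 - cap0) > eps,
    }

baseline = {"downstream_accuracy": 0.72, "trojan_detection_auroc": 0.60}
with_intervention = {"downstream_accuracy": 0.72, "trojan_detection_auroc": 0.85}
report = externality_report(baseline, with_intervention)
# Safety improved (~+0.25) with no capability externality.
```

The point is that "do capabilities and safety separate?" becomes an empirical question per intervention, not a matter of conceptual argument.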
forcing that separation seems to constrain us
I think the poor epistemics on this topic have encouraged risk-taking, reduced the pressure to find clear safety goals, and allowed researchers to get away with "trust me, I'm making the right utility calculations and have the right empirical intuitions," which is a very unreliable standard of evidence in deep learning.
The probably-canonical example at the moment is Hyena Hierarchy, which cites a bunch of interpretability research, including Anthropic's work on induction heads. If HH actually delivers what the paper promises, it could enable much longer context.
I don’t think you even need to cite that though. If interpretability wants to be useful someday, I think interpretability has to be ultimately aimed at helping steer and build more reliable DL systems. Like that’s the whole point, right? Steer a reliable ASI.