scasper (Stephen Casper)
The 6D effect: When companies take risks, one email can be very powerful.
Deep Forgetting & Unlearning for Safely-Scoped LLMs
[Linkpost] A survey on over 300 works about interpretability in deep networks
Takeaways from the Mechanistic Interpretability Challenges
Analogies between scaling labs and misaligned superintelligent AI
Open Problems and Fundamental Limitations of RLHF
EIS V: Blind Spots In AI Safety Interpretability Research
EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety
The Engineer’s Interpretability Sequence (EIS) I: Intro
Eight Strategies for Tackling the Hard Part of the Alignment Problem
Existential AI Safety is NOT separate from near-term applications
EIS VII: A Challenge for Mechanists
Dissolving Confusion around Functional Decision Theory
Where to be an AI Safety Professor
EIS IX: Interpretability and Adversaries
Deep Dives: My Advice for Pursuing Work in Research
EIS II: What is “Interpretability”?
I get the impression of a certain motte-and-bailey dynamic in this comment and similar arguments. From a high level, the notion of better understanding what neural networks are doing is great. The problem, though, is that most state-of-the-art interpretability research does not seem to be doing a good job of this in a way that will be useful for safety anytime soon. In that sense, I think this comment talks past the points that this post is trying to make.
My answer to this is actually tucked into one paragraph on the 10th page of the paper: “This type of approach is valuable...reverse engineering a system”. We cite examples of papers that have used interpretability tools to generate novel adversaries, to aid in manually fine-tuning a network to induce a predictable change, or to reverse engineer a network. Here they are (with a toy sketch of the adversary-generation idea after the links).
Making adversaries:
https://distill.pub/2019/activation-atlas/
https://arxiv.org/abs/2110.03605
https://arxiv.org/abs/1811.12231
https://arxiv.org/abs/2201.11114
https://arxiv.org/abs/2206.14754
https://arxiv.org/abs/2106.03805
https://arxiv.org/abs/2006.14032
https://arxiv.org/abs/2208.08831
https://arxiv.org/abs/2205.01663
Manual fine-tuning:
https://arxiv.org/abs/2202.05262
https://arxiv.org/abs/2105.04857
Reverse engineering (I’d put an asterisk on these, though, because I don’t expect methods like this to scale well beyond toy problems):
https://distill.pub/2020/circuits/curve-detectors/
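To make the first category concrete, here is a minimal sketch of the adversary-generation idea: pick a unit inside the network, identified via some interpretability method, and run gradient ascent on a small input perturbation to excite it. This is my own illustration, not code from any of the papers above; the model, layer, channel index, and perturbation budget are all arbitrary assumptions for the example.

```python
# Toy sketch: use an interpretability signal (a chosen internal channel's
# activation) to guide a gradient-based search for an adversarial input.
# Requires torchvision >= 0.13 for the `weights=` API.
import torch
import torchvision.models as models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

# Hook an intermediate layer so we can read its activations on each forward pass.
acts = {}
def hook(_module, _inputs, output):
    acts["layer"] = output
model.layer3.register_forward_hook(hook)

x = torch.rand(1, 3, 224, 224)           # stand-in for a natural image
delta = torch.zeros_like(x, requires_grad=True)
opt = torch.optim.Adam([delta], lr=1e-2)

UNIT = 7                                  # hypothetical channel to excite
for _ in range(100):
    opt.zero_grad()
    model((x + delta).clamp(0, 1))
    # Maximize the chosen channel's mean activation (minimize its negation).
    loss = -acts["layer"][0, UNIT].mean()
    loss.backward()
    opt.step()
    with torch.no_grad():
        # Keep the perturbation inside a small L-infinity ball.
        delta.clamp_(-8 / 255, 8 / 255)

adversary = (x + delta).clamp(0, 1).detach()
```

The same loop with a different objective (say, an output logit instead of an internal channel) is just a standard gradient-based attack; the interpretability angle is that the target was chosen by looking inside the model.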
There seems to be high variance in the scope of the challenges that Katja has been tackling recently.