Thomas Kwa comments on VojtaKovarik’s Shortform

Thomas Kwa 4 Feb 2024 21:47 UTC
4 points
The main use of % loss recovered isn’t to directly tell us when a misaligned superintelligence will kill you. In interpretability, we hope to use explanations to understand a model’s internals, so that the circuit we find would contain something like a “can I take over the world” node. In MAD (mechanistic anomaly detection), we don’t aim to understand the internals; the whole point is to detect when the model exhibits new behavior that our explanations don’t account for, and to flag that behavior as potentially dangerous.
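
For readers who haven't seen the metric, here is a minimal sketch of one common way "% loss recovered" gets computed. The comment above doesn't give a formula, so this assumes the causal-scrubbing-style definition: how much of the gap between a baseline (e.g. fully ablated or randomized) loss and the original model's loss is closed when the model is restricted to what the explanation says matters. The function and argument names are hypothetical.

```python
def percent_loss_recovered(loss_model: float,
                           loss_explained: float,
                           loss_baseline: float) -> float:
    """Percent of the baseline-to-model loss gap closed by the explanation.

    Assumed definition (not taken from the comment above):
        100 * (loss_baseline - loss_explained) / (loss_baseline - loss_model)

    loss_model     -- loss of the original, unmodified model
    loss_explained -- loss when only the explained mechanism is preserved
    loss_baseline  -- loss of a baseline with the mechanism destroyed
    """
    gap = loss_baseline - loss_model
    if gap == 0:
        raise ValueError("baseline and model loss are equal; metric is undefined")
    return 100.0 * (loss_baseline - loss_explained) / gap


# Illustrative numbers only, not real measurements.
score = percent_loss_recovered(loss_model=1.0, loss_explained=1.5, loss_baseline=3.0)
print(f"{score:.1f}% loss recovered")
```

Under this reading, a high score says the explanation captures most of the model's performance on the distribution it was measured on. It says nothing by itself about whether a new input falls outside what the explanation covers, which is the part MAD is meant to flag.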