Note: I am extremely open to other ideas on the below take and don’t have super high confidence in it
It seems plausible to me that successfully applying interpretability techniques to increase capabilities might be net-positive for safety.
You want to align the incentives of the companies training/deploying frontier models with safety. If interpretable systems are more economically valuable than uninterpretable systems, that seems good!
It seems very plausible to me that if interpretability never provides any marginal capabilities benefit, the little nuggets of interpretability we do have will be optimized away.
For instance, if you can improve capabilities slightly by letting models reason in latent space instead of in a legible chain of thought, that will probably end up being the default.
There’s probably a good deal of path dependence on the road to AGI, and if capabilities are going to increase regardless, perhaps it’s a good idea to nudge that progress toward interpretable systems.