Using interpretability tools in the loss function incentivises a startling number of passive circumvention methods.
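To make the setup concrete, here is a minimal sketch (in PyTorch, purely illustrative) of what putting an interpretability tool in the loss function could look like. The `interp_tool` callable is a hypothetical stand-in for whatever method scores the model's internals, not a real API, and the penalty weight is arbitrary.

```python
import torch
import torch.nn.functional as F

def training_loss(model, batch, interp_tool, penalty_weight=0.1):
    """Task loss plus a penalty supplied by an interpretability tool.

    `interp_tool` is a hypothetical stand-in for whatever tool scores
    how suspicious the model's internal representations look; it is
    assumed to return a differentiable scalar.
    """
    inputs, targets = batch
    logits = model(inputs)
    task_loss = F.cross_entropy(logits, targets)

    # The interpretability term: gradient descent is now optimising
    # against the tool itself, which is exactly what opens the door to
    # Goodharting whatever the tool fails to see.
    interp_loss = interp_tool(model, inputs)

    return task_loss + penalty_weight * interp_loss
```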
So why consider using it at all? There are two cases where it might be worthwhile:
If we’re absolutely, 100%, provably sure that our interpretability methods cover every representational detail the AI might be using, then it would be extremely hard or impossible to Goodhart interpretability tools in the loss function. But, for safety, we should probably assume that our tools cannot interpret everything, making passive Goodharting quite likely.
To be confident including interpretability tools in your loss function, you don’t have to believe that there is no way those tools could possibly be Goodharted; you just have to believe that there is no way for gradient descent to find a way of Goodharting them. And the latter condition is far more achievable, since gradient descent is a process we can try to understand and interpret the same way we can try to understand and interpret a trained model. In particular, as I discuss here, if you could understand the process of gradient descent well enough to know why it’s proposing particular modifications to your model, you could use that understanding to determine whether it was responding to your interpretability loss in the right way or not.
This sounds really reasonable. I had only been thinking of a naive version of interpretability tools in the loss function that doesn’t attempt to interpret the gradient descent process. I’d be genuinely enthusiastic about the strong version you outlined. I expect to think a lot about it in the near future.
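As a rough illustration of that strong version, here is a hedged sketch, again in PyTorch, of a training step that inspects the update gradient descent proposes before applying it. The `audit_update` function is a hypothetical placeholder for the kind of understanding described in the quoted comment, i.e. something that can tell whether a proposed modification is responding to the interpretability loss for the right reasons; building that audit is the hard part, and nothing here assumes it exists today.

```python
import torch

def audited_training_step(model, optimizer, batch, loss_fn, audit_update):
    """One training step in which the proposed update is inspected
    before it is applied.

    `audit_update(model, proposed)` is a hypothetical predicate that
    uses an understanding of the gradient descent process to decide
    whether the proposed modification is acceptable.
    """
    optimizer.zero_grad()
    loss = loss_fn(model, batch)
    loss.backward()

    # Snapshot the proposed modification (the gradients) so the audit
    # can reason about why this particular change is being made.
    proposed = {
        name: p.grad.detach().clone()
        for name, p in model.named_parameters()
        if p.grad is not None
    }

    if audit_update(model, proposed):
        optimizer.step()       # accept the update
    else:
        optimizer.zero_grad()  # reject the update, leaving the model unchanged
    return loss.item()
```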