After writing the post on using transparency regularization to help make neural networks more interpretable, I have become even more optimistic that this is a potentially promising line of research for alignment. This is because I have noticed a few properties of transparency regularization that may allow it to avoid some pitfalls of bad alignment proposals.
To be more specific, in order for a line of research to be useful for alignment, it helps if:

1. The line of research doesn't require unnecessarily large amounts of computation to perform. This allows the technique to stay competitive, reducing the incentive to skip safety protocols.
2. It doesn't require human models to work. This is useful because:
   - Human models are black boxes and are themselves mesa-optimizers.
   - Relying on them would limit us primarily to theoretical work in the present, since data on human cognition is expensive to obtain.
3. Each part of the line of research is recursively legible. That is, if we use the technique on our ML model, we should expect that the technique itself can be explained without appealing to some other black box.
Transparency regularization meets these three criteria respectively, because:

1. It doesn't need to be astronomically more expensive than more typical forms of regularization (a rough sketch of this point follows the list).
2. It doesn't necessarily require human-level cognitive components to work.
3. It is potentially quite simple mathematically, and so definitely meets the recursively legible criterion.
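To illustrate the first point, here is a minimal sketch of what adding a transparency penalty to a training loss might look like. The specific penalty below (an L1 sparsity term on hidden activations) is only a stand-in assumption, since the exact form of the regularizer is an open design question; the point is that its compute cost is on the same order as a standard L2 weight penalty.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in "transparency" penalty: L1 sparsity on hidden activations.
# The real regularizer could take another form; this is only meant to
# show that such a term adds one cheap pass over the activations,
# comparable in cost to ordinary L2 regularization.
def transparency_penalty(activations):
    return sum(a.abs().mean() for a in activations)

class TwoLayerNet(nn.Module):
    def __init__(self, d_in, d_hidden, d_out):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_out)

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        # Expose the hidden activations so the penalty can see them.
        return self.fc2(h), [h]

model = TwoLayerNet(d_in=10, d_hidden=32, d_out=2)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
lam = 1e-3  # regularization strength (assumed hyperparameter)

# One training step on dummy data.
x = torch.randn(64, 10)
y = torch.randint(0, 2, (64,))

opt.zero_grad()
logits, acts = model(x)
loss = F.cross_entropy(logits, y) + lam * transparency_penalty(acts)
loss.backward()
opt.step()
```

Nothing here is astronomically more expensive than the unregularized training step; the extra term costs a few elementwise operations per batch.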