I feel sceptical about interpretability primarily because imagine that you have neural network that does useful superintelligent things because “cares about humans”. We have found Fourier transform in modular addition network because we already knew what Fourier transform is. But we have veeeery limited understanding of what “caring about humans” is from the math position.
I feel sceptical about interpretability primarily because imagine that you have neural network that does useful superintelligent things because “cares about humans”. We have found Fourier transform in modular addition network because we already knew what Fourier transform is. But we have veeeery limited understanding of what “caring about humans” is from the math position.