The biggest thing that worries me about the idea of interpretability, which you mention, is that any sufficiently low-level interpretation of a giant, intractably complex AGI-level model would likely itself be intractably complex. And so would any interpretation of that interpretation, and so on, until you get the feeling that you'd need an AI to interpret the interpretation, and then an AI to interpret the interpreter, and so on in a chain which you might try to carefully validate, but which increasingly looks like a typical Godzilla Strategy. This does not lead to rising property values in Tokyo.
That said, maybe it can be done, and even made reliable enough. But it would also significantly enhance our ability to distil models: if you could take a NN-based model, interpret it, and map it onto a GOFAI-style, extremely interpretable system, you would probably have a much faster, leaner and cleaner version of the same AI, so you could probably just build an even bigger AI. The question then becomes whether this style of interpretability could ever catch up with the increase in capabilities it would automatically foster.
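To make the distillation point concrete in miniature: here's a toy sketch (in Python with scikit-learn, and with a shallow decision tree standing in, very loosely, for a "GOFAI-style interpretable system") of training a legible surrogate to mimic an opaque network's behaviour. All the names and numbers are illustrative, not anyone's actual pipeline, and nothing here is meant to suggest this scales to AGI-level models; that scaling question is exactly the worry above.

```python
# Toy sketch: distil an opaque NN into a small, human-readable rule set.
# Everything below is illustrative; a decision tree is only a crude
# stand-in for a genuinely GOFAI-style system.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# The opaque "NN-based model" we want to interpret.
nn = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
nn.fit(X, y)

# Distil: fit the surrogate on the NN's own predictions, so the tree
# approximates the NN's behaviour rather than the raw labels.
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X, nn.predict(X))

# The result is a legible rule set that is also far cheaper to evaluate
# than a forward pass through the network.
print(export_text(tree))
print("fidelity to NN:", tree.score(X, nn.predict(X)))
```

Note the double edge, which is the point: the `export_text` dump is the interpretability win, while the "cheaper to evaluate" part is the capability win that comes bundled with it.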