I don’t follow why we shouldn’t use neural networks to find isomorphic structures. Yes, asking a liar if he’s honest is silly; but (1) asking a traitor to annotate their thought transcripts still makes it harder for them to plan a coup, so long as we can check at least some of their work; and (2) if current networks can annotate current networks, we might fix (freeze) the annotated network’s structure and scale its components up in a way less likely to cause a paradigm shift.
A useful heuristic: for alignment purposes, most of the value of an interpretability technique comes not from being able to find things, but being able to guarantee that we didn’t miss anything. The ability to check only some of the work has very little impact on the difficulty of alignment—in particular, it means we cannot apply any optimization pressure at all to that interpretability method (including optimization pressure like “the humans try new designs until they find one which doesn’t raise any problems visible to the interpretability tools”). The main channel through which partial interpretability would be useful is if it leads to a method for more comprehensive interpretability.
Plus, structures are either isomorphic or they’re not. If they are, humans examining them should be able to check the annotator’s work and figure out why it claimed they were isomorphic in the first place. On this view the AI is a way of focusing human attention more efficiently on the interesting parts of the neural net, rather than a way of doing the work for us, which would not improve our own ability to interpret at all!
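The asymmetry implicit here, that verifying a claimed isomorphism is cheap even though finding one is hard, can be made concrete with a toy sketch. The graphs and the claimed node mapping below are hypothetical stand-ins, not actual network structures:

```python
# Sketch of the "checking beats finding" asymmetry: given two undirected
# graphs and an annotator's claimed node mapping, verifying the claim is
# a simple edge-set comparison, even though searching for a mapping is
# hard in general. Toy data, purely illustrative.

def is_valid_isomorphism(edges_a, edges_b, mapping):
    """Check that `mapping` (node of A -> node of B) is injective and
    maps the edge set of A exactly onto the edge set of B."""
    # Injectivity check: distinct nodes must map to distinct targets.
    if len(set(mapping.values())) != len(mapping):
        return False
    # Normalize undirected edges as frozensets so direction is ignored.
    a = {frozenset(e) for e in edges_a}
    b = {frozenset(e) for e in edges_b}
    mapped = {frozenset(mapping[u] for u in e) for e in a}
    return mapped == b

# Toy example: a 3-node path graph, relabeled.
path_a = [(0, 1), (1, 2)]
path_b = [("x", "y"), ("y", "z")]

good_claim = {0: "x", 1: "y", 2: "z"}   # correct relabeling
bad_claim = {0: "y", 1: "x", 2: "z"}    # wrong: breaks the edge structure

print(is_valid_isomorphism(path_a, path_b, good_claim))  # True
print(is_valid_isomorphism(path_a, path_b, bad_claim))   # False
```

The check runs in time linear in the number of edges, whereas the search over mappings is exponential in the worst case; that gap is exactly why checking an annotator's claims can be tractable even when we couldn't have produced the annotations ourselves.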