I don’t know what Richard thinks, but I had a similar reaction when reading this. The way I would phrase it is that for the “numbers go up” approach to be meaningful, you have to be sure that the number going up is in fact tracking the real thing you care about. Without a solid understanding of what you’re working on, it’s easy to choose the wrong target. That doesn’t mean the exercise can’t be informative; it just means (imo) that as you do it, you should track the hypothesis that you’re not measuring what you think you’re measuring. For instance, I would be tracking the hypothesis that features aren’t necessarily the right ontology for the mentalese of language models: perhaps dangerous mental patterns are hidden from us in a different computational form, one which makes the generalization from “implanted alignment issues” to “natural ones” weak and inconclusive.
I know alignment auditing doesn’t necessarily rely on features per se, but I think this fundamental issue will persist until we have a solid understanding of how neural networks work. And I think it severely limits the conclusions we can draw from tests like this. E.g., even if the alignment audit found the planted problem 100% of the time, I would still be pretty hesitant to conclude that a new model which passed the audit is aligned. Not only because, without such a science, my guess is that the audit ends up measuring the wrong thing, but also because measuring the wrong thing is especially problematic here. I.e., part of what makes alignment so hard is that we might be dealing with a system optimizing against us, such that we should expect our blind spots to be exploited. And absent a good argument for why we no longer have blind spots in our measurements (ways for the system to hide dangerous computation from us), I am skeptical that techniques like this provide much assurance against advanced systems.