I don’t think you’ve thought through the counterfactual impacts of capabilities evals anywhere near enough. Recall that the ImageNet 1000-class benchmark is what got everything going a little more than a decade ago. I would prefer that all these capabilities evals admit they are not alignment work and no longer be able to brand themselves as such. Your ability to cite “well, they’re already doing bad-capability evals and calling it alignment” is exactly the thing I think is bad, not an argument for why this is fine. Do you have a response other than “yeah, I guess this is concerning”, something that goes into the mechanisms at play so that we can see why this produces an endgame win in more worlds rather than fewer?
OK, I wrote some more thoughts here.