I agree with this. I think people are bad at running that calculation, and at consciously turning down status in general, so I advocate for this position because I think it's basically true for many.
Most mechanistic interpretability is not in fact focused on the specific sub-problem you identify; it's wandering around in a billion-parameter maze, taking note of things that look easy & interesting to understand, and telling people to work on understanding those things. I expect this to produce far more capabilities-relevant insights than alignment-relevant insights, especially when compared to worlds where Neel et al went in with the sole goal of separating out theories of value formation, and then did nothing else.
There’s a case to be made for exploration, but the rules of the game get wonky when you’re trying to do differential technological development. There is then strategically relevant information you would rather not know.
I expect this to produce far more capabilities-relevant insights than alignment-relevant insights, especially when compared to worlds where Neel et al went in with the sole goal of separating out theories of value formation, and then did nothing else.
I assume here you mean something like: given how most MI projects seem to be done, the most likely output of all these projects will be concrete interventions to make it easier for a model to become more capable, and these concrete interventions will have little to no effect on making it easier for us to direct a model towards having the ‘values’ we want it to have.
I agree with this claim: capabilities generalize very easily, while it seems extremely unlikely that there will be ‘alignment generalization’ of the kind we intend by default. So the most likely outcome of more MI research does seem to be interventions that remove obstacles standing in the way of AGI, while not actually making progress on ‘alignment generalization’.
Indeed, this is what I mean.