I agree that people who could do either good interpretability or conceptual work should focus on conceptual work. Also, to be clear, the rest of this comment is not necessarily a defence of doing interpretability work in particular, but a response to the specific kind of mental model of research you’re describing.
I think it’s important that research effort is not fungible. Interpretability has a pretty big advantage over conceptual work: a) it has tight feedback loops, b) it is much more paradigmatic, and c) it is much easier to get into for people with an ML research background.
Plausibly the most taut constraint in research is not strictly the number of researchers you can fund or train to solve a given problem. It’s hard to get researchers to do good work if they don’t feel intellectually excited about the problem, and that excitement is less likely if they feel like they’re never making any progress, or are constantly unsure about what problem they’re even trying to solve.
To be clear, I am not arguing that we should focus on things just because they are easier to solve; I am very much against doing things that are easy but don’t actually help (“looking under the streetlamp”). Rather, I think what we should be doing is finding things that actually matter and making it easier for people to get excited about them (and people who are able to do this kind of work have a huge comparative advantage here!).
I agree that people who could do either good interpretability or conceptual work should focus on conceptual work.
This seems like a false dichotomy; in general I expect that the best conceptual work will be done in close conjunction with interpretability work or other empirical work.
(In general I think that almost all attempts to do “conceptual” work that don’t involve either empirical results or proofs are pretty doomed. I’d be interested in any counterexamples you’ve seen; my main counterexample is threat modeling, which is why I’ve been focusing a lot on that lately.)
EDIT: many downvotes, no counterexamples. Please provide some.
I agree that doing conceptual work in conjunction with empirical work is good. I don’t know if I agree that pure conceptual work is completely doomed, but I’m at least sympathetic. However, I think my point still stands: I think someone who can do conceptual+empirical work will probably have more impact doing that than not thinking about the conceptual side and just working really hard on empirical work.
They may find some other avenue of empirical work that can help with alignment. I think there probably exist empirical avenues substantially more valuable for alignment than making progress on interpretability, and opening those up requires thinking about the conceptual side.
Even if they think hard about it and can’t think of anything better than conceptual+interpretability, it still seems better for an interpretability researcher to have an idea of how their work will fit into the broader picture. Even if they aren’t backchaining, this still seems more useful than just randomly doing something under the heading of interpretability.
However, I think my point still stands: I think someone who can do conceptual+empirical work will probably have more impact doing that than not thinking about the conceptual side and just working really hard on empirical work.
I agree that “not thinking about the conceptual side” is bad. But that’s just standard science. Like, top scientists in almost any domain aren’t just thinking about their day-to-day empirical research; they have broader opinions about the field as a whole, and more speculative and philosophical ideas, and so on. The difference is whether they treat those ideas as outputs in their own right, versus as inputs that feed into some empirical or theoretical output. Most scientists do the latter; when people in alignment talk about “conceptual work”, my impression is that they’re typically thinking about the former.