I’ve heard people say we should deprioritise fundamental & mechanistic interpretability[1] in short-timelines (automated AI R&D) worlds. This seems not obvious to me.
The usual argument is:
1. Fundamental interpretability will take many years or decades until we “solve interpretability” and the research bears fruit.
2. Timelines are short; we don’t have many years, let alone decades.
3. Thus we won’t solve interpretability in time.
But this forgets that automated AI R&D means we’ll have decades of subjective research-time in months or years of wall-clock time! (This is why timelines would be short in the first place.) I expect mechanistic interpretability research to be about as automatable as other AI R&D, maybe even more automatable because it’s less bottlenecked by hardware. Thus interpretability progress should speed up by a similar factor as capabilities, cancelling out the shorter timelines argument.
So short timelines (via automated AI R&D) by themselves don’t make interpretability less useful.
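To make the cancellation explicit, here is a minimal back-of-the-envelope sketch (the year and speed-up figures are purely illustrative assumptions, not numbers from the argument above): the same automation that shortens wall-clock timelines also multiplies the research done per calendar year, so the subjective research-time available to interpretability can stay roughly constant or even grow.

```python
# Minimal back-of-the-envelope sketch of the cancellation argument.
# All numbers are illustrative assumptions, not claims from the post.

def subjective_research_years(wall_clock_years: float, speedup: float) -> float:
    """Subjective research-time ~= wall-clock time x automation speed-up."""
    return wall_clock_years * speedup

# Long-timelines world: many calendar years, mostly human-speed research.
long_world = subjective_research_years(wall_clock_years=15.0, speedup=1.0)

# Short-timelines world: few calendar years, but automated AI R&D speeds up
# interpretability research too (assumed 10x here).
short_world = subjective_research_years(wall_clock_years=2.0, speedup=10.0)

print(f"long timelines:  {long_world:.0f} subjective research-years")
print(f"short timelines: {short_world:.0f} subjective research-years")
```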
In this shortform I’m not necessarily arguing for interpretability (that’ll be a later post—interpretability might be quite hard), and I also want to acknowledge that some agendas (e.g. AI control, evals & governance) become more important in short timelines worlds.
I’m talking about the new fundamental approaches (computation in superposition, new toy models, new decomposition ideas etc.), specifically not about scaling existing sparse dictionary learning methods like SAEs/CLTs. ↩︎
That’s basically my argument in this post, and it applies to most AI-risk-related activities that would naively need to progress much further than is likely possible before takeoff. So not just more careful kinds of interpretability, but all sorts of things, such as control-enhancing automated governance of AI bureaucracies, or agent foundations / decision theory, or saner definitions of potential eutopias.
That is, shortly before takeoff AIs might be able to prioritize and complete your project, but only if you aim them at the particular things you’ve worked out so far. You can only start asking the right questions (without relying on AIs asking these questions themselves) by already being deconfused enough through previous human effort.
> But this forgets that automated AI R&D means we’ll have decades of subjective research-time in months or years of wall-clock time!
It’s only the AIs that straightforwardly get the decades of subjective research-time, while we don’t. Humans would have to struggle to understand what the AIs are developing, in order to have a chance of meaningfully directing their efforts, while to a large extent being at the mercy of AI advice about how to think about what’s going on.
Thanks for flagging this, I missed that post! The advice in the post & its comments is very useful, especially considerations like preparing to aim the AIs, setting oneself up to provide oversight to many AI agents, and whether we’ll understand what the AIs are developing.
I think in short timelines we should relatively deprioritise longer-term research bets because:
1. These bets are clearly strictly better in long timelines, where you get both the AI R&D acceleration and the time leading up to it.
2. It’s unclear how useful work done in advance will be for providing a head start to AIs automating AI R&D. E.g., maybe if things go well these AIs reinvent all prior progress very quickly, so the key thing is getting the meta-level process of this automation to work. One key question is whether the earlier work is key for supervising AIs or helps get tons of value from relatively weak AIs.
3. It’s unclear when automated AI R&D yields large acceleration relative to when risk emerges, and AIs might be scheming and trying to sandbag at the first point when they would otherwise be very useful. (So, easier-to-set-up methods which aim to ensure these AIs don’t sabotage work seem relatively more useful in short timelines.)
I think you probably agree that short timelines make longer-term research bets somewhat less useful (as you noted, some other agendas become more important in short timelines). So, this is ultimately a quantitative question. I feel tempted by a perspective in which research that fully depends on massive serial acceleration gets deprioritized by ~3x conditional on short timelines (maybe <3 years) relative to >10-year timelines, due to these factors.
To the extent AIs are not already robustly aligned superintelligences, the priorities they would put on AI-risk-related projects of their own initiative might be suboptimal for our purposes. If humans already have their R&D priorities straight (based on previous humans-substantially-in-the-loop research), they might be able to keep the AIs working on the right things, even if the AIs don’t have sufficient propensity to go there spontaneously.