Any research rebalances the mix of currently legible research directions that could be handed off to AI-assisted alignment researchers or early autonomous AI researchers whenever they show up. Even hopelessly incomplete research agendas could still be used to prompt future capable AIs to focus on them, while in the absence of such incomplete research agendas we’d need to rely on AI’s judgment more completely. So it still makes sense to prioritize things that have no hope at all of becoming practical for decades (with human effort alone), in order to make as much partial progress as possible in developing (and deconfusing) them over the next few years.
In this sense, current human research, however far from practical usefulness, forms the data for aligning the early AI-assisted or AI-driven alignment research efforts. The judgment of the human alignment researchers working today makes it possible to formulate more knowably useful prompts for future AIs, nudging them toward actually developing practical alignment techniques.
This sort of approach doesn’t make as much sense for research explicitly aimed at changing the dynamics of the critical period (described below). Having an alternative, safer idea almost ready to go (with explicit support from some fraction of the AI safety community) is very different from having some ideas that the AI could then elaborate.
With AI assistance, how ready-to-go an alternative ends up being can differ a lot from its prior human-developed state. Also, an idea that’s ready to go is not yet an edifice of theory and software that’s ready to replace 5e28-FLOP transformer models, so some level of AI assistance is still necessary on 2-year timelines. (I’m not necessarily arguing that 2-year timelines are correct, but it’s the kind of assumption my argument should survive.)
The critical period includes the time when humans are still in effective control of the AIs, or when vaguely aligned and properly incentivised AIs are in control and are genuinely trying to help with alignment, even if their natural development and increasing power would push them out of that state soon thereafter. During this time, the state of the current research culture shapes the path-dependent outcomes. Superintelligent AIs that are reflectively stable will no longer allow path dependence in their further development, but before that happens the dynamics can be changed to an arbitrary extent, especially with AI efforts as leverage for implementing the changes in practice.
in the absence of such incomplete research agendas we’d need to rely on AI’s judgment more completely
This is a key insight, and I think that operationalising or pinning down the edges of a new research area is one of the longest time-horizon projects there is. If the METR time-horizon estimate is accurate, then developing research directions remains a distinct value-add even after AI research is semi-automatable.
I haven’t heard this said explicitly before, but it helps me understand your priorities a lot better.
Okay, this prompted me to turn the comment into a post, maybe this point is actually new to someone.