Most AI safety plans include “automating AI safety research.” There’s a need for more clarity about what that actually looks like.
There are at least four things that get conflated in the term “automated research”:
1. AI uses search to output what was already discovered (e.g. it finds the solution in existing paper(s)).
2. AI uses search to find pieces of a solution that come together to solve a problem (hopefully in a verifiable domain, e.g. a Lean proof).
3. AI agents take existing research techniques we already know about and apply them to a variety of new experiments. An example in AI safety would be using insights/techniques from subliminal learning and emergent misalignment to study new dataset splits and models trained in new ways, while applying existing interpretability techniques with an auditor agent (a rough sketch of such a pipeline follows this list).
4. Getting AIs to create novel techniques that substantially improve the domain in question. This is like getting an AI to come up with a new paradigm, one that may change how we even think about that research area.
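To make 3 a bit more concrete, here is a minimal sketch of what an “apply known techniques to new experiments” loop might look like. Everything in it is a hypothetical placeholder (the `Report` dataclass, the `technique` and `auditor` callables); it is not any particular framework’s API, just an illustration of where the automation lives in a 3-level setup.

```python
# A rough, hypothetical sketch of a level-3 pipeline: an agent applies *existing*
# techniques (probes, activation patching, an auditor model) across new models and
# dataset splits. Nothing in the loop invents new methods; that would be level 4.
# All names below are illustrative placeholders, not a real API.

from dataclasses import dataclass
from itertools import product
from typing import Callable, Iterable


@dataclass
class Report:
    model_name: str
    split_name: str
    technique_name: str
    flags_anomaly: bool
    summary: str


def run_level3_sweep(
    models: Iterable[str],
    dataset_splits: Iterable[str],
    techniques: dict[str, Callable[[str, str], dict]],  # name -> known technique
    auditor: Callable[[str, str, str, dict], Report],   # existing auditor agent
) -> list[Report]:
    """Apply already-known techniques to new (model, split) combinations and audit results."""
    findings: list[Report] = []
    for model, split in product(models, dataset_splits):
        for name, technique in techniques.items():
            result = technique(model, split)             # e.g. probe accuracies, patched logits
            report = auditor(model, split, name, result) # summarize + flag using known playbooks
            if report.flags_anomaly:
                findings.append(report)
    return findings
```

The point of the sketch is that everything inside the loop already exists; the automation is in the breadth of (model, split, technique) combinations explored, not in discovering new techniques.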
For AI safety, the crux of many disagreements is whether one believes things like the following:
3 & 4 are meaningfully different in ways that are substantially harder to get 4 than it is to get 3. Some people even seem to fail to disentangle the two and end up convinced that AIs are solving research as some singular thing.
- 4-level capabilities are already in the superintelligence regime, so it defeats the purpose of using them for safety if you don’t have guarantees that they are safe.
- When talking about superintelligence (the kind that can, e.g., start and grow entire large-scale businesses on its own, solve long-term complex goals like eliminating cancer, and deal with any change in the world that goes beyond its initial training data), AI safety research needs novel paradigm-level breakthroughs (4) to reduce risks down to acceptable levels. Meaning you might expect 3 to be too within-paradigm, too relatively unenlightened, to be enough.
- 4 is unneeded for a safe transition. Some folks seem to believe that 3 (which could be described as “relatively unenlightened” research) will be enough to align every subsequent AI, even once we are past 4.
- Some folks believe that scaffolding and inference compute at a not-much-higher level of capability are all you need to get 4, and that you’ll be fine from a safety perspective because the models are currently useful for research and don’t seem misaligned.
- Some seem to believe that 3 may produce good research output (within that set of possible experiments), but that you will basically get slop for anything in 4 (anything truly out-of-distribution). So the AI gets put through the wringer and believes it has substantially made the next model safe, but, because it is incapable of generalizing well OOD, it fails to align a 4-level model. It has good intentions, but it basically only does good safety work for 3-level models and totally fails at generating sufficient safety research techniques for aligning a 4-level one. It just slops itself into a disaster.
- Even if 3 is helpful, it doesn’t end up meaningfully speeding up safety research relative to the pace of progress toward superintelligent capabilities.
- 4 involves the AI continually updating its weights, consolidating insights, and placing them neatly within its world model. 3 has some sort of disjointed world model that can’t be overcome with fancy scratchpadding and RAG (imagine an AI with a 2023 knowledge cutoff into which you RAG 2026 research: it’s missing *years* of build-up in its world model). 3 is suitable for following templates and interpolating within what we know, but it fails to *understand* what is OOD.
Ultimately, this seems like a highly important question to clarify, since I believe it is driving many people to be optimistic about AI safety progress, at least to the point that it allows them to keep chugging along the capabilities tech tree. Getting clarity, much sooner, on what would convince people otherwise seems important.