Monitoring the important outputs
Anthropic is betting on automated AI safety research as its plan for transitioning the world safely through transformative AI. That transition looks increasingly near given Claude Code and recent METR results. We need dedicated monitoring infrastructure for the outputs of automated AI safety research, because those are the outputs where getting it wrong carries the highest expected cost.
If AI safety research were a crisp task (ref Leike), monitoring AI outputs would be less of an issue, because a clear success criterion can more easily be checked for Goodharting or other forms of gaming. Solutions to fuzzy tasks, however, are harder to evaluate, since the process may involve an AI inventing new metrics and then optimizing against them. Especially when an AI proposes safety improvements in areas with little human expertise or consensus, making sense of every step of the proposed solution could be particularly challenging. The difficulty grows with superintelligent systems, where some steps may rest purely on the model's research taste.
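To make the asymmetry concrete, here is a minimal Python sketch; every function in it is a hypothetical stand-in chosen for illustration, not real tooling:

```python
# Illustrative sketch of the crisp/fuzzy asymmetry. All functions here
# are hypothetical stand-ins, not real infrastructure.

def crisp_check(candidate_sort) -> bool:
    # Crisp task ("sort this list"): the success criterion is exact and
    # independent of the AI, so gaming it is hard -- the checker IS the goal.
    cases = [[3, 1, 2], [], [5, 5, 1]]
    return all(candidate_sort(c) == sorted(c) for c in cases)

def fuzzy_check(proposal: str, metric) -> float:
    # Fuzzy task ("make future models safer"): if `metric` was proposed by
    # the same AI, a high score certifies the metric, not the actual goal.
    return metric(proposal)

print(crisp_check(sorted))                      # True: verified end to end
print(fuzzy_check("new constitution v2", len))  # 19 -- a score, but of what?
```

In the fuzzy case, the monitor's job is to audit the metric itself, not just the score it returns.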
A dire case for automated AI safety research is when the proposed solution has compounding effects: for example, when the automated research leads to models proposing that the next generation be trained with an LLM-generated constitution or a new method for selecting pre-training data. We would obviously put the new models under the same scrutiny as current ones, but depending on how novel the methods are, they could introduce failure modes we have not anticipated or considered before. And if a forthcoming model is more capable in general, and perhaps especially at conducting AI safety research, finding these flaws becomes harder still.
To my knowledge, monitoring infrastructure is not prepared for these kinds of errors. For the most important tasks we hand off to LLMs, monitoring should be extensive and should probe, along multiple axes, whether a proposed solution will actually produce safe future models. I imagine strong monitoring covering aspects like carefully evaluating an AI's individual decisions within the grand scheme of a task, and assessing whether those decisions could lead to unintended consequences. Collusion between monitor and model is an obvious risk here, and notably, if the monitor itself was shaped by previous rounds of automated safety research, it is subject to the same compounding problem.
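As a rough sketch of what step-level monitoring could look like (the class names, the `independent_monitor` callable, and the escalation threshold are all assumptions made up for this example, not existing infrastructure):

```python
# Sketch of step-level monitoring for an automated-research run. Names,
# signatures, and the 0.3 threshold are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class ResearchStep:
    description: str   # what the AI decided to do at this step
    rationale: str     # the AI's stated reasoning for the decision

@dataclass
class MonitorVerdict:
    step: ResearchStep
    risk: float        # 0.0 (benign) .. 1.0 (likely harmful)
    notes: str

def monitor_run(steps, task_goal, independent_monitor, escalate_above=0.3):
    """Score each step in the context of the overall goal rather than in
    isolation, and collect anything risky for human review.

    Crucially, `independent_monitor` should be a model whose training was
    NOT shaped by earlier rounds of automated safety research, to reduce
    both collusion and the compounding failure mode described above.
    """
    flagged = []
    for i, step in enumerate(steps):
        # The monitor sees the goal plus the full trajectory so far, so it
        # can judge each decision in the grand scheme of the task.
        risk, notes = independent_monitor(task_goal, steps[: i + 1])
        if risk >= escalate_above:
            flagged.append(MonitorVerdict(step, risk, notes))
    return flagged  # a nonempty list should pause the pipeline for audit
```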
A fair objection is that I am merely spelling out the general difficulty of AI alignment and the challenges of monitoring, and that AI doing these tasks does not fundamentally change the problem. The compounding, however, does.
When automated safety research produces methods that shape the next generation of models, errors propagate into the very systems we then rely on to do more safety research. That feedback loop is what makes this different from general monitoring challenges. For automated AI safety research, I care less about jailbreaks or misuse by human users; I care mostly about whether it results in safe future models.
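A back-of-the-envelope calculation shows why the loop is worrying even with small per-round error rates; the 5% figure below is invented purely for illustration:

```python
# Toy arithmetic for the compounding risk; p = 0.05 is an invented number.
# This assumes p stays constant per generation -- optimistic, since a flaw
# that slips through can also degrade the monitor used on the next round.

p = 0.05  # assumed chance an undetected flaw survives monitoring per generation
for n in (1, 3, 5, 10):
    print(f"after {n:2d} generations: P(no inherited flaw) = {(1 - p) ** n:.2f}")
# after 10 generations: P(no inherited flaw) = 0.60
```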