If we want more talented people in AI safety, we should focus on creating new programs. I think most readers will find this easy to defend, but I want to quickly write out my thoughts.
Even ambitious scaling of MATS, Astra, and similar programs won’t produce enough AI safety researchers to solve the field’s core problems.
Even if the existing programs scaled aggressively, I think there would still be lots of potential contributors who would not apply to them. From my experience talking with ambitious, driven people who could work in AI safety, the reasons can be logistical (not wanting to relocate, not wanting part- or full-time work, financial security, etc.), but they can also be subject-specific, like not feeling part of a program's target demographic. Because of the latter especially, we should think about new programs that target specific demographics more narrowly.
Some interesting ideas:
- geographically diverse programs (or initiatives that allow people doing SPAR in non-hub regions to co-work together)
- open-source or hacking communities (Apart and EleutherAI might already do this, but the effort could be a lot stronger given how capable that community is)
- startup/founder communities (there were a few programs like Catalyze Impact, def/acc, or Seldon Lab, but these could be a lot stronger given how important that community could be for AI safety)
- "finance bros" and consultants (Consultants for Impact might do this, but it could maybe be stronger given that a few top contributors in AI governance have a consulting background)
- debaters, quant traders, historians, military history enthusiasts, the OSINT community, MUN or model-NATO enthusiasts, and writing communities
These are just off the top of my head; I could easily be missing far more important communities that are not covered by existing efforts. I also wouldn't recommend running programs in any of these areas unless you have a strong belief that your program will lead to participants contributing positively to AI safety.
One strong objection is that it is not clear how many potential contributors we are actually losing because of this. Especially for programs that target candidates already familiar with AI GCR: would they really not apply to a fellowship just because they don't feel like part of its professional demographic?
Another consideration is that reputation might trump domain specificity. Programs that scale and exist for longer can build prestige and a track record, which in turn unlocks a demographic of applicants who care about participating in prestigious, established programs.
This argument also goes the other way: new programs attract a specific demographic of people who have high trust in the community and are excited to take risks on new projects. From my experience participating in ARENA's first cohort and organizing Pivotal's first fellowship, this might loosely correlate with "success in AI safety." It would be interesting to analyze data across fellowships; if someone is interested in doing this, please reach out.
I still believe it is worth testing whether new programs in any of these less explored areas could unlock a stream of AI safety talent that we otherwise wouldn't have. If you are excited about doing this, please reach out!
Finally, scaling existing programs is also clearly part of growing the AI safety field, and this post should not be viewed as a case against doing so.
The claim (mine) that “Attracting and accommodating a wider variety of participants is one reason to think offering a wider variety of fellowships is valuable” is true, but not sufficient for your stronger claim: “If we want more talented people in AI safety we should focus on creating new programs”, which you don’t actually argue for, and which would require a more thorough accounting of relevant factors than you offer here.
What is your opinion on this possible objection:

Even granting that it would be good if many more people worked on AI safety questions, there is still, practically, a bottleneck around (paid) positions, as the field is funding-constrained. Thus, even if you scale these training programs now, the immediate question for most folks is what they will do after finishing the program. If there is no near-term job where they can work on AI safety, they will probably lose most of the acquired skills and/or drop out of the field again, nullifying the effort put into training them.
Monitoring the important outputs

Anthropic is making a bet on automated AI safety research as its plan for transitioning the world safely through transformative AI. With Claude Code and recent METR updates, this seems increasingly near.
We need dedicated monitoring infrastructure for the outputs of automated AI safety research, because those are the outputs where getting it wrong has the highest expected cost.
If AI safety research were a crisp task (ref. Leike), I think monitoring AI outputs would be less of an issue, because a clear success criterion can more easily be checked for being Goodharted or otherwise gamed.
However, solutions to fuzzy tasks are more difficult to evaluate, since the process might involve an AI coming up with new metrics and then trying to solve for them. Especially if an AI comes up with ideas to improve safety in areas with little expertise or consensus, making sense of every step within the proposed solution could be particularly challenging. This difficulty increases with superintelligent systems, where some intuitions might be based purely on a model's research taste.
A dire case for automated AI safety research is when the proposed solution has compounding effects: for example, if the automated research leads to AI models proposing that the next iteration of models be trained with, e.g., a new LLM-generated constitution or a new method of selecting pre-training data. It seems obvious that we would put the new models under similar scrutiny as current models; however, depending on the novelty of the new methods, I believe this could lead to new failure modes that we haven't anticipated or considered before. Especially if a forthcoming model is generally more performant, and perhaps particularly so at conducting AI safety research, it could become increasingly challenging to find these flaws.
To my knowledge, monitoring infrastructure is not prepared for these types of errors. For the most important tasks we hand off to LLMs, monitoring should be extensive and should check, in multiple ways, that the proposed solution will actually produce safe future models.
I imagine strong monitoring covering aspects like carefully evaluating an AI's individual decisions in the context of the overall task, or assessing whether those decisions could lead to unintended consequences. Collusion between monitor and model is an obvious risk here as well, and notably, if the monitor itself is shaped by previous rounds of automated safety research, it is subject to the same compounding problem.
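To make the shape of this more concrete, here is a minimal Python sketch of the kind of gating I have in mind. Everything in it is hypothetical: the class names, the "lineage" bookkeeping, and especially the placeholder review check stand in for much richer evaluations, so treat it as an illustration of the structure rather than a description of any existing infrastructure.

```python
from dataclasses import dataclass, field


@dataclass
class ProposedChange:
    """Hypothetical output of automated safety research, e.g. a new
    LLM-generated constitution or a new pre-training data filter."""
    description: str
    lineage: set = field(default_factory=set)  # automated-research rounds it builds on


@dataclass
class Monitor:
    """Hypothetical monitor that reviews a proposed change."""
    name: str
    lineage: set = field(default_factory=set)  # rounds that shaped this monitor

    def review(self, change: ProposedChange) -> bool:
        # Placeholder check. A real monitor would evaluate individual decisions
        # in the context of the overall task, look for Goodharted or
        # self-invented metrics, and assess unintended downstream consequences.
        return "evaluation" in change.description.lower()


def independent_monitors(change: ProposedChange, monitors: list) -> list:
    """Drop monitors that share lineage with the proposal: a monitor shaped by
    the same rounds of automated research may inherit the same blind spots
    (and collusion becomes easier)."""
    return [m for m in monitors if not (m.lineage & change.lineage)]


def gate(change: ProposedChange, monitors: list, required: int = 2) -> bool:
    """Accept the change into the next training run only if enough
    lineage-independent monitors approve it."""
    eligible = independent_monitors(change, monitors)
    return sum(m.review(change) for m in eligible) >= required


# Toy usage: a proposal from research round 3, reviewed by three monitors,
# one of which was itself shaped by round 3 and is therefore excluded.
proposal = ProposedChange("New data filter, with a held-out evaluation plan", {3})
monitors = [Monitor("m1", {1}), Monitor("m2", {2}), Monitor("m3", {3})]
print(gate(proposal, monitors))  # True: m1 and m2 both approve
```

The point of the lineage check is simply that a monitor shaped by the same round of automated research as the proposal it reviews should not count toward approval, since it may share that proposal's blind spots.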
A valid objection is that I am just spelling out the general difficulty of AI alignment and of monitoring, and that AI doing these tasks does not fundamentally change the problem. However, the compounding does.
When automated safety research produces methods that shape the next generation of models, errors propagate into the systems we then rely on to do more safety research. That feedback loop is what makes this different from general monitoring challenges. For automated AI safety research, I care less about jailbreaks or misuse by human users; I care mostly about whether it results in safe future models.