Aren’t the central examples of founders in AI Safety the people who founded Anthropic, OpenAI and arguably Deepmind? Right after that, Mechanize comes to mind.
I am not fully sure what you mean by founders, but it seems to me that the best organizations were founded by people who also wrote a lot, and generally developed a good model of the problems in parallel to running an organization. Even this isn’t a great predictor. I don’t really know what is. It seems like generally working in the space is just super high variance.
To be clear, overall I do think many more people should found organizations, but the arguments in this post seem really quite weak. The issue is really not that otherwise we “can’t scale the AI Safety field”. If anything it goes the other way around! If you just want to scale the AI safety field, go work at one of the existing big organizations like Anthropic, or Deepmind, or Far Labs or whatever. They can consume tons of talent, and you can probably work with them on capturing more talent (of course, I think the consequences of doing so for many of those orgs would be quite bad, but you don’t seem to think so).
Also, to expand some more on your coverage of counterarguments:
If outreach funnels attract a large number of low-caliber talent to AI safety, we can enforce high standards for research grants and second-stage programs like ARENA and MATS.
No, you can’t, because the large set of people you are trying to “filter out” will now take an adversarial stance towards you as they are not getting the resources they think they deserve from the field. This reduces the signal-to-noise ratio of almost all channels of talent evaluation, and in the worst case produces quite agentic groups of people actively trying to worsen the judgement of the field in order to gain entry.
I happen to have written a lot about this just this week: Paranoia: A Beginner’s Guide, for example, has an explanation of lemons markets that applies straightforwardly to grant evaluations and program applications.
This is a thing that has happened all over the place; see, for example, the pressure on elite universities to drop admission standards and continue grade inflation, coming from the many people who are now part of the university system but wouldn’t have been in previous decades.
Summoning adversaries, especially ones that have built an identity around membership in your group, should be done very carefully. See also Tell people as early as possible it’s not going to work out, which I also happen to have published this week.
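To make the lemons-market dynamic concrete, here is a minimal Python sketch; the setup, the numbers, and the admitted_quality helper are my own illustrative assumptions, not anything from the linked post. The point is only the direction of the effect: the evaluator’s bar stays the same, but the signal it is applied to gets worse once part of the pool optimizes against it.

```python
# Illustrative sketch only: adversarial applicants degrading a noisy evaluation channel.
import numpy as np

rng = np.random.default_rng(0)

def admitted_quality(n, frac_adversarial, gaming_boost, noise=1.0, admit_frac=0.1):
    """Evaluator sees true quality plus noise; some low-quality applicants also
    game the observed signal by `gaming_boost`. Returns the mean true quality of
    the admitted cohort (the top `admit_frac` of observed signal)."""
    quality = rng.normal(0.0, 1.0, n)
    gaming = (rng.random(n) < frac_adversarial) & (quality < 0)
    signal = quality + rng.normal(0.0, noise, n) + gaming_boost * gaming
    cutoff = np.quantile(signal, 1 - admit_frac)
    return quality[signal >= cutoff].mean()

print(admitted_quality(20_000, frac_adversarial=0.0, gaming_boost=2.0))  # honest applicant pool
print(admitted_quality(20_000, frac_adversarial=0.5, gaming_boost=2.0))  # heavily gamed pool
```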
subsequently, frontier AI companies grew 2-3x/year, apparently unconcerned by dilution.
Yes, and this was, of course, quite bad for the world? I don’t know; maybe you are trying to model AI safety as some kind of race between AI Safety and the labs, but I think this largely fails to model the state of the field.
Like, again, man, do you really think the world would be at all different in terms of our progress on safety if everyone who works on the kind of applied safety that is supposedly so scalable had just never worked there? Kimi K2 is basically as aligned and as likely to be safe when scaled to superintelligence as whatever Anthropic is cooking up today. The most you can say is that safety researchers have been succeeding at producing evidence about the difficulty of alignment, but of course that progress has been enormously set back by all the safety researchers working at the frontier labs which the “scaling of the field” is just shoveling talent into, which has pressured huge numbers of people to drastically understate the difficulty and risks from AI.
Many successful AI safety founders work in research-heavy roles (e.g., Buck Shlegeris, Beth Barnes, Adam Gleave, Dan Hendrycks, Marius Hobbhahn, Owain Evans, Ben Garfinkel, Eliezer Yudkowsky, Nate Soares) and the status ladder seems to reward technical prestige over building infrastructure.
I mean, and many of them don’t! CEA has not been led by people with research experience for many years, and man, I would give so much to have ended up in a world that went differently. IMO Open Phil’s community building has deeply suffered from a lack of situational awareness and strategic understanding of AI, and so has massively dropped the ball. I think MATS’s biggest problem is roughly that approximately no one on the staff is themselves a great researcher, or even attempts to do the kind of work you try to cultivate, which makes it much harder for you to steer the program.
Like, I am again all in favor of people starting more organizations, but man, we just need to understand that we don’t have the forces of the market on our side. This means the premium we get from having organizations steered by people who have their own internal feedback loop and their own strategic map of the situation, which requires actively engaging with the core problems of the field, is much greater than it is in YC and the open market. The default outcome if you encourage young people to start an org in “AI Safety” is to just end up with someone making a bunch of vaguely safety-adjacent RL environments that get sold to big labs, which my guess is makes things largely worse (I am not confident in this, but I am pretty confident it doesn’t make things much better).
And so what I am most excited about is people who do have good strategic takes starting organizations; to demonstrate that they have those takes, and to develop the necessary skills, they need to write and publish publicly (or at least receive mentorship for a substantial period of time from someone who does).
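Thanks for reading and replying! I’ll be brief: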
I consider the central examples of successful AI safety org founders to be Redwood, METR, Transluce, GovAI, Apollo, FAR AI, MIRI, LawZero, Pattern Labs, CAIS, Goodfire, Palisade, BlueDot, Constellation, MATS, Horizon, etc. Broader-focus orgs like 80,000 Hours, Lightcone, CEA and others have also had large impact. Apologies to all those I’ve missed!
I definitely think founders should workshop their ideas a lot, but this is not necessarily the same thing as publishing original research or writing on forums. Caveat: research org founders often should be leading research papers.
I don’t think that a great founder will have more impact in scaling the AI safety research field by working at “Anthropic, GDM, or FAR Labs” relative to founding a new research org or training program.
Maybe I’m naive about how easy it is to adjust standards for grantmakers or training programs. My experience with MATS, LISA, and Manifund has involved a lot of selection, and the bar at MATS has risen every program for 4 years now, but I don’t feel a lot of pressure from rejected applicants to lower our standards. Maybe this will come with time? Or maybe it’s an ecosystem-wide effect? I see the pressure to expand elite university admissions as unideal, but not a field-killer; plus, AI safety seems far from this point. I acknowledge that you have a lot of experience with LTFF and other selection processes.
I don’t think AI companies scaling 2-3x/year is good for the world. I do think AI safety talent failing to keep up is bad for the world. It’s not so much an adversarial dynamic as a race to lower the alignment tax as much as possible at every stage.
I don’t think that Anthropic’s safety work is zero value. I’d like to see more people working on ASL-4/5 safety at Anthropic and Kimi, all else equal. I’d also like to see more AI safety training programs supplying talent, nonprofit orgs scaling auditing and research, and advocacy orgs shifting public perception.
I’m not sure how to think about CEA (and I lack your information here), but my first reaction is not “CEA should have been led by researchers.” I also don’t think Open Phil is a good example of an org that lacked researchers; some of the best worldview investigations research imo came from Open Phil staff or affiliates, including Joe Carlsmith, Ajeya Cotra, Holden Karnofsky, Carl Shulman, etc. (edit: which clearly informed OP grantmaking).
I’m more optimistic than you about the impact of encouraging more AI safety founders. I’m particularly excited by Halcyon Future’s work in helping launch Goodfire, AIUC, Lucid Computing, Transluce, Seismic, AVERI, Fathom, etc. To date, I know of only two such RL dataset startups that spawned via AI safety (Mechanize, Calaveras) in contrast to ~150 AI safety-promoting orgs (though I’m sure there are other examples of AI safety-detracting startups).
I fully endorse more potential founders writing up pitches or theories of change for discussion on LW or founder networks! I think this can only strengthen their impact.
the bar at MATS has risen every program for 4 years now
What?! Something terrible must be going on in your mechanisms for evaluating people (which, to be clear, isn’t surprising; indeed, you are the central target of the optimization that is happening here, but like, to me it illustrates the risks here quite cleanly).
It is very very obvious to me that median MATS participant quality has gone down continuously for the last few cohorts. I thought this was somewhat clear to y’all and you thought it was worth the tradeoff of having bigger cohorts, but you thinking it has “gone up continuously” shows a huge disconnect.
Like, these days at the end of a MATS program half of the people couldn’t really tell you why AI might be an existential risk at all. Their eyes glaze over when you try to talk about AI strategy. IDK, maybe these people are better ML researchers, but obviously they are worse contributors to the field than the people in the early cohorts.
Yeah, I mean, I do think I am a lot more pessimistic about all of these. If you want we can make a bet on how well things have played out with these in 5 years, deferring to some small panel of trusted third party people.
To date, I know of only two such RL dataset startups that spawned via AI safety
Agree. Making RL environments/datasets has only very recently become a highly profitable thing, so you shouldn’t expect much! I am happy to make bets that we will see many more in the next 1-2 years.
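I feel actively excited about 2 of these, quite negative about 1 of them, and confused/neutral about the others.

Can you share which?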
The MATS acceptance rate was 33% in Summer 2022 (the first program with open applications) and decreased to 4.3% (in terms of first-stage applicants; ~7% if you only count those who completed all stages) in Summer 2025. Similarly, our mentor acceptance rate decreased from 100% in Summer 2022 to 27% for the upcoming Winter 2026 Program.
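As a quick gloss on how those two acceptance-rate figures relate, here is a minimal sketch; the applicant totals below are hypothetical, chosen only to match the quoted percentages.

```python
# Hypothetical scale: only the 4.3% and ~7% rates come from the comment above.
first_stage_applicants = 2000           # assumed, purely for illustration
first_stage_rate = 0.043                # 4.3% of first-stage applicants accepted
completed_stage_rate = 0.07             # ~7% of those who completed all stages accepted

accepted = first_stage_rate * first_stage_applicants        # ~86 offers at this assumed scale
completed_all_stages = accepted / completed_stage_rate      # ~1,230 people finished every stage

# Together the two rates imply that roughly 4.3/7 ≈ 61% of first-stage
# applicants completed the full application process.
print(f"implied completion fraction: {completed_all_stages / first_stage_applicants:.0%}")
```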
I don’t have plots prepared, but measures of scholar technical ability (e.g., mentor ratings, placements, CodeSignal score) have consistently increased. I feel very confident that MATS is consistently improving in our ability to find, train, and place ML (and other) researchers in AI safety roles, predominantly as “Iterators”. Also, while the fraction of the cohort that displays a strong “Connector” disposition seems to have decreased over time, I think that the raw number of strong Connectors has generally increased with program size due to our research diversity metric in mentor selection. I would argue that the phenomenon you are witnessing is an increasing pivot from more theoretical to empirical AI safety mentors and research agendas.
Based on my personal experience, I think the claim “half of MATS couldn’t tell you why AI might be an existential risk” is incorrect. I can’t speak to how MATS scholars have engaged with you on AI strategy, but I would bet that the average MATS scholar today spends a lot more time on ML experiments than reading AI safety strategy docs compared to three years ago. To be clear, I think this is a good thing! I respect your disagreement here. MATS has tried to run AI safety strategy workshops and reading groups many times in the past, but this has generally had low engagement relative to our seminar series (which features some prominent AI safety strategists anyway). If you have great ideas for how to better structure strategy workshops or generate interest, I would love to hear! (We are currently brainstorming this.)
The MATS acceptance rate was 33% in Summer 2022 (the first program with open applications) and decreased to 4.3% (in terms of first-stage applicants; ~7% if you only count those who completed all stages) in Summer 2025. Similarly, our mentor acceptance rate decreased from 100% in Summer 2022 to 27% for the upcoming Winter 2026 Program.
I mean, in as much as one is worried about Goodhart’s law, and the issue in contention is adversarial selection, then the acceptance rate going down over time is kind of the premise of the conversation. Like, it would be evidence against my model of the situation if the acceptance rate had been going up (since that would imply MATS is facing less adversarial pressure over time).
I don’t have plots prepared, but measures of scholar technical ability (e.g., mentor ratings, placements, CodeSignal score) have consistently increased. I feel very confident that MATS is consistently improving in our ability to find, train, and place ML (and other) researchers in AI safety roles, predominantly as “Iterators”.
Mentor ratings are the most interesting category to me. As you can imagine, I don’t care much for ML skill at the margin. CodeSignal is a bit interesting, though I am not familiar enough with it to interpret it, but I might look into it.
I don’t know whether you have any plots of mentor ratings over time broken out by individual mentor. My best guess is the reason why mentor ratings are going up is because you have more mentors who are looking for basically just ML skill, and you have successfully found a way to connect people into ML roles.
This is of course where most of your incentive gradient was pointing to in the first place, as of course the entities that are just trying to hire ML researchers have the most resources, and you will get the most applicants for highly paid industry ML roles, which are currently among the most prestigious and most highly paid roles in the world (while of course being centrally responsible for the risk from AI that we are working on).
In regards to adversarial selection, we can compare MATS to SPAR. SPAR accepted ~300 applicants in their latest batch, ~3x MATS (it’s easier to scale if you’re remote, don’t offer stipends, and allow part-timers). I would bet that the average research impact of SPAR participants is significantly lower than that of MATS, though there might be plenty of confounders here. It might be worth doing a longitudinal study here comparing various training programs’ outcomes over time, including PIBBSS, ERA, etc.
I think your read of the situation re. mentor ratings is basically correct: increasingly many MATS mentors primarily care about research execution ability (generally ML), not AI safety strategy knowledge. I see this as a feature, not a bug, but I understand why you disagree. I think you are prioritizing a different skillset than most mentors that our mentor selection committee rates highly. Interestingly, most of the technical mentors that you rate highly seem to primarily care about object-level research ability and think that strategy/research taste can be learned on the job!
Note that I think the pendulum might start to swing back towards mentors valuing high-level AI safety strategy knowledge as the Iterator archetype is increasingly replaced/supplemented by AI. The Amplifier archetype seems increasingly in-demand as orgs scale, and we might see a surge in Connectors as AI agents improve to the point that their theoretical ideas are more testable. Also note that we might have different opinions on the optimal ratio of “visionaries” vs. “experimenters” in an emerging research field.
I would bet that the average research impact of SPAR participants is significantly lower than that of MATS
I mean, sure? I am not saying your selection is worse than useless and it would be better for you to literally accept all of them, that would clearly also be bad for MATS.
I think you are prioritizing a different skillset than most mentors that our mentor selection committee rates highly. Interestingly, most of the technical mentors that you rate highly seem to primarily care about object-level research ability and think that strategy/research taste can be learned on the job!
I mean, there are obvious coordination problems here. In as much as someone is modeling MATS as a hiring pipeline, and not necessarily the one most likely to produce executive-level talent, you will have huge amounts of pressure to produce line-worker talent. This doesn’t mean the ecosystem doesn’t need executive-level talent (indeed, this post is partially about how we need more), but of course large scaling organizations create more pressure for line-worker talent.
Two other issues with this paragraph:
Yes, I don’t think strategic judgement generally commutes. Most MATS mentors who I think are doing good research don’t necessarily themselves know what’s most important for the field.
I agree with the purported opinion that strategy/research taste can often be learned on the job. But I do feel very doomy about recruiting people who don’t seem to care deeply about x-risk. I would be kind of surprised if the mentors I am most excited about don’t have the same opinion, but it would be an interesting update if so!
Note that I think the pendulum might start to swing back towards mentors valuing high-level AI safety strategy knowledge as the Iterator archetype is increasingly replaced/supplemented by AI. The Amplifier archetype seems increasingly in-demand as orgs scale, and we might see a surge in Connectors as AI agents improve to the point that their theoretical ideas are more testable. Also note that we might have different opinions on the optimal ratio of “visionaries” vs. “experimenters” in an emerging research field.
I don’t particularly think these “archetypes” are real or track much of the important dimensions, so I am not really sure what you are saying here.
A few quick comments, on the same theme as but mostly unrelated to the exchange so far:
I’m not very sold on “cares about xrisk” as a key metric for technical researchers. I am more interested in people who want to very deeply understand how intelligence works (whether abstractly or in neural networks in particular). I think the former is sometimes a good proxy for the latter but it’s important not to conflate them. See this post for more.
Having said that, I don’t get much of a sense that many MATS scholars want to deeply understand how intelligence works. When I walked around the poster showcase at the most recent iteration of MATS, a large majority of the projects seemed like they’d prioritized pretty “shallow” investigations. Obviously it’s hard to complete deep scientific work in three months but at least on a quick skim I didn’t see many projects that seemed like they were even heading in that direction. (I’d cite Tom Ringstrom as one example of a MATS scholar who was trying to do deep and rigorous work, though I also think that his core assumptions are wrong.)
As one characterization of an alternative approach: my internship with Owain Evans back in 2017 consisted of me basically sitting around and thinking about AI safety for three months. I had some blog posts as output but nothing particularly legible. I think this helped nudge me towards thinking more deeply about AI safety subsequently (though it’s hard to assign specific credit).
There’s an incentive alignment problem where even if mentors want scholars to spend their time thinking carefully, the scholars’ careers will benefit most from legible projects. In my most recent MATS cohort I’ve selected for people who seem like they would be happy to just sit around and think for the whole time period without feeling much internal pressure to produce legible outputs. We’ll see how that goes.
Hmm, I was referring here to “who I would want to hire at Lightcone” (and similarly, who I expect other mentors would be interested in hiring for their orgs) where I do think I would want to hire people who are on board with that organizational mission.
At the field level, I think we probably still have some disagreement about how valuable people caring about the AI X-risk case is, but I feel a lot less strongly about it, and think I could end up pretty excited about a MATS-like program that is more oriented around doing ambitious understanding of the nature of intelligence.
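Sounds like PIBBSS/PrincInt!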
As an atypical applicant to MATS (no PhD, no coding/technical skills, not early career, new to AI), I found it incredibly difficult to find mentors who were looking to hold space for just thinking about intelligence. I’d have loved to apply to a stream that involved just thinking, writing, being challenged, and repeating until I had a thesis worth pursuing. To me, it seemed more like most mentors were looking to test very specific hypotheses, and maybe that’s for all the reasons you’ve stated above. But for someone new and inexperienced, I felt pretty unsure about applying at all.
The MATS acceptance rate was 33% in Summer 2022 (the first program with open applications) and decreased to 4.3% (in terms of first-stage applicants; ~7% if you only count those who completed all stages) in Summer 2025. Similarly, our mentor acceptance rate decreased from 100% in Summer 2022 to 27% for the upcoming Winter 2026 Program.
This is not counter-evidence to the accusation that scholar quality has been going downhill unless you add in several other assumptions.
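It’s not supposed to be counter-evidence in its own right. I like to present the full picture.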
“To be clear, I think this is a good thing! I respect your disagreement here. MATS has tried to run AI safety strategy workshops and reading groups many times in the past, but this has generally had low engagement relative to our seminar series”
I suspect that achieving high-engagement will be hard because fellows have to compete for extension funding.
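True, but we accepted 75% of all scholars into the 6-month extension last program, so the pressure might not be that large now.

What percentage applied?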
I might have a special view here since I did MATS 4.0 and 8.0.
I think I met some excellent people at MATS 8.0, but I would not say they are stronger than in 4.0; my guess is that quality went down slightly. I remember a few people in 4.0 who impressed me quite a lot, which I saw less of in 8.0. (4.0 had more very incompetent people, though.)
at the end of a MATS program half of the people couldn’t really tell you why AI might be an existential risk at all.
I think this is sadly somewhat true; I talked with some people in 8.0 who didn’t seem to have any particular concern about AI existential risk, or who seemingly had never really thought about it. However, I think most people were in fact very concerned about AI existential risk. I ran a poll at some point about Eliezer’s new book, and a significant minority of students seemed to have pre-ordered it, which I guess is a pretty good proxy for whether someone is seriously engaging with AI X-risk.
My guess is that the recruitment process might need to measure another variable beyond academics/coding/ML experience: the kind of thing that Tim Hua (an 8.0 scholar who created an AI psychosis bench) has.
Also, it seems to me that if you build an organization that tries to fight against the end of the world from AI, somebody should say that. It might put off some people, and perhaps that should happen early. Maybe the website should say: “AI could kill literally everyone, let’s try to do something!” And maybe the people who heard that MATS is good to have on their CV for applying to a PhD or a lab and eventually landing a high-paying job would be put off by that.
Perhaps there should also be a test where people don’t have internet access and have to answer some basic alignment questions, like: why could a system that we optimize with RL develop power-seeking drives? Why might training create weird, unpredictable preferences in an AI?
My guess at what’s happening here: for the first iterations of MATS (think MATS 2.0 at the Lightcone WeWork) you would have folks who were already into AI Safety for quite a long time and were interested in doing some form of internship-like thing for a summer. But as you run more cohorts (and make the cohorts bigger) then the density of people who have been interested in safety for a long time naturally decreases (because all the people who were interested in safety for years already applied to previous iterations).
(Derailing: what I am saying here is not central to the argument you are making.)
just end up with someone making a bunch of vaguely safety-adjacent RL environments that get sold to big labs
While I think building safety-adjacent RL envs is worse than most kinds of technical safety work for people who are very high context in AGI safety, I think it’s net positive.
I think you reduce P(doom) by doing prosaic AI safety well (you train AIs to behave nicely, you don’t squash away malign-looking CoT and you try not to have envs that create too much situational awareness, you do some black-box and maybe white-box auditing to probe for malign tendencies, you monitor for bad behavior in deployment, you try not to give too many affordances to AIs when it’s not too costly), especially if takeoffs are relatively slow, because it gives you more opportunities to catch early instances of scheming-related misalignment and more time to use mostly-aligned AIs to do safety research. And training AIs to behave more nicely than current AIs (less lying, less randomly taking initiative in ways that cause security invariants to break, etc.) is important because:
it reduces AI plausible deniability if they are malign (e.g., if AIs wrote insecure code for scheming-related reasons, I wouldn’t know even if I caught them writing insecure code, because current AIs do this all the time and are almost surely not doing it for scheming-related reasons; see the base-rate sketch after this list)
it lets you have more trigger-happy monitors (e.g. it’s currently hard to have monitors that catch “lies that matter” because current AIs lie all the time on things like whether they succeeded at their task)
(unsure) I suspect some safety-adjacent RL environments have non-trivial transfer to reducing real misalignment (before AIs are wildly superhuman)
(unclear sign) less lying/research-sycophancy probably differentially helps more with the fuzzy tasks that alignment research is often about (e.g., I don’t think sycophancy in research settings is that bad when you are doing capabilities research, but I suspect issues like this could make it unusable for safety research? Unclear)
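As a concrete illustration of the plausible-deniability point above, here is a minimal sketch; all priors and flag rates are made-up numbers, chosen only to show the direction of the effect.

```python
# Made-up numbers, purely to show how the base rate of benign bad behavior
# changes what catching an AI writing insecure code actually tells you.
def p_malign_given_flag(p_malign_prior, p_flag_if_malign, p_flag_if_benign):
    """Bayes rule: P(model is malign | monitor flagged insecure code)."""
    p_flag = (p_flag_if_malign * p_malign_prior
              + p_flag_if_benign * (1 - p_malign_prior))
    return p_flag_if_malign * p_malign_prior / p_flag

prior = 0.01  # hypothetical prior that a given model instance is scheming

# Today: benign models write insecure code all the time, so a flag is weak evidence.
print(p_malign_given_flag(prior, p_flag_if_malign=0.5, p_flag_if_benign=0.30))  # ~0.02

# After training away casual sloppiness and lying, the same flag is much stronger evidence.
print(p_malign_given_flag(prior, p_flag_if_malign=0.5, p_flag_if_benign=0.01))  # ~0.34
```

The same monitor flag becomes an order of magnitude more informative once benign models rarely trigger it, which is what makes more trigger-happy monitors viable.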
I think the main negative effect is making AGI companies look more competent and less insanely risky than they actually are, and avoiding some warning shots. I don’t know how I feel about this. Not helping AGI companies pick the low-hanging fruit that actually makes the situation a bit better, just so that they look more incompetent, does not seem like an amazing strategy to me if, like me, you believe there is a >50% chance that well-executed prosaic stuff is enough to get to a point where AIs more competent than us are aligned enough to do the safety work to align more powerful AIs. I suspect AGI companies will be PR-maxing and will build whichever RL environments make them look good, such that the safety-adjacent RL envs that OP subsidizes don’t help with PR that much, so I don’t think the PR effects will be very big. And if better safety RL envs would have prevented your warning shots, AI companies will be able to just say “oops, we’ll use more safety-adjacent RL envs next time, look at this science showing it would have solved it”, and I think it will look like a great argument. I think you will get fewer but more information-rich warning shots if you actually do the safety-adjacent RL envs. (And for the science, you can always do the thing where you train without the safety-adjacent RL envs and show that you might have gotten scary results; I know people working on such projects.)
And because it’s a baseline level of sanity that you need for prosaic hopes, this work might end up being done by people who have higher AGI safety context if it’s not done by people with less context. (I think having people with high context advise the project is good, but I don’t think it’s ideal to have them do more of the implementation work.)
While I think building safety-adjacent RL envs is worse than most kinds of technical safety work for people who are very high context in AGI safety, I think it’s net positive.
I think it’s a pretty high-variance activity! It’s not that I can’t imagine any kind of RL environment that might make things better, but most of them will just be used to make AIs “more helpful” and serve as generic training data to ascend the capabilities frontier.
Like, yes, there are some more interesting monitor-shaped RL environments, and I would actually be interested in digging into the details of how good or bad some of them would be, but the thing I am expecting here are more like “oh, we made a Wikipedia navigation environment, which reduces hallucinations in AI, which is totally helpful for safety I promise”, when really, I think that is just a straightforward capabilities push.
Like, yes, there are some more interesting monitor-shaped RL environments, and I would actually be interested in digging into the details of how good or bad some of them would be
As part of my startup exploration, I would like to discuss this as well. It would be helpful to clarify my thinking on whether there’s a shape of such a business that could be meaningfully positive. I’ve started reaching out to people who work in the labs to get better context on this. I think it would be good to dig deeper into Evan’s comment on the topic.
I’m going to start a Google Doc, but I would love to talk in person with folks in the Bay about this to ideate and refine it faster.
Aren’t the central examples of founders in AI Safety the people who founded Anthropic, OpenAI and arguably Deepmind?
This is consistent with founders being undervalued in AI safety relative to AI capabilities. My model of Elon, for instance, says that a big reason for his pivoting hard towards capabilities was that all the capabilities founders were receiving more status than the safety founders.
Kimi K2 is basically as aligned and as likely to be safe when scaled to superintelligence as whatever Anthropic is cooking up today.
Sorry, I know this is tangential, but I’m curious — is it based on it being less psychosis-inducing in this investigation or are there more data points / is it known to be otherwise more aligned as well?
What do you think are ways to identify good strategic takes? This is something that seems rather fuzzy to me. It’s not clear how people are judging criteria like this or what they think is needed to improve on this.