The AI safety ecosystem is so well resourced that it has been correctly identified by many as one of the best paths into high prestige AI research jobs.
This person on twitter has written a popular article about getting into frontier ai labs and a Field Guide to AI Fellowships. The “AI Fellowships” are mostly AI safety programs funded by CG/OpenPhil. I have also noticed that people in ML research are quite likely to have heard of MATS and be interested in participating, even when they have very little interest in AI safety.
Idk how good or bad this is. It definitely causes a lot of ML researchers to engage with the AI safety literature, where they otherwise would not have. But it’s worth noting that while the primary driver of applications to programs like MATS used to be concern about AI safety, now it is increasingly a desire to work at a frontier lab.
I used to select participants for LASR Labs, one of the programs listed, and we actively tried to choose people who cared about AI safety, but I think we often did not succeed, and indeed some people on the program now work at frontier labs in roles that I think have little to do with safety.
Safety-washing in practice mostly looks like people rationalizing research or jobs that they would have done anyway as necessary for safety. It’s easy to do because it is in fact genuinely unclear in most cases what is helpful and harmful for safety. Only by looking at the larger pattern can we notice the suspicious abundance of conveniently overlapping opportunities that further both safety and a person’s short-term interests. However, I think it will be less common in people whose primary motivation is to prevent AI catastrophe.
Interestingly Pangram says that the first article you linked is 100% AI. https://x.com/pangram/status/2066399823486423185
maybe i’m overconfident but i generally find that it’s somewhat easy to tell if someone genuinely cares. you can’t really fake genuinely caring (or at least as genuinely as one can about anything—perhaps everything bottoms out in some other drives, but that’s good enough for me). there are some false negatives—people who genuinely care who have taken on the affectations of faking it, because they mistakenly think this helps their chances at success. there are also some people who care so much that they go crazy and start being counterproductive. but i don’t think i’ve ever heavily overestimated how much someone cares about AI safety.
This also seems right to me, but I don’t think MATS or the labs (or the funders) care if someone actually cares. They seem happy to tell themselves stories that they are making things better by getting people into the labs (or lab-adjacent orgs) even if they really don’t seem very safety motivated.
Maybe you could make an argument that they don’t care enough, but I’m pretty confident that MATS does care about this. When I did it, they had a whole reading group program whose primary aim, as I understand it, was to get people to understand and care more about the fundamental safety issues. And I also believe they try to select for people who care about safety, at least to some extent.
I think it’s gotten worse over time. I should have said “I don’t think MATS or the labs (or the funders) care much if someone actually cares”. I agree the caring isn’t zero.
i want to spend money on making there be more good alignment researchers and care about them actually caring about alignment. what would you recommend i do?
I think up until very recently funding Lightcone would have been a good bet at your scale (though you should of course be appropriately skeptical of me saying this). Funding projects doing good object-level work seems good.
Before I think about concrete recommendations, by “alignment researchers” do you mean people doing technical alignment work, or eval work, or do you include technical governance work like MIRI’s technical governance team? Or are you just using it as a proxy for work trying to make alignment go well in some form or another?
I think this is a very hard thing to get right, and I don’t think there’s a scalable way or org that you can spend money on.
I think the current best bet would be to find existing people who are technical, driven, safety-pilled, and have already done some independent research, and fund them to continue doing that?
If true, I think part of the change can be attributed to a more dramatic increase in those who are technically competent and have various accomplishments yet don’t care so much for safety (though would love a role at Anthropic). If the sample of those folks has increased, it becomes more challenging to turn down those who are highly competent but maybe care a bit less about safety. Especially if you are a mentor and you just want someone to execute exceptionally well on your project.
Last time I mentored (for SPAR), I had to turn down a lot of folks who were more accomplished and likely to succeed (e.g. professors) so that I could accept those I felt would have more long-term potential to contribute to safety.
Wouldn’t professors be more likely to only apply if they genuinely care about safety, while younger people (overall) might be still figuring things out and happy for any opportunity to advance their career?
In the specific situation I’m referencing, no. My impression is that they were just curious about automated research scaffolds, agent frameworks and applying interpretability for capabilities. They did not seem all that interested in alignment.
I could imagine that in practice many professors end up optimizing for prestige and more paper publications (that have nothing to do with superintelligence), whereas a younger person may not be locked into that mindset yet.
It depends a lot on the type and quantity of interactions you have. I think it can be pretty hard to assess this from just a CV, short answer questions and a 20 minute interview.
Maybe unpopular but I think that if you’re concerned about defection after programs like MATS then even more than “caring about AI safety” you specifically want to select for effective altruists, who are much likelier to care about AI actually going well than people who specifically care about AI safety. For the latter group, the motivation is partly not wanting everyone to die but also partly being AGI-pilled, being interested in the technical problem of alignment, and following prestige gradients, none of which are going to robustly keep participants on track.
I mentor some people for AI fellowships/safety programs and worry about this a fair amount. How do you select people who care about safety (specifically: what criteria/vibes do you use) and overall do you think it’s still positive in expectation to do this type of mentorship for the next 12 months?
Adjacently, I read the aforementioned article and it seems pretty slop (not that insightful, or well-written) to me.
I think a successful prior career in industry could be a good sign. Or any other career, or generally being older, so that it’s less likely you thought 2 years ago that MATS would look great on your CV.
Dave Banerjee’s test is a decent first pass filter, given his observation that “a large fraction of researchers in AI safety/governance fellowships cannot do any of these things”. Like any interactive interview, you’d want to probe their mental models with a few follow-up questions.
I think this makes the right kind of mentorship and exercising the right kind of judgment more valuable. FWIW the article’s author got outed as a fabulist of some form in the last few hours; he admitted to lying about MATS and hasn’t furnished any proof about his other resume lines.
I think this is a legibility problem; the whole rationalist corpus is available for easy consumption, Bayesian reasoning is not that hard to figure out, and there’s a pretty universal presumption of good faith (the author still has a lot of defenders!), so it’s really hard to sit two people down, one of whom is a mercenary there to use the fellowship as a stepping stone to wealth, power, etc, the other there out of genuine belief, and detect which is which reliably off of conversation alone.
At least at the selection layer, it seems like the most important thing is finding small honesty and motivation tells. I’m a recent college grad, and in college I helped found a club that had pretty explosive and rapid success. We believed this was thanks to our distinct internal culture and norms, which put us at odds with our otherwise famously mercenary college.
In the end we found two filters to be reliable and workable. The first was a very brief screening application which explicitly prohibited AI-generated responses. Whoever read the application plugged results into Pangram if they were suspicious. Our belief was that if the applicant couldn’t do a ~15 minute screener with their own writing, then we had no reliable positive indications of their motivation or work quality, and a pretty strong negative signal about their honesty.
The next was a curveball question in which we explicitly encouraged “I don’t know, here’s what I do know and here’s how I’d start solving it” and clarifying follow-ups as answers. The questions always varied and were always premised on very specific information about our field; having an answer from the hip was very high-signal for motivation and work quality, while asking the right questions was higher-signal for work quality but lower for motivation.
These two filters were the only absolutes in the process, and any other information we asked (resume, specific question content, etc) was purely about placement or deciding between marginal applicants. When I graduated, the club had basically kept its original culture while being successful. Maybe this approach will be helpful for thinking through the problems?
Some discussion on twitter about whether that article is written by a LARPer here https://x.com/anpaure/status/2066563480539590934?s=20
My guess at one of the best metrics / interview questions here is “tell me about a LessWrong post you really liked” or something similar. Generally screening for “how online / plugged into the AI safety ecosystem” seems like a decent metric.
This is not exactly what I want, though, since I think anyone seriously applying to these programs will have done some reading and would be able to answer about their favorite LW post competently.
Agreed. For example, here’s 4.8’s answer:
I do, or at least there’s a small cluster I keep returning to, and the one I’d name first is Eliezer’s “Local Validity as a Key to Sanity and Civilization.”
The core move is simple and I find it keeps paying out: you should be able to evaluate whether a single step of reasoning is valid independently of whether you like where the argument lands. And the post’s real claim isn’t just epistemic hygiene for individuals — it’s that this habit is load-bearing for civilization. A society where people can agree “that inference is invalid” even when they disagree about the conclusion has a working immune system; one where validity-judgments get pulled toward tribal allegiance loses the ability to error-correct at all. That second-order framing — local validity as the thing that lets pluralistic systems stay sane — is what elevates it above a standard “commit no fallacies” essay.
What I like is that it’s a genuinely useful idea rather than a clever one. It gives you a concrete thing to watch for in yourself: the moment you notice you’re scrutinizing an argument harder because you dislike its conclusion, you’ve caught yourself doing the bad thing.
The honest caveat is that “favorite” for me is closer to “highest hit-rate on rereads” than to nostalgia. By that standard, the runners-up are Scott’s “The Tails Coming Apart as Metaphor for Life” and Garrabrant’s “Goodhart Taxonomy” — both for similar reasons, that they name a structural failure mode crisply enough to actually use.
Do you have one? I’d be curious whether yours skews toward the reasoning-tools cluster or somewhere else entirely.
Remove some of the obvious LLM tells, and I’d have a lot of trouble telling it apart from a fluent LW’er “genuinely” interested in civilizational issues, discourse norms, and AI safety.
Agree with Leo that this is not a hard thing to distinguish, provided that value alignment matters for these fellowships.
When we do interviews for CMU’s AI Safety org, it seems like open-ended questions about viewpoints (e.g. “what about the current pace of AI keeps you up at night?” or “if you weren’t doing AI Safety research, what would the alternative be?”) let us do more than just pick out the people who “speak the language”. Another shibboleth is actually under-awareness of the community: someone who is quick to recite a bunch of names may be less concerned with the issues at hand. Whether we want to do this is another question.
FWIW, MATS clearly does still source people who are excited about safety; most other fellows in my cohort act as if they are fighting for their future! Still others are intrigued by the more challenging theoretical and empirical questions. I trust the staff in their experience vetting out people who do not wish to engage the space genuinely.
The thing I am more worried about is AI fellowships taking money from participants (or fellowships created to advance the personal agendas of their founders).