General advice for transitioning into Theoretical AI Safety

Cross-posted from the EA Forum.

During the past months I’ve privately talked with 20+ AI Safety researchers[1] about how to transition into Theoretical AI Safety. Here’s a distillation of their advice and opinions (both general consensus and controversial takes).

Why this post?

Some great posts already exist with general advice for transitioning into AI Safety. However, these and other introductions are mainly centered on technical Computer Science research and Machine Learning engineering. They don’t delve into how more theoretical aptitudes and backgrounds (such as careers in Mathematics, theoretical Computer Science or Philosophy) can be steered into the more abstract Alignment research that exploits them (except for Critch’s short post). I think that’s mainly because:

  1. Most people in the Alignment community work in applied research. That is to say, there are (way) more open positions for that kind of research. Still, most people I’ve talked to agree we need a non-trivial percentage of the community working on Theory (and we certainly need more people working on whichever subfield of Alignment they want to work on[2]).

  2. Applied career paths are more standardized, and thus easier to give advice on. As happens more broadly in academia and the job market, the more applied the work, the more obvious the job prospects. Theoretical careers tend to feel more like creating your own path. They often involve autonomous projects or agendas, individualized assessment of your learning and progress, and informally improving your epistemics.

This post tries to fill that gap by being a helpful first read for graduates and researchers from abstract disciplines who are interested in AI Alignment. I’d recommend using it as a complement to the other, more general introductions and advice. The following two sections are just a summary of general community knowledge. The advice sections do include some new insights and opinions which I haven’t seen comprehensively presented elsewhere.

Why Alignment research?

I presuppose familiarity with the basic arguments for AI Alignment research. This is a somewhat risky and volatile area in which to work for positive impact, given how little we understand the problem, so I recommend building a good inside view of this field’s theory of change (and our doubts about it) before committing hard to any path (and performing a ladder of tests to check your personal fit).

Of course, I do think careers in AI Safety have an expected positive impact absurdly larger than the median, equaled only by other EA cause areas. Furthermore, if you’re into an intellectual challenge, Alignment is one of the most exciting, foundational and mind-bending problems humanity is facing right now!

Why Theoretical Alignment research?

There are sound arguments for the importance of theoretical research, since its methods allow for more general results that might scale beyond current paradigms and capabilities (which is what mainly worries us). But we are not even sure such general solutions should exist, and carrying out this research faces some serious downsides, such as the scarcity of feedback loops.

Truth is, there’s no consensus as to whether applied or theoretical[3] research is more helpful for Alignment. It’s fairly safe to say we don’t yet know, and so need people working on all fronts. If you have especially good aptitudes for abstract thinking, mathematics, epistemics and leading research agendas, theoretical research might be your best chance for impact. That is, given uncertainty about each approach’s impact, you should mainly maximize personal fit.

That said, I’d again encourage developing a good inside view to judge for yourself whether this kind of research can be useful. And of course, trying out some theoretical work early on doesn’t lock you out of applied research.

In theoretical research I’m including both:

  1. Mathematical research (sometimes referred to as “Theoretical research”): Using mathematics to pin down how complex systems will behave. This usually involves Decision Theory, Game Theory, Statistics, Logic or the construction of abstract structures. Its goal is mostly deconfusion about fundamental concepts. Prime examples are Garrabrant’s Logical Induction or Vanessa Kosoy’s work.

  2. Conceptual research: Using epistemics, philosophy and everything else you can think of to devise approaches to the Alignment problem. This usually involves hypothesizing, forecasting, consequentialist thinking, building plans and poking holes in them. Its goal is mostly coming up with a global big-picture strategy (with many moving parts and underspecified details) that might totally or partially ensure Alignment. Prime examples are Christiano’s ELK or Critch’s ARCHES.

This research is mainly carried out in private organizations, in academia or as funded independent research. Any approach will need, or at least heavily benefit from, a basic understanding of the current paradigm for constructing these systems (Machine Learning theory), although some approaches abstract away some of its details.

As a preliminary piece of advice, the first step in your ladder of tests can be simply reading through ELK and thinking about it for a non-trivial number of hours, trying to come up with strategies or poke holes in others’. If you enjoy that, then you’d probably not be miserable partaking in Theoretical AI Safety. If, furthermore, you have or can acquire the relevant theoretical background, and your aptitudes are well suited to it, then you’re probably a good fit. Many more factors can be weighed to assess your fit: how comfortable you’d be tackling a problem about which everyone is mostly confused, how good you are at autonomous study/self-directed research, whether you’d enjoy moving to another country, how anxious deadlines or academia make you feel...

The following advice presupposes that you have already reached the (working) conclusion that this field can be impactful enough and that you are a good fit for it.

Consensus advice

These points were endorsed by almost everyone I talked to.

Skilling up

  • Familiarize yourself with research agendas. Read through them, try to understand why they’re trending or why they failed, what they predict and advocate. This will provide a very necessary familiarity with the technical concepts and terms floating around the community, and help you elucidate which core difficulties of Alignment need tackling. This is somewhat equivalent to an applied researcher reading lots of papers.

  • You’ll need to learn research skills. Beyond the specific knowledge some subfield of Alignment requires, you’ll need the general skills for doing theoretical research. These include independence, resourcefulness, research taste… Most of them can hardly be learnt on your own, so some form of mentoring feedback is almost required. These skills are usually obtained in a PhD (even one unrelated to Alignment), but there are certainly other (and, according to some researchers, better) ways to get them, such as participating in a mentored Alignment program, convincing an Alignment researcher to mentor you individually, or working in a private organization.

  • We need people thinking about The Whole Problem™. Although some theoretical research tackles concrete, helpfully insulated problems, most researchers agree that one thing we’re certain to need is a complete and coherent strategy/realization/development that brings us significantly closer to Alignment. This will most naturally be achieved by people with an accurate model of the whole problem in their head. This might not be the task for you, but if you excel at building big pictures, if you feel more like a bird than a frog, you should probably try to slowly construct such a broad model. Even if you don’t feel like these are your aptitudes, developing a whole (even if coarse) picture of the field will prove vital for directing your research in the most promising direction.

  • You can probably learn basic Machine Learning theory on the go, while developing mathematical skills usually requires more focused attention. Put another way: people without a mathematical background often struggle with prerequisites for research such as linear algebra, while mathematicians rarely have trouble understanding ML. That said, acquiring the knowledge is not enough: you also need to be able to prove that you have it (for hireability). Your theoretical research posts/articles will probably already demonstrate (and build upon) that understanding, solving the problem. Otherwise you could distill research for which that understanding is necessary, write something (or solve exercises) about ML, or even post some code on GitHub (more common among applied researchers).

  • You probably don’t need the overly complex, deepest ends of graduate mathematics to do Alignment research. Consider the following example. MIRI’s early approaches revolved strongly around some aspects/results of Mathematical Logic. Still, Garrabrant hadn’t completed a PhD in Logic, nor was he a world-class logician. He only made use of relatively standard concepts and procedures of the field, probably closer to the level of a recent mathematics graduate than of a recent PhD. In general, even if you need to learn the basics of a discipline to work in a certain research direction, this learning will at some point yield diminishing returns, and fairly soon your time will be better spent applying it directly to Alignment. That said, advanced study will certainly help you skill up intellectually (though it might still not be the best way to do that, or the best use of your time).

  • Being in the right intellectual environment is important. Being close to the right peers and potential collaborators will prove immensely positive both for learning and for strategic advantages, especially for research as open-ended as this. Researchers I’ve talked to disagree about how much weight you should put on this point when deciding your next career step. Some go to the extreme of recommending moving to Berkeley if possible; others believe you only need an exciting intellectual environment of any kind (even one unrelated to Alignment), as can be found for instance in academia, in which to find some peer with whom to discuss Alignment. In any event, the average researcher probably gives more weight to this point than we usually do.

Building your career

  • Talk to people! You can reach out (to both students and researchers) to ask for advice on your study agenda, research interests or career decisions. Many EAs (for instance, myself!) will be glad to have an informal chat with you and share their model of the field/community. That said, don’t contact popular/busy researchers for general advice that almost anyone in the community could provide. If you do think some researcher is especially well suited to answer a specific question or provide an opinion, make it fast and easy for them to read and respond (they’re more likely to answer that way!)[4].

  • Try getting into an EA-aligned team/organization. This is the conventional path to a paid position. You start by working temporarily at some of these organizations, or in an academic group under the supervision of an Alignment researcher, as part of an internship, training program or research stay. Keep building upon these experiences and contacts, partake in more, and eventually get hired by one of these places. Disclaimer: many non-EA AI researchers across the globe study topics seemingly resonant with the Alignment problem (regarding, for instance, value alignment and normativity). These researchers are usually not explicitly worried about AGI for existential reasons, so the best way to ensure your work is helpful to the problem is to participate in a research environment that clearly shares EA’s worries.

  • And for that, apply everywhere! When thinking about career plans, I considered several frameworks: best- and worst-case outcomes for a path, having a preferred and a backup plan, etc. Another useful framework is trying to get your foot in the door. That is, while continuing with your plan, try to land one of the starting positions mentioned above. Opportunities like these abound (just keep an eye on the Alignment Forum), although many are competitive (so you’ll probably have to learn to deal with some rejections). Still, applying is certainly worth the cost (and benefits both you and the community).

  • You can try to build research collaborations out of thin air. This is more independent and less conventional, but still a great option (especially as a complement to other plans). Read, understand and think about a researcher’s agenda, briefly explain to them your conceptualization of it and how you think you can help, and they might be willing to mentor you for an hour a week. With the endorsement of a reputable researcher, you can ask for funding for some months of independent research.

  • Apply for funding. Many sources exist, and they are open to funding many kinds of enterprises. For instance, it’s common for EAs taking an ML master’s (with the plan of later pursuing applied research) to get funded. Since no clear study program is required or clearly optimal for theoretical Alignment research, it’s more common to get funding for your independent research (either to dedicate yourself solely to it, or to buy out the teaching requirements of a PhD or postdoc and have more free time for research). Again, apply everywhere it seems sound, and be ready to deal with some rejections.

  • Post your studies/reflections/research. Writing things up will help you pin down your ideas and skill up on communication. Feedback will speed up your progress. Uploaded content will serve as a portfolio for applications. You can work on distillations, assessments of other agendas, mathematical results building on other theories, original reflections… And when in doubt about an idea, err on the side of publishing your thoughts. The sooner you start producing publicly, the sooner your production can become good.

  • Keep a career plan in mind. Strategizing about your future career is probably not as fun as thinking about Alignment, but assessing your options, introspecting about what you want, and drawing a path towards your dream position will increase your chances of getting there; and this can multiply your positive impact! 80k provides many valuable resources, including free coaching (as does AISS). I’ve found their article on short-term decision making especially useful.

Controversial advice

These points prompted radically different opinions amongst the researchers I’ve talked to, and some reflect ongoing debates in the community.

  • How valuable is academia? Academia comes with its bureaucratic baggage, and furthermore some researchers think its structure is considerably suboptimal for research. You’re incentivized not to pursue the most promising approaches if they’re too far afield, and if you’re unlucky you might be bound to teaching requirements you don’t find impactful. If these constraints seem too annoying, you might not be comfortable, or as impactful, in academia. Other researchers downplay these concerns, or stress the political need to maintain ties with academia, for its soft power or for the influx of Alignment researchers it can yield.

  • Do your studies need to be directly related to Alignment? PhDs whose topic is explicitly relevant to Alignment (as at CHAI) are still relatively rare. Some researchers (especially the most applied) consider getting into a good Computer Science or Machine Learning PhD a top priority (regardless of its topic). Others (especially the most theoretical) put less weight on the specific program of your higher studies, and more on skilling up intellectually and having free time to think about Alignment. As a consequence, they consider PhDs with lots of freedom in Maths or relevant philosophical areas (like Decision Theory) a good option. This is probably in part because they think those kinds of refined abstract reasoning are very beneficial to Alignment. This doesn’t mean they endorse a PhD as the best option: some of them still maintain that whatever career option gives you the most time to think about Alignment is best.

  • Study other researchers’ world models deeply, or look at the issue with fresh eyes? Some researchers recommend starting by putting on a senior researcher’s worldview like a pair of glasses and working from there. This readily provides some tools that would take a long time to develop from scratch, and quickly teaches some general skills. According to them, the downside of accepting a theory on faith is small, since critical thinking is encouraged and the junior will eventually form their own views. Other researchers (especially those more pessimistic about current work) instead recommend looking at the core issues from a distance and coming up with your own ideas and intuitions, in the hope that one of these unadulterated juniors will come up with a drastically different, more promising approach (that is, instead of optimizing for median output, they’re searching for a game-changing stroke of luck). I think there’s again an argument for people partaking in both approaches, according to personal fit.

  • Independent (non-academic) research might be great for you, but when should you embark on it? Independent research has some clear benefits, if you have the right traits for it. Some researchers recommend simply maximizing the amount of time you think about Alignment, and so encourage trying out independent research even fresh out of undergrad. Others worry that this might be too soon: most independent researchers have some experience in academic research (in other areas), and allegedly that’s why they perform well in self-directed research. You’d also be competing for funding with students who have spent some years in a PhD. If your financial situation is comfortable, independent research could be a very good use of your time (provided you can obtain the necessary mentorship).

  • What makes you hireable? The factors increasing your hireability are pretty obvious: fruitful writings/​collaborations/​internships on the topic of Alignment, relevant theoretical research experience, excelling in abstract thinking or mathematical production, etc. But different institutions/​individuals give different weights to these. Some are satisfied by a history of non-trivial mathematical results, even if completely unrelated to Alignment. Others require some non-trivial understanding of current trends in Alignment, or experience dealing with the topic. In any event, the discrepancies don’t seem big.

  1. ^

    If you’re one of those researchers and would like nominal recognition in this post, let me know. I’ve taken the approach of default anonymity given the informal and private nature of some of the discussions.

  2. ^

    “AI Safety Technical research” usually refers to any scientific research trying to ensure the default non-harmfulness of AI deployment, as opposed to the political research and action in AI Governance. It thus includes the Theoretical/​Conceptual research I talk about in this post despite the applied connotations of the word “Technical”.

  3. ^

    The division between applied and theoretical is not binary but gradual, and even theoretical researchers usually build upon some empirical data.

  4. ^

    I’ve been positively surprised by how many popular researchers have been kind enough to answer some of my specific questions.