What I would do if I wasn’t at ARC Evals
In which: I list 9 projects that I would work on if I wasn’t busy working on safety standards at ARC Evals, and explain why they might be good to work on.
Epistemic status: I’m prioritizing getting this out fast as opposed to writing it carefully. I’ve thought for at least a few hours and talked to a few people I trust about each of the following projects, but I haven’t done that much digging into each of these, and it’s likely that I’m wrong about many material facts. I also make little claim to the novelty of the projects. I’d recommend looking into these yourself before committing to doing them. (Total time spent writing or editing this post: ~8 hours.)
Standard disclaimer: I’m writing this in my own capacity. The views expressed are my own, and should not be taken to represent the views of ARC/FAR/LTFF/Lightspeed or any other org or program I’m involved with.
Thanks to Ajeya Cotra, Caleb Parikh, Chris Painter, Daniel Filan, Rachel Freedman, Rohin Shah, Thomas Kwa, and others for comments and feedback.
I’m currently working as a researcher on the Alignment Research Center Evaluations Team (ARC Evals), where I’m working on lab safety standards. I’m reasonably sure that this is one of the most useful things I could be doing with my life.
Unfortunately, there are a lot of problems to solve in the world, and lots of balls are being dropped that I don’t have time to pick up because of my day job. Here’s an unsorted and incomplete list of projects that I would consider doing if I wasn’t at ARC Evals:
Ambitious mechanistic interpretability.
Getting people to write papers/writing papers myself.
Creating concrete projects and research agendas.
Working on OP’s funding bottleneck.
Working on everyone else’s funding bottleneck.
Running the Long-Term Future Fund.
Onboarding senior(-ish) academics and research engineers.
Extending the young-EA mentorship pipeline.
Writing blog posts/giving takes.
I’ve categorized these projects into three broad categories and will discuss each in turn below. For each project, I’ll also list who I think should work on them, as well as some of my key uncertainties. Note that this document isn’t really written for myself to decide between projects, but instead as a list of some promising projects for someone with a similar skillset to me. As such, there’s not much discussion of personal fit.
If you’re interested in working on any of the projects, please reach out or post in the comments below!
Relevant beliefs I have
Before jumping into the projects I think people should work on, I think it’s worth outlining some of my core beliefs that inform my thinking and project selection:
Importance of A(G)I safety: I think A(G)I Safety is one of the most important problems to work on, and all the projects below are thus aimed at AI Safety.
Value beyond technical research: Technical AI Safety (AIS) research is crucial, but other types of work are valuable as well. Efforts aimed at improving AI governance, grantmaking, and community building are important and we should give more credit to those doing good work in those areas.
High discount rate for current EA/AIS funding: There are several reasons for this. First, EA/AIS funders are currently in a unique position due to a surge in AI Safety interest without a proportional increase in funding. I expect this dynamic to change and our influence to wane as additional funders and governments enter this space. Second, efforts today are important for paving the way for future efforts. Third, my timelines are relatively short, which increases the importance of current funding.
Building a robust EA/AIS ecosystem: The EA/AIS ecosystem should be more prepared for unpredictable shifts (such as the FTX implosion last year). I think it’s important to robustify parts of the ecosystem, for example by seeding new organizations, building more legible credentials, doing more broad (as opposed to targeted) outreach, and creating new, independent grantmakers.
The importance of career stability and security: A lack of career stability hinders the ability and willingness of people (especially junior researchers) to prioritize impactful work over safer, risk-averse options. Similarly, cliffs in the recruitment pipeline due to a lack of funding or mentorship discourage pursuing ambitious new research directions over joining an existing lab. Personally, I’ve often worried about my future job prospects and position inside the community when considering what career options to pursue, and I’m pretty sure these considerations weigh much more heavily on more junior community members.
Technical AI Safety Research
My guess is this is the most likely path I’d take if I were to leave ARC Evals. I enjoy technical research and have had a decent amount of success doing it in the last year and a half. I also still think it’s one of the best things you can do if you have strong takes on what research is important and the requisite technical skills.
Caveat: Note that if I were to do technical AI safety research again, I would probably spend at least two weeks figuring out what research I thought was most worth doing, so this list is necessarily very incomplete. There’s also a decent chance I would choose to do technical research at one of OpenAI, Anthropic, or Google DeepMind, where my research projects would also be affected by management and team priorities.
Ambitious mechanistic interpretability
One of the hopes with mechanistic (bottom-up) interpretability is that it might succeed ambitiously: that is, we’re able to start from low-level components and build up to an understanding of most of what the most capable models are doing. Ambitious mechanistic interpretability would clearly be very helpful for many parts of the AIS problem, and I think that there’s a decent chance that we might achieve it. I would try to work on some of the obvious blockers for achieving this goal.
Here are some of the possible broad research directions I might explore in this area:
Defining a language for explanations and interpretations. Existing explanations are specified and evaluated in pretty ad-hoc ways. We should try to come up with a language that actually captures what we want here. Both Geiger and Wu’s causal abstractions and our Causal Scrubbing paper have answers to this, but both are unsatisfactory for several reasons.
Metrics for measuring the quality of explanations. How do we judge how good an explanation is? So far, most of the metrics focus on extensional equality (that is, how well the circuit matches the model’s input-output behavior), but there are many desiderata besides that. Does percent loss recovered (or other input-output only criteria) suffice for recovering good explanations? If not, can we construct examples where it fails?
Finding the correct units of analysis for neural networks. It’s not clear what the correct low-level units of analysis are inside of neural networks. For example, should we try to understand individual neurons, clusters of neurons, or linear combinations of neurons? It seems pretty important to figure this out in order to, e.g., automate mechanistic interpretability.
Pushing the Pareto frontier on quality <> realism of explanations. A lot of manual mechanistic interpretability work focuses primarily on scaling explanations to larger models, as opposed to more complex tasks or more comprehensive explanations, which I think are more important. For ambitious mechanistic interpretability to work out, we need to understand the behavior of networks to a really high degree, rather than, e.g., the ~50% loss recovered we see when applying Causal Scrubbing to the explanation from the Indirect Object Identification paper. At the same time, existing mech interp work continues to focus primarily on simple algorithmic tasks, which seems to miss most of the interesting behavior of neural networks.
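To make the "percent loss recovered" metric mentioned above a bit more concrete, here is a minimal sketch. It assumes one common convention: the loss of the model restricted to the explanation is normalized between a baseline loss (for example, from ablating or randomizing the relevant components) and the loss of the unmodified model. The function name, argument names, and example numbers are hypothetical and are not taken from any of the papers mentioned here.

```python
def percent_loss_recovered(loss_model: float,
                           loss_explained: float,
                           loss_baseline: float) -> float:
    """Percent of the model's performance that an explanation accounts for.

    Assumed convention (not the only possible one):
      - loss_model:     loss of the original, unmodified model
      - loss_explained: loss when only the hypothesized mechanism is kept
      - loss_baseline:  loss of a baseline that knows nothing about the task
    100 means the explanation matches the model's loss; 0 means it does no
    better than the baseline.
    """
    gap = loss_baseline - loss_model
    if gap == 0:
        raise ValueError("baseline and model loss are equal; metric is undefined")
    return 100.0 * (loss_baseline - loss_explained) / gap


# Hypothetical numbers: the explanation closes half of the gap.
print(percent_loss_recovered(loss_model=1.0, loss_explained=2.0, loss_baseline=3.0))
```

Note that this number only depends on losses, i.e. on input-output behavior. Two explanations with very different internal structure can score identically, which is the concern raised in the metrics bullet above.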
How you can work on it: Write up a research agenda and do a project with a few collaborators, and then start scaling up from there. Also, consider applying for the OpenAI or Anthropic interpretability teams.
Core uncertainties: Is the goal of ambitious mechanistic interpretability even possible? Are there other approaches to interpretability or model psychology that are more promising?
Late stage project management and paper writing
I think that a lot of good AIS work gets lost or forgotten due to a lack of clear communication. Empirically, I think a lot of the value I provided in the last year and a half has been by helping projects get out the door and into a proper paper-shaped form. I’ve done this to various extents for the modular arithmetic grokking paper, the follow-up work on universality, the causal scrubbing posts, the ARC Evals report, etc. (This is also a lot of what I’m doing at ARC Evals nowadays.)
I’m not sure exactly how valuable this is relative to just doing more technical research, but it does seem like there are many, many ideas in the community that would benefit from a clean writeup. While I do go around telling people that they should write up more things, I think I could also just be the person writing these things up.
How you can work on it: find an interesting mid-stage project with promising preliminary results and turn it into a well-written paper. This probably requires some amount of prior paper-writing experience, e.g. from academia.
Core uncertainties: How likely is this problem to resolve itself, as the community matures and researchers get more practice with write-ups? How much value is there in actually doing the writing, and does it have to funge against technical AIS research?
Creating concrete projects and research agendas
Both concrete projects and research agendas are very helpful for onboarding new researchers (both junior and senior) and for helping to fund more relevant research from academia. I claim that one of the key reasons mechanistic interpretability has become so popular is an abundance of concrete project ideas and intro material from Neel Nanda, Callum McDougall, and others. Unfortunately, the same cannot really be said for many other subfields; there isn’t really a list of concrete project ideas for, say, capability evals or deceptive alignment research.
I’d probably start by doing this for either empirical ELK/generalization research or high-stakes reliability/relaxed adversarial training research, while also doing research in the area in question.
I will caveat that I think many newcomers write these research agendas with insufficient familiarity with the subject matter. I’m reluctant to encourage more people without substantial research experience to try to do this; my guess is the minimal experience is somewhere around one conference paper–level project and an academic review paper of a related area.
How you can work on it: Write a list of concrete projects or a research agenda in a subarea of AI safety you’re familiar with. As discussed before, I wouldn’t recommend attempting this without significant familiarity with the area in question.
Core uncertainties: Which research agendas are actually good and worth onboarding new people onto? How much can you actually contribute to creating new projects or writing research agendas in a particular area without being one of the best researchers in that area?
Grantmaking
I think there are significant bottlenecks in the EA-based AI Safety (AIS) funding ecosystem, and that they could be addressed with a significant but not impossible amount of effort. Currently, the Open Philanthropy project (OP) gives out ~$100-150m/year to longtermist causes (maybe around $50m to technical safety?), which seems pretty small given its endowment of maybe ~$10b. On the other hand, there just isn’t much OP-independent funding here; SFF gives out maybe ~$20m/year, LTFF gives out ~$5-10m/year (and is currently having a bit of a funding crunch), and Manifund is quite new (though it still has ~$1.9m according to its website).
Caveat: I’m not sure who exactly should work in this area. It seems overdetermined to me that we should have more technical people involved, but a lot of the important things to do to improve grantmaking are not technical work and do not necessitate technical expertise.
Working on Open Philanthropy’s Funding Bottlenecks
(Note that I do not have an offer from OP to work with them; this is more something that I think is important and worth doing as opposed to something I can definitely do.)
I think that OP is giving way less money to AI Safety than it should be under reasonable assumptions. For example, funding for AI Safety probably comes with a significant discount rate, as it’s widely believed that we’ll see an influx of funding from new philanthropists or from governments, and it also seems plausible that our influence will wane as governments get involved.
My impression is that this is mainly due to grantmaker capacity constraints; for example, Ajeya Cotra is currently the only evaluator for technical AIS grants. This can be alleviated in several ways:
Most importantly, working at OP on one of the teams that does AIS grantmaking.
Helping OP design and run more scalable grantmaking programs that don’t significantly compromise on quality. This probably requires working with them for a few months; just creating the RFP doesn’t really address the core bottleneck.
Creating good scalable alignment projects that can reliably absorb lots of funding.
How you can work on it: Apply to work for Open Phil. Write RFPs for Open Phil and help evaluate proposals. More ambitiously, create a scalable, low-downside alignment project that could reliably absorb significant amounts of funding.
Core uncertainties: To what extent is OP actually capacity constrained, as opposed to pursuing a strategy that favors saving funding for the future? How much of OP’s decision comes down to different beliefs about e.g. takeoff speeds? How good is broader vs more targeted, careful grantmaking?
Working on the other EA funders’ funding bottlenecks
Unlike OP, which is primarily capacity constrained, the remainder of the EA funders are funding constrained. For example, LTFF currently has a serious funding crunch. In addition, it seems pretty bad for the health of the ecosystem if OP funds the vast majority of all AIS research. It would be significantly healthier if there were counterbalancing sources of funding.
Here are some ways to address this problem: First and foremost, if you have very high earning potential, you could earn to give. Second, you can try to convince an adjacent funder to significantly increase their contributions to the AIS ecosystem. For example, Schmidt Futures has historically given significant amounts of money to AI Safety/Safety-adjacent academics; it seems plausible that working on their capacity constraints could allow them to give more to AIS in general. Finally, you could fundraise for LTFF or Manifund, or start your own fund and fundraise for that.
How you can work on it: Convince an adjacent grantmaker to move into AIS. Fundraise for AIS work for an existing grantmaker or create and fundraise for a new fund. Donate a lot of money yourself.
Core uncertainties: How tractable is this, relative to alleviating OP’s capacity bottleneck? How likely is this to be fixed by default, as we get more AIS interest? How much total philanthropic funding would be actually interested in AIS projects? How valuable is a grantmaker who potentially doesn’t share many of the core beliefs of the AIS ecosystem?
Chairing the Long-Term Future Fund
(Note that while I am an LTFF guest fund manager and have spoken with fund managers about this role, I do not have an offer from LTFF to chair the fund; as with the OP section, this is more something that I think is important and worth doing as opposed to something I can definitely do.)
As part of the move to separate the Long-Term Future Fund from Open Philanthropy, Asya Bergal plans to step down as LTFF Chair in October. This means that the LTFF will be left without a chair.
I think the LTFF plays an important role in the ecosystem, and it’s important for it to be run well. This is both because of its independence from OP and because it’s the primary source of small grants for independent researchers. My best guess is that a well-run LTFF could move $10m a year. On the other hand, if the LTFF fails, then I think this would be very bad for the ecosystem.
That being said, this seems like a pretty challenging position: not only is the LTFF currently very funding constrained (and with uncertain future funding prospects), but its position in Effective Ventures may also limit ambitious activities in the future.
How you can work on it: Fill in this Google form to express your interest.
Core uncertainties: Is it possible to raise significant amounts of funding for LTFF in the long run, and if so, how? How should the LTFF actually be run?
Community Building
I think that the community has done an incredible job of field building amongst university students and other junior/early-career people. Unfortunately, there’s a comparative lack of senior researchers in the field, causing a massive shortage of both research team leads and mentorship. I also think that recruiting senior researchers and REs to do AIS work is valuable in itself.
Onboarding senior academics and research engineers
The clearest way to get more senior academics or REs is to directly try to recruit them. It’s possible the best way for me to work on this is to go back to being a PhD student, and try to organize workshops or other field building projects. Here are some other things that might plausibly be good:
Connecting senior academics and REs with professors or other senior REs, who can help answer more questions and will likely be more persuasive than junior people without many legible credentials. Note that I don’t recommend doing this unless you have academic credentials and are relatively senior.
Creating research agendas with concrete projects and proving their academic viability by publishing early stage work in those research agendas, which would significantly help with recruiting academics.
Creating concrete research projects with heavy engineering slants and clear explanations of why these projects are alignment relevant; the lack of such projects seems to be a significant bottleneck for recruiting engineers.
Normal networking/hanging out/talking stuff.
Being a PhD student and influencing your professor/lab mates. My guess is the highest impact here is to do a PhD at a location with a small number of AIS-interested researchers, as opposed to going to a university without any AIS presence.
Note that senior researcher field building has gotten more interest in recent times; for example, CAIS has run a fellowship for senior philosophy PhD students and professors and Constellation has run a series of workshops for AI researchers. That being said, I think there’s still significant room for more technical people to contribute here.
How you can work on it: Be a technical AIS researcher with interest in field building, and do any of the projects listed above. Also consider becoming a PhD student.
Core uncertainties: How good is it to recruit more senior academics relative to recruiting many more junior people? How good is research or mentorship if it’s not targeted directly at the problems I think are most important?
Extending the young EA/AI researcher mentorship pipeline
I think the young EA/AI researcher pipeline does a great job getting people excited about the problem and bringing them in contact with the community, a fairly decent job helping them upskill (mainly due to MLAB variants, ARENA, and Neel Nanda/Callum McDougall’s mech interp materials), and a mediocre job of helping them get initial research opportunities (e.g. SERI MATS, the ERA Fellowship, SPAR). However, I think the conversion rate from that level into actual full-time jobs doing AIS research is quite poor.
I think this is primarily due to a lack of research mentorship for junior researchers and/or research management capacity at orgs, exacerbated by a lack of concrete projects for younger researchers to work on independently. The other issue is that many junior people can overly fixate on explicit AIS-branded programs. Historically, all the AIS researchers who’ve been around for more than a few years got there without going through much of (or even any of) the current AIS pipeline. (See also the discussion in Evaluations of new AI safety researchers can be noisy.)
Many of the solutions here look very similar to ways to onboard senior academics and research engineers, but there are a few other ones:
Encourage and help promising researchers pursue PhDs.
Create and fund more internship programs in academia, to use pre-existing research mentorship capacity.
Run more internship or fellowship programs that lead directly to full-time jobs, in collaboration with (or just from) AIS orgs.
Come up with a promising AIS research agenda, and then work at an org and recruit junior researchers.
In addition, you could mentor more people yourself if you’re currently working as a senior researcher!
How you can work on it: Onboard more senior people into AIS. Encourage more senior researchers to mentor more new researchers. Create programs that make use of existing mentorship capacity, or that lead more directly to full-time jobs at AIS orgs.
Core uncertainties: How valuable are more junior researchers compared to more senior ones? How long does it take for a junior researcher to reach certain levels of productivity? How bad are the bottlenecks, really, from the perspective of orgs? (E.g. it doesn’t seem implausible to me that the most capable and motivated young researchers are doing fine.)
Writing blog posts or takes in general
Finally, I do enjoy writing a lot, and I would like to have the time to write a lot of my ideas (or even other people’s ideas) into blog posts.
Admittedly, this is primarily personal satisfaction–motivated and less impact-driven, but I do think that writing things (and then talking to people about them) is a good way to make things happen in this community. I imagine that the primary audience of these writeups will be other alignment researchers, and not the general LessWrong audience.
Here’s an incomplete list of blog posts I started in the last year that I unfortunately didn’t have the time to finish:
Ryan Greenblatt’s takes on why we should do ambitious mech interp (and avoid narrow or limited mech interp), which I broadly agree with.
Why most techniques for AI control or alignment would fail if a very powerful unaligned AI (an ‘alien jupiter brain’) manifested inside your datacenter, and why that might be okay anyways.
Why a lot of methods of optimizing or finetuning pretrained models (RLHF, BoN, quantilization, DPO, etc) are basically equivalent modulo (in theory) optimization difficulties or priors, and why people’s intuitions on differences between them likely come down to imagining different amounts of optimization power applied by different algorithms. (And my best guess as to the reasons for why they are significantly different in practice.)
The case for related work sections.
There are (very) important jobs besides technical AI research and how we as the community could do a better job at not discouraging people from taking them.
Why the community should spend 50% less time talking about explicit status considerations.
There’s some chance I’ll try to write more blog posts in my spare time, but this depends on how busy I am otherwise.
How you can work on it: Figure out areas where people are confused, come up with takes that would make them less confused or find people with good takes in those areas, and write them up into clear blog posts.
Core uncertainties: How much impact do blog posts and writing have in general, and how impactful has my work been in particular? Who is the intended audience for these posts, and will they actually read them?
Anecdotally, it’s been decently easy for AIS orgs such as ARC Evals and FAR AI to raise money from independent, non-OP/SFF/LTFF sources this year.
Aside from the impact-based arguments, I also think it’s pretty bad from a deontological standpoint to convince many people to drop out or make massive career changes with explicit or implicit promises of funding and support, and then pull the rug from under them.
In fact, it seems very likely that I’ll do this anyway, just for the value of information.
For example, a high degree of understanding would provide ways to detect deceptive alignment, elicit latent knowledge, or provide better oversight; a very high degree of understanding may even allow us to do microscope or well-founded AI.
This is not a novel view; it’s also discussed under different names in other blog posts such as ‘Fundamental’ vs ‘applied’ mechanistic interpretability research, A Longlist of Theories of Impact for Interpretability, and Interpretability Dreams.
As the worst instance of this, the best way to understand a lot of AIS research in 2022 was “hang out at lunch in Constellation”.
The grants database lists ~$68m worth of public grants given out in 2023 for Longtermism/AI x-risk/Community Building (Longtermism), of which ~$28m was given to AI x-risk and ~$32m was given to community building. However, OP gives out significant amounts of money via grants that aren’t public.
This is tricky to estimate since the SFF has given out significantly more money in the first half of 2023 (~$21m) than it did in all of 2022 (~$13m).
CEA also gives out single-digit millions of dollars’ worth of funding every year, mainly to student groups and EAGx events.
This seems quite unlikely to be my comparative advantage, and it’s not clear it’s worth doing at all – for example, many of the impressive young researchers in past generations have made it through without even the equivalent of SERI MATS.