A Proposal for a Better ARENA: Shifting from Teaching to Research Sprints
TLDR
I propose restructuring the current ARENA program, which primarily focuses on contained exercises, into a more scalable and research-engineering-focused model consisting of four one-week research sprints preceded by a dedicated “Week Zero” of fundamental research engineering training. The primary reasons are:
The bottleneck for creating good AI safety researchers isn’t the kind of knowledge contained in the ARENA notebooks, but the hands-on research engineering and research skills involved in day-to-day research.
I think the current version of ARENA primarily functions as a signaling mechanism in the current state of the AI safety ecosystem.
[Edit: as discussed in the comments, on reflection the scalability is not a primary issue or benefit.]
Context and disclaimers
This post was written using Superwhisper and then asking Gemini to transcribe into a blog post format. I have done some light editing. Some of this might look like AI slop. I apologize, but I think the value of this post is pretty good as is, and it is not a good use of my time to refine it further.
I am not saying that ARENA is not valuable. ARENA is obviously valuable, and deserves the high reputation it has in the AI safety ecosystem.
Why am I well positioned to think about this? In the past year and a half, I have participated in a large slew of AI safety schemes, both as a participant and as a teacher or lead. This includes ML4Good, both as a participant and as a TA; SPAR as a participant; AI Safety Camp as a project lead; ARENA as a participant and as a TA; Algoverse both as a mentor and as a participant; BlueDot both as a participant and a facilitator; and ARBOx as a TA. Furthermore, I am currently a research manager at MATS, so I’m getting a close-up view of what skills are required to do high-quality AI safety research.
The views expressed here are my own and do not necessarily reflect the views of MATS.
The Core Problem with the Current ARENA
My primary concern is that the skills learned in the current ARENA program are not the bottleneck for the AI Safety ecosystem.
Skills Mismatch: AI safety research involves self-directed coding (with LLMs), making decisions about experimental design, setting up infrastructure, research taste, etc. In contrast, ARENA exercises are typically small, well-contained, and have a black-and-white correct answer with pre-provided unit tests, removing the crucial element of uncertainty and decision-making present in real research.
Signaling vs. Upskilling: Based on my experience, the biggest benefit of the current program to the AI Safety community appears to be as a signaling mechanism for other programs. Two pieces of evidence: first, many ARENA participants have already done AI safety research before participating; second, at least four ARBOx alumni (ARBOx is a 2-week compressed version of ARENA) are now doing elite AI safety fellowships (1 Anthropic Fellows Program, 2 LASR Labs, 1 MATS).
Scalability Bottleneck: ARENA is fundamentally not scalable due to its reliance on TAs and the hands-on teaching model. MATS, for example, is scaling much faster (200+ people/year) than ARENA (approx. 75 people/year: 25 participants/cohort, three cohorts/year).
The Proposed Research Sprint Format
The alternative program structure would be a four-week sequence of mini-research sprints, with each week having a different AI safety theme, plus an introductory Week Zero. This aligns with the advice from researchers like Neel Nanda on upskilling in mechanistic interpretability—study the relevant material, then start mini-sprints.
Application Process: ARENA Knowledge as a Prerequisite
The content of the existing ARENA notebooks could be a prerequisite for the new program.
Automated Testing: The application process would involve a test of familiarity with the content, possibly using automated quizzes or Anki flashcards created by the ARENA team (a minimal sketch of such a check follows this list). This removes the need for TAs, whose marginal value as explainers will only diminish as LLMs improve, and frees up staff time.
Standard Selection: Other standard selection criteria used by programs like SPAR, Algoverse, and AI Safety Camp would still apply.
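To make the automated prerequisite check concrete, here is a minimal sketch of what an auto-graded quiz could look like. Everything in it (the Question class, the QUESTION_BANK, the 0.8 threshold) is purely illustrative and hypothetical, not an existing ARENA tool.

```python
from dataclasses import dataclass

@dataclass
class Question:
    notebook: str        # which prerequisite notebook the question is drawn from
    prompt: str
    options: list[str]
    answer_index: int    # index of the correct option

# Illustrative question bank; a real one might hold ~30 questions across several notebooks.
QUESTION_BANK = [
    Question(
        notebook="transformers",
        prompt="What does the residual stream carry between transformer layers?",
        options=["Attention patterns", "A running sum of layer outputs", "Raw token IDs"],
        answer_index=1,
    ),
]

PASS_THRESHOLD = 0.8  # e.g. the 80% pass rate floated later in this post


def grade(responses: dict[int, int]) -> tuple[float, bool]:
    """Take a mapping of question index -> chosen option index; return (score, passed)."""
    correct = sum(
        1 for i, q in enumerate(QUESTION_BANK) if responses.get(i) == q.answer_index
    )
    score = correct / len(QUESTION_BANK)
    return score, score >= PASS_THRESHOLD


if __name__ == "__main__":
    score, passed = grade({0: 1})
    print(f"Score: {score:.0%} | Passed: {passed}")
```

The point is only that grading applicants against the notebook content requires no TA time; the same question bank could equally back a web form or a deck of Anki cards.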
Program Structure
| Week | Theme/Focus | Goal |
| --- | --- | --- |
| Week Zero | Research Engineering & Soft Skills | Dedicated training on modern tools, workflows, and non-technical skills. |
| Week 1 | Mech Interp Sprint | Focused one-week research project. |
| Week 2 | Evals Sprint | Focused one-week research project. |
| Week 3 | Fine-Tuning/RL Model Organisms Sprint | Focused one-week research project. |
| Week 4 | Choice/Software Engineering Sprint | Participants choose a deep-dive topic or contribute to open-source packages. |
Week Zero: Dedicated Training
The goal for this (optional) week is to teach the actual skills needed for research.
Engineering Workflows: Go over fundamental modern research engineering skills. This includes setting up infrastructure, estimating GPU needs (a back-of-the-envelope example appears at the end of this section), making use of LLMs for coding (e.g., Claude Code), sandboxing with Docker, etc. One version of this is just spending a day or two going through all the (relevant) tips in Tips and Code for Empirical AI Safety Research.
Broader Skills: Teach skills essential for a successful researcher that often get neglected:
Theory of impact exercises and AI Safety strategy
Project management frameworks
Reflection structures (e.g., what went well that day)
Applied rationality exercises
Collaboration and conflict resolution
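As an example of the engineering content Week Zero could cover, here is a back-of-the-envelope GPU memory estimate for full-parameter fine-tuning. The numbers are common rules of thumb (bf16 weights and gradients, fp32 Adam moments, a rough fudge factor for activations), and the function name and defaults are my own illustration rather than anything from the ARENA materials.

```python
def finetune_memory_gb(
    n_params: float,
    weight_bytes: int = 2,             # bf16 weights
    grad_bytes: int = 2,               # bf16 gradients
    optim_bytes: int = 8,              # Adam: two fp32 moment tensors per parameter
    activation_overhead: float = 1.2,  # crude multiplier for activations and buffers
) -> float:
    """Rough GPU memory needed for full-parameter fine-tuning, in GB."""
    per_param = weight_bytes + grad_bytes + optim_bytes
    return n_params * per_param * activation_overhead / 1e9


if __name__ == "__main__":
    # e.g. a 7B-parameter model: roughly 100 GB, i.e. multiple GPUs or a PEFT method like LoRA
    print(f"~{finetune_memory_gb(7e9):.0f} GB")
```

Making an estimate like this before renting hardware, then checking it against nvidia-smi on a real run, is the kind of workflow exercise a Week Zero session could walk through.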
The Software Engineering Week
A potential alternative for Week 4 is a pure Software Engineering Week, where participants contribute to open-source packages in collaboration with open-source maintainers. This is an excellent way to teach hard software engineering skills and build up “taste” for good software, which is a growing concern with the rise of LLM coding.
Partnership and Mentoring
To maximize value, ARENA could partner with research programs like MATS.
One-Week Mentors: Existing participants from the partner programs (e.g., MATS) would serve as one-week project mentors during the themed sprints.
Benefits for Mentors: This provides low-stakes, time-boxed practice for mentoring, which is a hard skill to learn—a significant advantage over a three-month-long project lead role (which I had in AI Safety Camp and found challenging).
Benefits for Participants: Increased networking, access to active researchers, direct guidance, and more relevant research directions set by the mentors.
ML4Good best practices
Any new structure should embed the good practices of programs like ML4Good to create a positive learning environment, a sense of community, and a safe space for both personal and technical growth. For details, see my post about it.
Scalability
[Edit: I no longer think this is an important or defining feature.]
The new model is significantly easier to scale:
Ditching TAs/Teaching: Eliminates a primary bottleneck for the current program.
Flexible Structure: The format is fundamentally flexible; anyone can pick it up and run a mini-sprint. The core ARENA team can provide structures and instructions to enable bottom-up organization globally. For example, one can imagine something like “a one-week Mech Interp sprint at this university, where the application is 30 questions based on 4 notebooks with an 80% pass rate”.
Online Feasibility: There is no fundamental reason this cannot be done online, similar to how Apart runs global hackathons.
Potential Downsides
[Edit: the strongest downsides have been suggested by commenters. Humans still have an edge over AI.]
One potential downside is the reduced incentive for the ARENA team to create new ARENA-style notebooks (e.g., for control research). However, since the team is already heavily bottlenecked on time for new notebook development, this might not be a real disadvantage. Both systems suffer from the same staffing problem.
Another downside is the implication that this has to replace ARENA; it could just be a separate, parallel initiative. However, I do actually believe that the ARENA team and ARENA participants would be better served by moving toward the model I am suggesting.
I am actually struggling to think of downsides. I asked Gemini and here are its thoughts along with my counters:
- Exclusion of true beginners / high barrier to entry.
  - ARENA already has a high barrier to entry.
- Risk of superficial projects and a high failure rate.
  - The primary goal is upskilling, and high failure rates are normal in research anyway.
- Inadequate mentoring depth in only one week.
  - The primary aim isn’t to provide deep mentoring. Furthermore, if the mentor-mentee relationship is positive during the one week, there is nothing stopping them from collaborating in the long run.
- Gaming the automated prerequisite system.
  - Yes, that is a risk, but it is not a fundamental flaw of the idea. Furthermore, I am hoping that selection effects (which is where gaming of entry requirements matters most, I think) play a smaller role in this new format, because there are concrete outputs that can be judged instead. E.g. if somebody cheats their way in but produces bad outputs, then people running other programs should be wary. In other words, if people say they attended this program, they should be obliged to share their outputs to help other programs evaluate them properly.
- Dilution of the brand if people globally run ‘bottom-up’ versions.
  - This is an already-solved problem. Only the ARENA team can use the official brand, and anyone running a version of the ARENA program independently is obliged to acknowledge ARENA and to make explicit that they are not running an official ARENA program.
I asked Claude to review this post and it came up with some other downsides. Again, Claude’s comments are followed by my responses.
- What if a one-week sprint produces low-quality research that hurts participants’ confidence?
  - This is part of the learning experience, and we should have structures and advice in place to help deal with it.
- What if rapid churn of different themes prevents deep learning?
  - This matches the pace of ARENA.
- What if removing the “ARENA experience” community-building aspect (by going online/distributed) reduces downstream networking value?
  - I am not suggesting removing the in-person experience.
Making It Happen
If you think this is a good idea, then the obvious question is how do we make this happen? Unfortunately, I probably don’t have the time to make this happen, but I’d definitely like to be involved. Possible next steps include:
Forming a core organizing group and writing a grant to CG. [Interesting side note: Gemini hallucinated here and instead recommended an ACX Grant, which I did not provide in its context. But my instructions to Gemini did mention that I wanted a LessWrong post.]
Trying a pilot iteration on a low-cost basis, such as at the EA Hotel.
Fleshing out details. I know many of the ideas above are preliminary, but there’s enough of a skeleton to get things going.
If you have any feedback or want to get involved, please share in the comments.
> The content of the existing ARENA notebooks could be a prerequisite for the new program
I don’t think this would work very well. If you were super disciplined and you took one day every two weeks to work through one notebook, you’d spend most of a year just to qualify for the program.
Also, shifting ARENA to focus on research sprints would, in a sense, reduce the diversity of the field, in that most other programs focus more on research than on developing ML skills. If one program were to shift to doing research sprints, I suspect it’d actually be better for a program that already focuses on research to do that.
> If you were super disciplined and you took one day every two weeks to work through one notebook, you’d spend most of a year just to qualify for the program
I believe: 1) you don’t need to diligently work through a whole notebook to get most of the value of the notebook and 2) the majority of the value of ARENA is contained in a subset of the notebooks. Some reasons:
1a) The notebooks are often, by design, far more work than is possible to do in a day, even in ARENA, where you have pair programming, TAs on hand, a great co-working space, lunch and dinner provided, etc. Note that a ‘day’ here is roughly 5.5-6.5 hours (10am to 6pm, with a morning lecture at 10, lunch at 12, and a break at 3.30).
1b) Even the shorter notebooks are often only manageable to complete in a day if you skip some exercises, or cheat and peek at the solution. (This is recommended by ARENA, and I agree with this recommendation given the time constraints.)
1c) There are 5 (or 6) LARGE mech interp notebooks for the final three days of mech interp week. One recommendation is to try two notebooks on the Wed and Thu, then continue with the one you prefer on Friday. So I saw 2 out of the 5 notebooks when I participated in ARENA. Despite this, I was still able to TA during mech interp week. It was a bit frantic, but I would skim the start of each of the notebooks I didn’t understand, enough that I could help people get unblocked or explain key ideas. I feel I got a good percentage of the value that the ARENA participants got out of those other notebooks without having done a single exercise in them.
2a) In ARBOx2, the schedule was (each comma separates a day):
- Week 1: CNNs, Transformers, Induction circuit, IoI circuit, [another mech interp notebook; I can’t remember which, likely SAEs]
- Week 2: RL day 1, RL day 2, project, project, project.
The biggest thing missing from this, IMO, is the theory of impact exercise from the second half of ARENA evals day 1. Otherwise, for the calibre of people doing ARENA, a quick skim of the other notebooks gives the majority of the value.
I would recommend ARBOx over ARENA because of the time efficiency: you get a high percentage of the value of ARENA in 40% of the time.
> most other programs focus more on research than developing ML skills
I don’t think ARENA focuses on ML skills. Week 0 has content directly on supervised ML, and covers only a small (but crucial!) part of ML, namely writing networks in PyTorch and creating training loops. Week 2 has content on RL. But given time constraints, many other parts of ML aren’t covered in depth, e.g. how to do hyper-parameter tuning (most of the time you just use the hyper-parameters provided; there’s no time to actually do hyper-parameter tuning), how to even tell if hyper-parameters are the issue, data collection and cleaning, cluster management, selecting GPUs, etc.
Some brief feedback on the structure:
Realistically, students will be rusty and not able to immediately understand/code all four research areas. As a group, they likely will decide to review/redo relevant ARENA sections to ensure shared understanding.
Due to this, I suspect the organizers will have to provide very specific, ~20-30 hour projects with concrete goals/scaffolding. Otherwise, students will feel lost or overwhelmed having only a week both to get up to speed and do the research project. This is time for quick initial explorations but not much else.
It’s very likely this program will also need TAs. Given how quickly it switches topics, students will have tooling/code issues that take hours to resolve. They may also get lost or stuck not knowing what to try next.
IMO ARENA is about bringing skilled coders to the research frontier, and showing how to quickly run experiments. If you instead make ARENA a prereq, you will lose out on many talented coders who don’t have time to complete it independently. So I would consider this more of a follow-up to ARENA that teaches research skills than a replacement.
I think this is a relatively interesting idea as a stand-alone thing (i.e. running more week-long research sprints). This seems good, like it would be pretty useful in the AI safety community, and it’s a niche that is underserved at the moment.
I don’t really think the framing of this as an alternative to ARENA makes much sense. Because ARENA’s main bottleneck to scaling hasn’t really been TAs. I mean it’s a bottleneck, don’t get me wrong, but I wouldn’t call it the main one. I also think having a less-structured, more research-driven model would probably require more TA involvement? If you wanted to do it well and have it be accessible to early-stage people, at least.
I’m confused about the evidence given that ARENA functions primarily as a signaling mechanism. I think that is certainly an aspect of ARENA (as with MATS, as with the completion of any high-quality program). But the fact that some people have done AI safety research before ARENA is weak evidence of this to me (I could go into more detail about this in private, but not willing to do so publicly since it pertains to ARENA’s selection process which I do not want to be gamed).
The fact that people from compressed versions of ARENA (ARBOx, which is a great program!) also go on to do great things doesn’t seem like evidence of this to me. In fact, this seems like evidence that completing a structured curriculum just makes you a better applicant to other programmes. Not sure how this is being interpreted (since this depends on what you perceive to be the signaling value of ARBOx, which I think is probably not as high as ARENA’s. I think it should be higher because it’s a great programme!)
Also we do put out impact reports where people self-assess as having improved their ability to do certain concrete tasks that we think are valuable in research. Won’t go into it in detail here because we’ve done so in impact reports in the past. E.g. The report for ARENA 6 is here https://www.lesswrong.com/posts/WiergDX4ufcLbvwzs/arena-6-0-impact-report (Side note that I would love to see more reports like this from other programs in the ecosystem, and on a more regular basis).
I have more thoughts but will leave it there. Think this project is probably worth pursuing independently (although if you’re getting early-stage people to do research-sprints, as I said earlier, I think you do need high-quality and involved mentorship that looks similar to TAs). Also think there’s a lot of people doing somewhat similar things, but maybe not quite 1-week sprints.
Thanks, James, for the detailed thoughts and for reading through the post. I’ll respond once here. If we want further back and forth, better to have a chat in private so we can iron out our cruxes (and then summarize for community benefit). I’d also want to hear what others in community think before committing to anything.
> Because ARENA’s main bottleneck to scaling hasn’t really been TAs
I am happy to defer to you regarding the scaling bottlenecks of ARENA. That’s not a big crux for the proposal.
> I’m confused about the evidence given that ARENA functions primarily as a signaling mechanism
Maybe the word signaling isn’t correct. Let me try to explain. When I point out that there are four people who did ARBOx and are now doing elite fellowship programs, my hunch is that those four had a very good chance of getting into those elite programs even if they hadn’t done ARBOx. Furthermore, if ARBOx did provide a significant boost to their profile/skillset, then one needs to consider how much extra value the extra three weeks at ARENA are providing. Another way of saying this is that ARBOx, ARENA, and these elite programs have similar selection processes, and so ARENA or ARBOx accepting someone is strongly correlated with that person having high potential for future AI safety research, regardless of how much value the program adds on top.
> I also think having a less-structured, more research-driven model would probably require more TA involvement? If you wanted to do it well and have it be accessible to early-stage people, at least.
I do not consider participants of ARENA to be ‘early stage’. In my mind they are mid-stage (i.e. in the middle of upskilling towards a full-time researcher role), and most participants would be able to do solid research sprints without having gone through ARENA. My proposal is based on helping such mid-stage researchers. I think something like BlueDot (at least BlueDot in 2024; I don’t know about current BlueDot) or AISC targets early-stage researchers.
> Also we do put out impact reports where people self-assess as having improved their ability
My claim (which I have not really justified, except to defer to Neel Nanda’s post) is that the counterfactual of doing four mini research sprints would be significantly higher impact. This could be the central crux.
> Side note that I would love to see more reports like this from other programs in the ecosystem, and on a more regular basis
100%. Thanks for doing this and being a role model!
Sure, I agree with most of that. I think this is probably mostly based on counterfactuals being hard to measure, in two senses:
The first is the counterfactual where participants aren’t selected for ARENA: do they then go on to do good things? We’ve taken a look at this (unpublished) and found that for people who are on the margin, attendance at ARENA has an effect. But then that effect could be explained by signaling value; it’s basically difficult to say. This is why we try to do start-of-program and end-of-program surveys to measure this. But different viewpoints are available here because it is difficult to measure definitively.
The second is the counterfactual where people spend 4 weeks doing research sprints. I basically do expect that to be more effective if you require the ARENA materials as prerequisites, but I think it would then be hard to actually get applicants to such a programme (since people generally struggle to work through ARENA materials themselves). But maybe something else could work here. I actually kind of expect the counterfactual of that to be pretty low due to margin-based reasoning, where there exist many research-oriented programmes already, but relatively fewer upskilling-oriented programmes. But again, difficult to know definitively what’s more valuable on current margins (though I do think on current margins is the relevant question).
My guess is these are the two cruxes? But unsure.
This is not a crux for me. I believe ARENA provides counterfactual value compared to not doing ARENA: you work much harder during ARENA than you otherwise would, in a great environment, with great support, etc.
> The second is the counterfactual where people spend 4 weeks doing research sprints.
This is a crux. And agreed, it is hard to measure!
Thanks for engaging thoughtfully. Useful to think things through.
I don’t think this is evidence that ARENA is about signalling:
1. MATS and LASR are completely different from ARENA. I wouldn’t find anything fishy about someone doing ARENA after LASR/MATS.
2. If anything, someone doing ARENA after MATS is evidence against the hypothesis, because the signal of (MATS+ARENA) is no better than MATS alone.
3. It sounds a little odd to do ARBOx after Anthropic Fellows. But I’d guess they wanted a fun two weeks hanging out in Oxford with other young people. Not that they thought the signal of (AFP + ARBOx) was so much better than AFP alone (which sounds absurd to me).
[minor correction]
I agree this would be odd; I think you misinterpreted—those alumni did the other AI safety fellowships after participating in ARBOx.
On 3., I think the four ARBOx fellows cited above did ARBOx first, then went on to AFP etc. I understand the argument to be “despite ARBOx being only two weeks, it has a good placement record, so why is ARENA 2.5x longer?”
I’ve worked on admissions/strategy/evaluation for ARBOx2 and ARBOx3, and I’d broadly endorse JamesH’s comments—I’d be very excited to see the programme you describe, but as a next step/complement to ARENA-type programmes rather than a substitute.
A few thoughts on the signalling/upskilling conversation:
Based on surveys and private anecdata, it seems like a large part of ARBOx’s value comes from building confidence, motivation, and a sense of direction for participants. I think it’s hard to replicate this benefit through self-studying the ARENA curriculum, and I’d say it’s a significant component of ARBOx’s past/ongoing impact. This could explain ARBOx’s strong placement record despite a compressed runtime (since these benefits probably scale sub-linearly with more weeks).
Moreover, participants reported that ARBOx was helpful for them “developing technical skills in ML for AI safety” (avg 9.15/10 agreement, much higher than statements pertaining to perceived signalling benefits). Of course, different participants benefit from ARBOx in different ways, and end-of-programme self-reports can be misleading (we’re doing a six-month follow-up now), but I think the “signalling” story is at best incomplete.
I also don’t think the signalling story follows from many ARENA/ARBOx applicants having prior safety research experience. Eyeballing the survey data, participants with (more) prior safety research experience didn’t find ARBOx any less useful for developing their technical skills (if anything, the opposite is true, although there are too many confounders to infer anything meaningful).
A few thoughts on time-efficiency:
I do think ARBOx is pretty time-efficient, which is very important to a lot of our participants (e.g. it fits in the two weeks between New Year and Oxford term starting). I don’t dispute that a minority of ARENA notebooks comprise the majority of the curriculum’s value, and that many ARBOx participants get >40% of the value they’d get from ARENA in 40% of the time. For this reason, I’d be excited to see more people put on ARBOx-style programmes.
In some cases, though, participants aren’t so time-constrained, and gains per week isn’t really the thing you care about: it seems like most participants would upskill faster from 5 weeks of ARENA + a few week-long sprints than from 2 weeks of ARBOx + 3 weeks of ??? + a few week-long sprints. For what it’s worth: 14/19 ARBOx2 survey respondents reported they were happy with its length, 5/19 would have preferred a longer programme, and nobody wanted a shorter programme.