Lessons from a failed ambitious alignment program
I tried to organize a program where participants sincerely tried to solve the hard part of alignment, running for up to 5 weeks. A lot went wrong, largely due to fixable mistakes.
Good things about the program:
I learnt a lot about alignment while doing the prep: I interviewed a lot of agent foundations researchers, learnt some neuromorality, talked to some neuroscientists, and saw a lot of what it takes to organize an event like this and have it go well.
Marketing went pretty well—we got 298 applicants, ~50% of whom had PhDs, including senior researchers/engineers from Nvidia, Meta, AMD, AWS, etc. It also feels like there’s a lot left on the table for marketing; lots more I can do.
I also met senior researchers who gave a lot of useful feedback on my alignment research. One of them I’m going to help get paid roles for their AI Safety knowledge, and they’re also going to join the team; another has already joined the team and is leading our theory research right now.
I also made lots and lots of mistakes.
Essentially, the actual program started 2 weeks late and quite disorganized, meaning that the participation rate was much, much lower than the signup rate—about 15 people actually took part, whereas 298 signed up.
The reasons for this and what I’m going to do to make sure it doesn’t happen again:
One:
- I promised that the first 300 applicants would be guaranteed personalized feedback, thinking I could delegate this to other, more technical members of the team.
However, it turned out that to give useful feedback and judge whether someone was a good fit for the program, a person didn’t just need technical knowledge—they needed good communication skills, an understanding of what alignment research requires, consistent availability for several hours a day to actually go through the applications, and an interest in doing so. The only person who fit all those criteria at the time was me. So I couldn’t delegate.
Also, a teammate built reviewing software, which he said would help build a Bradley-Terry model of the applicants as we reviewed them. I had a feeling this might be overcomplicated, but I didn’t want to say no or react negatively to someone’s enthusiasm for doing free work for something I care about.
It turned out that constantly fixing, improving, and finagling with the software took several days, and it was faster to just do the reviewing manually.
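For the curious, here’s what that kind of tool is aiming at: a Bradley-Terry model turns pairwise “applicant A looked stronger than applicant B” reviewer judgments into a single strength score per applicant. Below is a minimal sketch of the standard fitting iteration—this is not the teammate’s actual code, and the applicant names and judgments are invented:

```python
from collections import defaultdict

def bradley_terry(comparisons, n_iters=100):
    """comparisons: list of (winner, loser) pairs of applicant IDs."""
    applicants = {a for pair in comparisons for a in pair}
    wins = defaultdict(int)      # number of comparisons each applicant "won"
    matches = defaultdict(int)   # number of comparisons per unordered pair
    for winner, loser in comparisons:
        wins[winner] += 1
        matches[frozenset((winner, loser))] += 1

    strength = {a: 1.0 for a in applicants}  # initial scores
    for _ in range(n_iters):
        new = {}
        for a in applicants:
            # Standard MM update: wins_a / sum over opponents b of n_ab / (p_a + p_b)
            denom = 0.0
            for b in applicants:
                pair = frozenset((a, b))
                if b != a and pair in matches:
                    denom += matches[pair] / (strength[a] + strength[b])
            new[a] = wins[a] / denom if denom > 0 else strength[a]
        total = sum(new.values())
        strength = {a: s / total for a, s in new.items()}  # normalize to sum to 1
    return sorted(strength.items(), key=lambda kv: -kv[1])  # strongest first

# Invented example: three hypothetical applicants, four pairwise judgments.
print(bradley_terry([("A", "B"), ("A", "C"), ("B", "C"), ("A", "B")]))
```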
What I’ll be doing next time to make sure this doesn’t happen:
- Only promising feedback to the first 50 applicants.
- Having pre-prepared lines for the rest, with the general reason they weren’t accepted—e.g. a lack of sufficient maths experience, without software engineering/neuroscience/philosophy to compensate, meaning they would be unlikely to get useful alignment theory work done in 5 weeks.
- Doing things manually, not experimenting with custom software last minute.
- Announcing the program earlier—giving ourselves at least 3 months to prepare things.
Two:
- Making the Research Guides for the different tracks turned out to be much, much, much harder than I thought it would be, including for other, more technical teammates. I thought that making just a high-level guide would be relatively okay, but instead it turned out there was lots and lots of reading to do, and lots of preliminary reading and maths learning needed to understand that reading, and it was very hard. This also delayed the start of the Moonshot Alignment Program a lot.
What I’ll be doing next time to make sure that this doesn’t happen:
- Starting out with just a reading list and links to things like https://www.alignmentforum.org/s/n7qFxakSnxGuvmYAX, https://drive.google.com/file/d/1aKvftxhG_NL2kfG3tNmtM8a2y9Q1wFHb/view (source with comments: https://www.lesswrong.com/posts/PRwQ6eMaEkTX2uks3/infra-exercises-part-1), etc
- Personally learning a lot more math
- Having a draft of the reading list, and asking for help from more experienced alignment researchers, such as Gurkenglas, the_gears_of_ascension, Lorxus, etc., earlier
Major changes I’ve made since:
- brought on a teammate who is looking to become a co-founder: very competent, well organized, and with a good technical foundation
- learnt more math and alignment philosophy
- brought much more technical people onto the team—Gurkenglas and Roman Malov—who are also teaching me a lot, pointing out lots and lots of flaws in my ideas, and updating me fast—especially Gurkenglas at the moment, since we’re living very close to each other
- changed my management style at AI Plans—no longer having weekly progress meetings or trying to manage everything myself in Linear or other task-management tools; instead, just having a one-on-one call with each teammate once a week, to learn what they want to do, what problems they’re having, and what’s available for them to do, and to decide what they’ll do
- moved to CEEALAR, which has been much, much, much (1/2, total of 2.5) better for my stress, anxiety, insecurity, mental health, etc.
- from this, I’ve also gained friendships/contacts with more senior alignment researchers who I can and will ask for help
Interesting!
If you had to choose between starting on time and giving feedback to all rejected applicants, I think it would have been better to do the former and drop the latter. Or do the feedback a month after the program—it is clearly not part of the program’s critical value-add.
Failing ambitiously and visibly seems high value in that:
- It connects you with people who can help you do better in the future.
- It teaches others not just about ways to succeed, but about ways to fail that they need to avoid.
- It encourages others to try, both by setting an example of things it is possible to attempt, and by reducing fear of embarrassment.
Thank you for all you do! I look forward to your future failures and hopefully future successes!
Yes, it did!! Interviewed a lot of researchers in the prep for this, learnt a lot and met people, some of whom are now on the team and others who are also helping.
Yup!! Definitely learnt a lot!
I hope so! I would like more people in general to be seriously trying to solve alignment! Especially in a way that’s engaging with the actual problem and not just prosaic stuff!
Thank you so much!! This was lovely to read!
Sometime it would be cool to have a conversation about what you mean by this, because I feel the same way much of the time, but I also feel there’s so much going on it’s impossible to have a strong grasp on what everyone is working on.
Did you see the Shallow review of technical AI safety, 2025? Even just the “white-box” and “theory” sections I’m interested in have months’ worth of content, if I were trying to understand them in reasonable depth.
Yes! I’m writing a post on that today!! I want it to become something people can read to fully understand the alignment problem, as best as it’s currently known, without needing to read a single thing on LessWrong, Arbital, etc. I’m very lucky at the moment: I’m living with a bunch of experienced alignment researchers and learning a lot.
it’s here: https://docs.google.com/document/d/1rzJTwXT5CvF2IVKwlf72P-56bUZfwqxWO9tB8GU05Fo/edit?tab=t.w7ardgvsnt5g
also, happy to just have a call:
Not properly yet—I saw that DeepSeek’s Speciale wasn’t mentioned in the China section, but to be fair, it’s a safety review, not a capabilities review. I do like this project a lot in general—I’m thinking of doing more critique-a-thons and reviving the peer review platform, so that we can have more thorough reviews.