Running https://aiplans.org
Working full-time on the alignment problem.
what the fuck. somehow, this was outside my current expectations
this may make little to negative sense, if you don’t have a lot of context:
thinking about when I’ve been trying to bring together Love and Truth—Vyas talked about this already in the Upanishads. “Having renounced (the unreal), enjoy (the real). Do not covet the wealth of any man”. Having renounced lies, enjoy the truth. And my recent thing has been trying to do more of exactly that—enjoying. And ‘do not covet the wealth of any man’ includes ourselves. So not being attached to the outcomes of my work, enjoying it as its own thing—if it succeeds, if it fails, either way I can enjoy the present moment. And this doesn’t mean just enjoying things no matter what—I’ll enjoy a path more if it brings me to success, since that’s closer to Truth—and it’s enjoying the Real.
the Real truth is that each time I do something, try something, reach out and explore where I’m uncertain, I’m learning more about what’s Real. There are words missing here: from me not saying them despite being able to say them, from me not knowing how to say them, and from me not knowing that they should be here, for my Words to be more True. But either way, I have found enjoyment in writing them.
the way you phrased it there seems fine
Too, too much of the current alignment work is not only not useful, but actively bad and making things worse. The most egregious example of this, to me, is capability evals. Capability evals, like any eval, can be useful for seeing which algorithms are more successful at finding optimizers that are good at tasks—and in a world where it seems like intelligence generalizes, this means that every public capability eval, like FrontierMath, EnigmaEval, Humanity’s Last Exam, etc., helps AI Capability companies figure out which algorithms to invest more compute in, and test new algorithms.
We need a very, very clear differentiation between what precisely is helping solve alignment and what isn’t.
But there will be the response of ‘we don’t know for sure what is or isn’t helping alignment since we don’t know what exactly solving alignment looks like!!’.
Having ambiguity and unknowns due to an unsolved problem doesn’t mean that literally every single thing has an equal possibility of being useful, and I challenge anyone to say so seriously and honestly.
We don’t have literally zero information—so we can certainly make some estimates and predictions. And it seems quite a safe prediction to me that capability evals help capabilities much, much more than alignment—and I don’t think they give more time for alignment to be solved either; instead, they do the opposite.
To put it bluntly—making a capability eval reduces all of our lifespans.
It should absolutely be possible to make this. Yet it has not been done. We can spend many hours speculating as to why. And I can understand that urge.
But I’d much much rather just solve this.
I will bang my head on the wall again and again and again and again. So help me god, by the end of January, this is going to exist.
I believe it should be obvious why this is useful for alignment being solved and for humanity surviving in general.
But in case it’s not:
If we want billions of people to take action and do things such as vote for candidates based on whether they’re being sensible on AI Safety or not, we need to tell them exactly why. Do not do the galaxy-brained idea of ‘oh we’ll have side things work, we’ll go directly to the politicians instead, we’ll trick people with X, Y, Z’. Stop that, it will not work. Speak the Truth, plainly and clearly. The enemies have skills and resources in Lies. We have Truth. Let’s use it.
If we want thousands of people to do useful AI Alignment research and usefully contribute to solving the problem, they need to know what it actually is. If you believe that the alignment problem can be solved with fewer than 1000 people, try this—make a prediction about how many researchers worked on the Manhattan Project. Then look it up.
If we want nation states to ally and put AI Safety as a top priority, using it as a reason to put sanctions on and take other serious actions against countries and parties making it worse—and to put it ahead of short-term profits—they need to know why!!
Good idea—I advise a higher amount, spread over more people. Up to 8.
Yep. ‘Give good advice to college students and cross subsidize events a bit, plus gentle pressure via norms to be chill about the wealth differences’ is my best current answer. Kinda wish I had a better one.
Slight (some, if any) nudges toward politics being something that gives people a safety net, so that everyone has the same foundation to fall back on? So that even if there are wealth differences, there aren’t as many large wealth-enabled stresses
Sometime it would be cool to have a conversation about what you mean by this, because I feel the same way much of the time, but I also feel there’s so much going on it’s impossible to have a strong grasp on what everyone is working on.
Yes! I’m writing a post on that today!! I want this to become something that people can read and fully understand the alignment problem as best as it’s currently known, without needing to read a single thing on lesswrong, arbital, etc. V lucky atm, I’m living with a bunch of experienced alignment researchers and learning a lot.
also, happy to just have a call:
Did you see the Shallow review of technical AI safety, 2025? Even just going through the “white-box” and “theory” sections I’m interested in has months worth of content if I were trying to understand it in reasonable depth.
Not properly yet—saw that in the China section, Deepseek’s Speciale wasn’t mentioned, but it’s a safety review, tbf, not a capabilities review. I do like this project a lot in general—thinking of doing more critique-a-thons and reviving the peer review platform, so that we can have more thorough things.
China
DeepSeek Speciale apparently performs at IMO gold level https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Speciale—seems important
Control seems to largely be increasing p(doom), imo, by decreasing the chances of canaries.
It connects you with people who can help you do better in the future.
Yes, it did!! Interviewed a lot of researchers in the prep for this, learnt a lot and met people, some of whom are now on the team and others who are also helping.
It teaches others not just about ways to succeed, but ways to fail they need to avoid.
Yup!! Definitely learnt a lot!
It encourages others to try, both by setting an example of things it is possible to attempt, and by reducing fear of embarrassment.
I hope so! I would like more people in general to be seriously trying to solve alignment! Especially in a way that’s engaging with the actual problem and not just prosaic stuff!
Thank you for all you do! I look forward to your future failures and hopefully future successes!
Thank you so much!! This was lovely to read!
One consideration is that we often find ourselves relying on free-response questions for app review, even in an initial screen, and without at least some of those it would be considerably harder to do initial screening.
Why not just have the initial screening only have one question and say what you’re looking for? So that the ones who happen to already know what you’re looking for aren’t advantaged and able to Goodhart more?
I’m making an AI Alignment Evals course at AI Plans, that is highly incomplete—nevertheless, would appreciate feedback: https://docs.google.com/document/d/1_95M3DeBrGcBo8yoWF1XHxpUWSlH3hJ1fQs5p62zdHE/edit?tab=t.20uwc1photx3
It will be sold as a paid course, but will have a pretty easy application process for getting free access, for those who can’t afford it
Glad to see this! Suggestion: if you have lots of things that you want to screen candidates for, people who value their time aren’t going to want to gamble it on an application that has a high time cost and a low chance of success. There is a way to solve this: splitting it into stages.
E.g. Stage 1
- easy things for candidate—checkboxes, name, email, CV, link to works
Stage 2
- “What do you expect to be the key factors affecting AI progress in the next ~5 years?”
- “Why do you think you’d be a good fit for this fellowship?”
- “What sort of projects/topics would you want to work on?”
Stage 3
- “Please pick ONE of the following questions to answer.”
This saves the candidates time and also makes things better for you, since you get to fast-track people who seem especially promising at earlier stages. You’re also more likely to get such candidates, since they don’t feel their time is as much at risk of being wasted—and having the competence and care to notice this lets them positively update about you.
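As a rough illustration of that fast-tracking logic, here’s a minimal sketch; the stage names, scores, and thresholds are all made up for the example, not anything from an actual application pipeline:

```python
# Hypothetical sketch of staged screening with fast-tracking (names/thresholds made up).
STAGES = ["basics", "short_answers", "pick_one_question"]  # Stage 1, 2, 3

def next_step(candidate_scores, fast_track_threshold=0.9, pass_threshold=0.6):
    """candidate_scores: {stage_name: score in [0, 1]} for stages completed so far."""
    for stage in STAGES:
        if stage not in candidate_scores:
            return f"invite to {stage}"          # next unfinished stage
        score = candidate_scores[stage]
        if score >= fast_track_threshold:
            return "fast-track to interview"     # especially promising: skip later stages
        if score < pass_threshold:
            return "reject with brief feedback"  # don't take more of their time
    return "invite to interview"                 # passed every stage normally

print(next_step({"basics": 0.95}))                       # fast-track to interview
print(next_step({"basics": 0.7, "short_answers": 0.5}))  # reject with brief feedback
print(next_step({"basics": 0.7}))                        # invite to short_answers
```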
i’d prefer to read your own, rougher writing. or at least, to see the conversation, rather than just the output at the end
many such cases
The reasons for this and what I’m going to do to make sure it doesn’t happen again:
One:
- Promised that the first 300 applicants would be guaranteed personalized feedback. Thought that I could delegate to other, more technical members of the team for this.
However, it turned out that in order to give useful feedback and to be able to judge whether someone was a good fit for the program, a person didn’t just need technical knowledge—they needed good communication skills, an understanding of what’s needed for alignment research, consistent availability for several hours a day to actually go through the applications, and an interest in doing so. It turned out that the only person who fit all those characteristics at the time was me. So I couldn’t delegate.
Also, a teammate made reviewing software, which he said would help build a Bradley-Terry model of the applicants as we reviewed them. I had a feeling that this might be overcomplicated, but I didn’t want to say no or react negatively to someone’s enthusiasm for doing free work for something I care about.
It turned out that constantly fixing, trying to improve, and finagling with the software actually took several days, and it was faster to just do the reviewing manually.
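(For context, a Bradley-Terry model just turns pairwise “A seemed stronger than B” judgements into a per-applicant strength score. Here is a minimal sketch of the standard MM fitting update, with made-up applicant names and comparison data—this is not the teammate’s actual software:)

```python
# Minimal Bradley-Terry sketch (illustrative only, not the actual reviewing software).
# Input: pairwise judgements of the form (winner, loser); output: a strength score
# per applicant, fitted with the standard MM (minorization-maximization) update.
from collections import defaultdict

def bradley_terry(comparisons, iterations=100):
    wins = defaultdict(int)          # total "wins" per applicant
    pair_counts = defaultdict(int)   # how many times each unordered pair was compared
    applicants = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        applicants.update((winner, loser))

    strength = {a: 1.0 for a in applicants}
    for _ in range(iterations):
        new_strength = {}
        for a in applicants:
            # MM update denominator: each comparison weighted by 1 / (strength[a] + strength[b]).
            denom = sum(
                n / (strength[a] + strength[b])
                for pair, n in pair_counts.items() if a in pair
                for b in pair if b != a
            )
            new_strength[a] = wins[a] / denom if denom else strength[a]
        total = sum(new_strength.values())
        # Normalize, with a small floor so applicants with zero wins stay divisible.
        strength = {a: max(s * len(applicants) / total, 1e-9)
                    for a, s in new_strength.items()}
    return strength

# Made-up example: reviewers compared three applicants head-to-head.
scores = bradley_terry([("alice", "bob"), ("alice", "carol"), ("bob", "carol")])
print(sorted(scores.items(), key=lambda kv: -kv[1]))  # ranked strongest first
```

The update just repeats “wins divided by weighted comparison counts” until the scores stop moving.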
What I’ll be doing next time to make sure this doesn’t happen:
- Only promising feedback to the first 50 applicants.
- Having pre-prepared lines for the rest, with the general reason they weren’t accepted—e.g. lack of sufficient maths experience, without software engineering/neuroscience/philosophy to compensate, meaning that they would be unlikely to get useful alignment theory work done in 5 weeks.
- Doing things manually, not experimenting with custom software last minute.
- Announcing the program earlier—giving ourselves at least 3 months to prepare things.
Two:
- making the Research Guides for the different tracks turned out to be much, much, much harder than I thought it would be, including for other, more technical teammates. I thought that making just a high-level guide would be relatively OK, but it turned out there was lots and lots of reading to do, plus lots of preliminary reading and maths learning needed to understand that, and it was very hard. This also delayed the start of the Moonshot Alignment Program a lot.
What I’ll be doing next time to make sure that this doesn’t happen:
- Starting out with just a reading list and links to things like https://www.alignmentforum.org/s/n7qFxakSnxGuvmYAX, https://drive.google.com/file/d/1aKvftxhG_NL2kfG3tNmtM8a2y9Q1wFHb/view (source with comments: https://www.lesswrong.com/posts/PRwQ6eMaEkTX2uks3/infra-exercises-part-1), etc
- Personally learning a lot more math
- Having a draft of the reading list
- Asking for help from more experienced alignment researchers, such as Gurkenglas, the_gears_of_ascension, Lorxus, etc., earlier
Major changes I’ve made since:
- brought on a teammate, who is looking to become a co-founder, and who is very competent, well organized, and has a good technical foundation
- learnt more math and alignment philosophy
- brought a much more technical person onto the team (Gurkenglas), who is also teaching me a lot and pointing out lots and lots of flaws in my ideas, updating me fast
- changed my management style at AI Plans—no longer having weekly progress meetings or trying to manage stuff myself on Linear or other task-management tools; instead, just having a one-on-one call with each teammate once a week, to learn about what they want to do, what problems they’re having, and what is available for them to do, and to decide what they’ll do
- moved to CEEALAR, much, much, much (1/2, total of 2.5) better for my stress, anxiety, insecurity, mental health, etc.
- from this, also gotten friendships/contacts with more senior alignment researchers who i can and will ask for help
I am so so sick of bluedot constantly doing false advertising. This behaviour shouldn’t be rewarded.
Daily meditation or reflection practice has something to offer on this front. So does the Quaker practice of silent worship. And so does the Jewish Sabbath.
my family’s daily prayer, which we are meant to have been showered for, where we all pray together, also brings this. taking this as a datapoint to continue it, even while i’m away from them and it’s less convenient. thanks
i think, reduction of abstractions
Maybe reduce the meaning of the ban from “you can’t reply to this person at all” to “you can only post one reply per article”? So you can state your objections, but you can’t stay there and keep interacting with the author. When you are writing the reply, you are notified that this is the only one you get under this article.
could be good, if combined with up to 3 total bans
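If it helps to make the mechanics concrete, here is a minimal sketch of how that combined rule might work; the function names and data structures are hypothetical, not how any actual forum implements bans:

```python
# Hypothetical sketch: soft bans allow one reply per article, hard bans are capped at 3 per author.
from collections import defaultdict

MAX_HARD_BANS_PER_AUTHOR = 3

hard_bans = defaultdict(set)     # author -> users fully banned from their posts
soft_bans = defaultdict(set)     # author -> users limited to one reply per article
reply_counts = defaultdict(int)  # (user, author, article_id) -> replies already posted

def hard_ban(author, user):
    """Author may fully ban at most MAX_HARD_BANS_PER_AUTHOR users."""
    if len(hard_bans[author]) >= MAX_HARD_BANS_PER_AUTHOR:
        return False  # out of hard bans; a soft ban is still available
    hard_bans[author].add(user)
    return True

def can_reply(user, author, article_id):
    if user in hard_bans[author]:
        return False, "You cannot reply to this author's posts."
    if user in soft_bans[author] and reply_counts[(user, author, article_id)] >= 1:
        return False, "You are limited to one reply under this article."
    if user in soft_bans[author]:
        return True, "Note: this is the only reply you get under this article."
    return True, ""

# Example: a soft-banned user gets exactly one reply per article.
soft_bans["author"].add("critic")
print(can_reply("critic", "author", 42))   # allowed, with the one-reply notice
reply_counts[("critic", "author", 42)] += 1
print(can_reply("critic", "author", 42))   # blocked: already used the one reply
```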
We ask the AI to help make us smarter