Running https://aiplans.org
Working full-time on the alignment problem.
Or if anyone has better ideas for how to spread a meme of “don’t roll your own metaethics”[1], please contribute.
Don’t try to make your own religion
This is made even harder because, unlike in cryptography, there are no universally accepted “standard libraries” of philosophy to fall back on.
What about pre-written-word religions that have survived the memetic battle of time?
AI safety researchers already have no leverage
This is really, really false.
Things AI researchers can do (off the top of my head, at 4:08am):
contact local journalists, just talk plainly about their concerns
advise law firms on how to be more successful in suing AI companies. Yes, this affects things; it affects the most important thing in the next few months: money to AI companies
write official-looking reports like https://www.citriniresearch.com/p/2028gic, AI 2027, etc. that make stocks go down and bring the bubble pop sooner
And most of all:
think very hard, for at least 10 minutes, about what they think they know about how AI companies get money and what affects it. Learn from LLMs exactly how AI companies get the money to do large training runs, then think hard, for at least 10 minutes on a clock, about what they can do to reduce this. If you have actually done this, have a Google Doc with notes that you can share to show your thinking, and still think you have no leverage, send it to me, DM me, and I will help. My Signal is kabstastically.07
and yes, obviously, there are things I'm not including here
Those things are also bad; this was more about companies' programs prioritising recruiting people who look good vs. people actually likely to help solve alignment.
I've since changed my mind; this may be happening less than I thought.
And it's more a capability issue than a values problem: orgs not bragging as much about their weird recruits who do well but don't have impressive-looking credentials, and not communicating well about programs/events that don't require prestige to take part in but are still valuable.
No one's done it yet. ControlAI is trying a bit. I plan to as well. And a couple of other orgs are trying a little, but no one is really trying hard yet. I could make guesses as to why, but it would be mostly speculation.
Alex Bores is my current number one recommendation for where to donate to convert dollars into p(doom) reduction.
He’s a New York elected representative, author of the RAISE Act and another AI Safety bill.
Right now, he seems to be the most competent person in the world at getting AI Safety bills written and passed.
A Super PAC sponsored in part by Palantir, where he used to work, is currently running attack ads against him.
Bernie himself is already quite popular and doesn't need that much help, imo. What is needed is to support other people too, so that he's not just one weirdo, but part of an actual large-scale movement with momentum that will win the midterm elections.
This also involves helping make sure that the midterm elections actually happen and are fair and democratic.
status: typed this on my phone just after waking up, after seeing someone ask for my opinion on another trash static-dataset-based eval, and for my general method for evaluating evals. Bunch of stuff here that's unfinished.
working on this course occasionally: https://docs.google.com/document/d/1_95M3DeBrGcBo8yoWF1XHxpUWSlH3hJ1fQs5p62zdHE/edit?tab=t.20uwc1photx3
Probably not. Does the thing actually have multiple possible causes, or just one? An evaluation is, fundamentally, a search process. Searching for a more specific thing, and making your process not ping for anything other than that one specific thing, makes it more useful when searching in something as noisy as AI models. Most evals don't even claim to measure one precise thing at all. For the ones that do: can you think of a way to split that thing up into more precise things? If so, there are probably at least n more things to split it into, with n negatively correlated with how good you are at this.
Order of things about the model that give useful data, from least to most useful:
Outputs
Logits
Activations
Weights
This is also the order of difficulty of measurement, but the data is useful enough that any serious eval engineer (there is one close-to-semi-serious eval in the world at the moment, that I know of) would try to solve this by trying harder.
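The four access levels above can be sketched in a few lines of PyTorch. The model here is a hypothetical two-layer stand-in for an LLM (an assumption for illustration, not any real checkpoint), but the access pattern is the same: outputs are the cheapest to observe, while activations need hooks and weights need full parameter access.

```python
import torch
import torch.nn as nn

# Tiny stand-in model; a real LLM exposes the same four levels of data.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))

# Capture intermediate activations with a forward hook.
activations = {}
def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook
model[1].register_forward_hook(save_activation("relu"))

x = torch.randn(1, 4)
logits = model(x)               # raw pre-softmax scores
output = logits.argmax(dim=-1)  # the "output" a black-box API exposes
weights = model[0].weight       # full parameters, the deepest level
```

Each level strictly contains more information than the one before it: the output is a function of the logits, the logits of the activations, and the activations of the weights plus the input.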
There's one group of people, other than myself and people I've advised, trying red-teaming evals at the moment. Trying to do some red teaming of an eval at all isn't necessarily terrible, but the red teaming will likely be terrible, and they'll be vastly overconfident in the novelty and value of their work from that tiny bit of semi-rigorous-looking motion.
We ask the AI to help make us smarter
what the fuck. somehow, this was outside my current expectations
this may make little to negative sense, if you don’t have a lot of context:
thinking about when I've been trying to bring together Love and Truth. Vyas talked about this already in the Upanishads: “Having renounced (the unreal), enjoy (the real). Do not covet the wealth of any man”. Having renounced lies, enjoy the truth. And my recent thing has been trying to do more of exactly that: enjoying. And ‘do not covet the wealth of any man’ includes ourselves. So not being attached to the outcomes of my work, enjoying it as its own thing. If it succeeds, if it fails, either way I can enjoy the present moment. And this doesn't mean just enjoying things no matter what. I'll enjoy a path more if it brings me to success, since that's closer to Truth, and it's enjoying the Real.
the Real truth that each time I do something, try something, reach out and explore where I'm uncertain, I'm learning more about what's Real. There are words missing here: from me not saying them despite being able to, from me not knowing how to say them, and from me not knowing that they should be here for my Words to be more True. But either way, I have found enjoyment in writing them.
the way you phrased it there seems fine
Too, too much of the current alignment work is not only not useful, but actively bad and making things worse. The most egregious example of this, to me, is capability evals. Capability evals, like any eval, can be useful for seeing which algorithms are more successful at producing optimizers for tasks. In a world where intelligence seems to generalize, this means that every public capability eval, like FrontierMath, EnigmaEval, Humanity's Last Exam, etc., helps AI capability companies figure out which algorithms to invest more compute in, and gives them a way to test new algorithms.
We need a very, very clear differentiation between what precisely is helping solve alignment and what isn’t.
But there will be the response of ‘we don’t know for sure what is or isn’t helping alignment since we don’t know what exactly solving alignment looks like!!’.
Having ambiguity and unknowns due to an unsolved problem doesn’t mean that literally every single thing has an equal possibility of being useful and I challenge anyone to say so seriously and honestly.
We don't have literally zero information, so we can certainly make some estimates and predictions. And it seems like quite a safe prediction to me that capability evals help capabilities much, much more than alignment. I don't think they buy more time for alignment to be solved either; instead, they do the opposite.
To put it bluntly—making a capability eval reduces all of our lifespans.
It should absolutely be possible to make this. Yet it has not been done. We can spend many hours speculating as to why. And I can understand that urge.
But I’d much much rather just solve this.
I will bang my head on the wall again and again and again and again. So help me god, by the end of January, this is going to exist.
I believe it should be obvious why this is useful for alignment being solved and general humanity surviving.
But in case it’s not:
If we want billions of people to take action and do things such as vote for candidates based on whether they're being sensible about AI Safety, you need to tell them exactly why. Do not do the galaxy-brained thing of 'oh, we'll have side channels work, we'll go directly to the politicians instead, we'll trick people with X, Y, Z'. Stop that; it will not work. Speak the Truth, plainly and clearly. The enemies have skills and resources in Lies. We have Truth. Let's use it.
If we want thousands of people to do useful AI Alignment research and usefully contribute to solving the problem, they need to know what it actually is. If you believe that the alignment problem can be solved with less than 1000 people, try this—make a prediction about how many researchers worked on the Manhattan Project. Then look it up.
If we want nation states to ally and make AI Safety a top priority, using it as a reason to impose sanctions and take other serious actions against countries and parties making it worse, and to put it ahead of short-term profits, they need to know why!!
Good idea. I'd advise a higher amount, spread over more people, up to 8.
Yep. ‘Give good advice to college students and cross subsidize events a bit, plus gentle pressure via norms to be chill about the wealth differences’ is my best current answer. Kinda wish I had a better one.
Slight nudges, if any, toward politics being something that gives people a safety net, so that everyone has the same foundation to fall back on? That way, even if there are wealth differences, there aren't as many large wealth-enabled stresses.
Sometime it would be cool to have a conversation about what you mean by this, because I feel the same way much of the time, but I also feel there’s so much going on it’s impossible to have a strong grasp on what everyone is working on.
Yes! I'm writing a post on that today!! I want it to become something people can read and fully understand the alignment problem as best as it's currently known, without needing to read a single thing on LessWrong, Arbital, etc. I'm very lucky at the moment: living with a bunch of experienced alignment researchers and learning a lot.
also, happy to just have a call:
Did you see the Shallow review of technical AI safety, 2025? Even just going through the “white-box” and “theory” sections I’m interested in has months worth of content if I were trying to understand it in reasonable depth.
Not properly yet. I saw that DeepSeek's Speciale wasn't mentioned in the China section, but to be fair, it's a safety review, not a capabilities review. I do like this project a lot in general; I'm thinking of doing more critique-a-thons and reviving the peer review platform, so that we can have more thorough things.
China
DeepSeek Speciale apparently performs at IMO gold level: https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Speciale. Seems important.
Control seems to be largely increasing p(doom), imo, by decreasing the chances of canaries.
It connects you with people who can help you do better in the future.
Yes, it did!! Interviewed a lot of researchers in the prep for this, learnt a lot and met people, some of whom are now on the team and others who are also helping.
It teaches others not just about ways to succeed, but ways to fail they need to avoid.
Yup!! Definitely learnt a lot!
It encourages others to try, both by setting an example of things it is possible to attempt, and by reducing fear of embarrassment.
I hope so! I would like more people in general to be seriously trying to solve alignment! Especially in a way that’s engaging with the actual problem and not just prosaic stuff!
Thank you for all you do! I look forward to your future failures and hopefully future successes!
Thank you so much!! This was lovely to read!
One consideration is that we often find ourselves relying on free-response questions for app review, even in an initial screen, and without at least some of those it would be considerably harder to do initial screening.
Why not just have the initial screening ask only one question and say exactly what you're looking for? That way, applicants who happen to already know what you're looking for aren't advantaged and able to Goodhart more.
Making metaethics but not writing them down, only ever communicating them by talking to people in person, means that every single time they risk being forgotten, dismissed, having holes pointed out, etc. Especially since one will likely be discussing them in person with the same or similar people, meaning that if they don't have anything interesting, useful, or compelling, people will be annoyed at you talking about them over and over again.
A big vulnerability of this would be if one has mediocre-to-mediocre-plus charisma/marketing skills and then starts talking about it on the internet, but that would violate the 'only talk about it in person' rule. An actual vulnerability that doesn't break the rule is running into easily persuadable people, especially those who might share it with others and make permanent records, which then make it harder for you to forget your metaethics.
So optimum would be discussing it only in person with others who you can trust will never make a record of it and also be continually skeptical.