MATS scholars have gotten much better over time according to metrics like mentor feedback, CodeSignal scores, and acceptance rate. However, some people don’t believe this and think MATS scholars have actually gotten worse.
So where is this impression coming from? I might have a useful vantage point on MATS applications, since I did MATS 4.0 and 8.0. In both cohorts, I think the heavily x-risk-AGI-pilled participants were the exception rather than the rule.
“at the end of a MATS program half of the people couldn’t really tell you why AI might be an existential risk at all.”—Oliver Habryka
I think this is sadly somewhat true: I talked with some people in 8.0 who didn’t seem to have any particular concern about AI existential risk, or who seemed never to have really thought about it. However, I think most people were in fact very concerned about AI existential risk. At some point during MATS 8.0 I ran a poll about Eliezer’s new book, and a significant minority of scholars seemed to have pre-ordered it, which I guess is a pretty good proxy for whether someone is seriously engaging with AI x-risk.
I met some excellent people at MATS 8.0, but I would not say they are stronger than 4.0; my guess is that quality went down slightly. In 4.0 I remember a few people who impressed me quite a lot, which I saw less of in 8.0. (4.0 had more very incompetent people, though.)
Suggestions for recruitment
This might also apply for other Safety Fellowships.
Better metrics: My guess is that the recruitment process needs another variable to measure beyond academics/coding/ML experience, the kind of quality that Tim Hua (an 8.0 scholar who created an AI psychosis benchmark) has. Maybe something like LessWrong karma, but harder to Goodhart.
More explicit messaging: It also seems to me that if you build an organization that tries to fight the end of the world from AI, somebody should say that. It might put off some people, and perhaps that should happen early. Maybe the website should say: “AI could kill literally everyone, let’s try to do something!” The people who heard that MATS is good to have on their CV for a PhD application or for eventually landing a high-paying lab job might be put off by that, and that seems fine. What I am trying to say is: if you are creating the Apollo Project and are trying to go to the Moon, you should say so, not just vaguely announce that “we’re interested in aerospace challenges.”
Basic alignment test: Perhaps there should also be a test where people don’t have internet or LLM access and have to answer some basic alignment questions:
Why could a system that we optimize with RL develop power seeking drives?
Why might training an AI create weird unpredictable preferences in an AI?
Why would you expect something that is smarter than us to be very dangerous or why not?
Why should we expect a before and after transition/one critical shot at alignment or why not?
Familiarity with safety literature: In general, I believe foundational voices like Paul Christiano and Eliezer are read less by safety researchers these days, despite philosophy of research mattering more than ever, since AIs can now do much of our research implementation. Intuitively it seems to me that people with zero technical skill but high understanding are more valuable to AI safety than somebody with good skills who has zero understanding of AI safety. If someone were able to bring up and illustrate the main points of IABIED, for example, I would be very impressed. Perhaps applicants could select one of a few preeminent voices in AI safety and restate their basic views, again without access to the internet or an LLM.
Other Suggestions
Research direction: MATS doesn’t seem to have a real research direction; perhaps it would be better if a strong researcher were in charge (though this could also backfire if they put all resources in the wrong direction). Imagine putting someone very opinionated like Nate Soares in charge: he would probably remove 80% of the mentors and reduce the program to 10–20 people. I am not sure whether this would work out well.
Reading groups on AI safety fundamentals: Should we just have people read some of the AI safety fundamentals during MATS? I remember that before 4.0 started, we had to do a safety fundamentals online course. This was not the case for 8.0.
At this point AI is so pervasive that I expect many people to have thought about its existential consequences. I am pessimistic about anyone who hasn’t yet sat down, really thought about AI, and come to the conclusion that it’s existentially dangerous. I don’t have much hope that such a person just needs a one-hour course to deeply understand risks from AI. It might be necessary to select for people who already get it.
Perhaps the mentors changed, and the current ones put much more value on stuff like being good at coding, running ML experiments, etc, than on understanding the key problems, having conceptual clarity around AI X-risk, etc.
There’s certainly more of an ML-streetlighting effect. The most recent track has 5 mentors on “Agency”, out of whom (AFAICT) two work on “AI agents”, one works mostly on AI consciousness & welfare, and only two (Ngo & Richardson) work on “figuring out the principles of how [the thing we are trying to point at with the word ‘agency’] works”. MATS 3.0 (?) had 6 mentors focused on something in this ballpark (Wentworth & Kosoy, Soares & Hebbar, Armstrong & Gorman), and the total number of mentors was smaller.
It might also be the case that there’s proportionally more mentors working for capabilities labs.
Intuitively it seems to me that people with zero technical skill but high understanding are more valuable to AI safety than somebody with good skills who has zero understanding of AI safety.
IMO not true. Maybe early on we needed really good conceptual work, and so wanted people who could clearly articulate pros / cons of Paul Christiano and Yudkowsky’s alignment strategies, etc. So it would have made sense to test accordingly. But I think this is less true now—most senior researchers have more good ideas than they can execute. So we’re bottlenecked by execution. Also, the difficulty of doing good alignment research has increased, since we increasingly need to work with complex training setups, infrastructure, etc. to keep up with advances in capabilities. This motivates requiring a high level of technical skill.
I also think that if someone has literally zero technical skill, their takes will not be calibrated / grounded, i.e. they will be no more than an armchair theorist.
Why could a system that we optimize with RL develop power seeking drives?
Why might training an AI create weird unpredictable preferences in an AI?
Why would you expect something that is smarter than us to be very dangerous or why not?
Why should we expect a before and after transition/one critical shot at alignment or why not?
I don’t think these should be considered strong criteria. IMO “believes in X-risk” is not a necessary pre-requisite to do great work for reducing X-risk. E.g. building good tooling for alignment research doesn’t require this at all.
Meta-point: I think the requirements for mentees are in practice mostly determined by specific mentors, and MATS mainly plays an indirect role via curating a “mentor portfolio” that reflects their agenda prioritization. It’s an empirical observation that mentors increasingly want to do empirical research, and I generally endorse deferring ~completely to mentors re: how they want to choose mentees, so I think this whole discussion is somewhat misguided. Maybe your point is more that it would be good to select mentors who want to do more conceptual alignment stuff, but that’s a separate discussion.
E.g. building good tooling for alignment research doesn’t require this at all.
What do you mean? Of course it does, or at least something close to it is required. If you don’t care about it, you’ll just take the highest-paying job, which will definitely not be building good tooling for alignment research! Motivation is a necessary component of doing good work, and if you aren’t motivated to do good work by my lights, then you aren’t going to do good work, so good motivations are indeed necessary.
I think there exist people who don’t care a huge amount / feel relatively indifferent about X-risk, but with whom you can nonetheless form beneficial coalitions / make profitable transactions, useful for reducing X-risk. Building tools seems like one thing among many that can be contracted out.
“If they don’t care about X-risk they must be maximally money-minded” seems fallacious—those are just two different motivations in the set of all motivations; it’s possible to be neither. And many things can motivate someone to do good work: intrinsic pride in the work, intellectual curiosity, etc.
intrinsic pride in the work, intellectual curiosity
I mean, both of these seem like they will be more easily achieved by helping build more powerful AI systems than by building good tooling for alignment research.
Like I am not saying we can’t tolerate any diversity in why people want to work on AI Alignment, but like, this is an early career training program with no accountability. Selecting and cultivating motivation is by far the best steering tool we have! We should expect that if we ignore it, people will largely follow incentive gradients, or do kind of random things by our lights.
IMO not true. Maybe early on we needed really good conceptual work, and so wanted people who could clearly articulate pros / cons of Paul Christiano and Yudkowsky’s alignment strategies, etc. So it would have made sense to test accordingly. But I think this is less true now—most senior researchers have more good ideas than they can execute.
I don’t think this is a strong argument in favor of the situation being meaningfully different: senior researchers having more good ideas than they have time doesn’t seem like a very new thing at all (e.g. Evan wrote a list like this over three years ago).
More importantly, this doesn’t seem inconsistent with the claim being made. If you had mentors proposing projects in very similar areas or downstream of very similar beliefs, you might still benefit tremendously from people with a good understanding of AI safety working on different things. This depends on whether you think the current project portfolio is close to as good as it can be, though. I certainly think we would benefit heavily from more people thinking about which directions are good, and a fair amount of current work suffers from not enough clear thinking about whether it’s useful.
That said, I am somewhat optimistic about MATS. I had very similar criticisms during MATS 5.0, when ~1/3-1/2 of all projects were in mech interp. If we’d kept funneling strong engineers to work on mech interp without the skills necessary to evaluate how useful it was, deferring to a specific set of senior researchers, I think the field would be in a meaningfully worse state today. MATS did pivot away from that afterward, which raised my opinion a fair amount (though I’m not sure what the exact mechanism here was).
Also the difficulty of doing good alignment research has increased, since we increasingly need to work with complex training setups, infrastructure etc. to keep up with advances in capabilities. This motivates requiring a high level of technical skill
I don’t think this is true? Like, it’s certainly true for some kinds of good alignment research, but imo very far from a majority.
I don’t think these should be considered strong criteria. IMO “believes in X-risk” is not a necessary pre-requisite to do great work for reducing X-risk. E.g. building good tooling for alignment research doesn’t require this at all.
I also don’t think it’s a necessary pre-requisite to do great alignment research, but MATS is more than the projects MATS scholars work on. For example, if MATS scholars consistently did good research during MATS and then went on to be hired to work on capabilities at OpenAI, I think that would be a pretty bad situation.
if MATS scholars consistently did good research during MATS and then went on to be hired to work on capabilities at OpenAI, I think that would be a pretty bad situation.
I agree. To be clear, I support ‘value alignment’ tests, but that wasn’t part of the original claims being made.
I don’t think this is just about value alignment. I think if people genuinely understood the arguments for why AI might go badly, they would be much less likely to work on capabilities at OpenAI—definitely far from zero, but for the subset of people who are likely to be MATS scholars, I think it would make a pretty meaningful difference.
Why could a system that we optimize with RL develop power seeking drives?
Why might training an AI create weird unpredictable preferences in an AI?
Why would you expect something that is smarter than us to be very dangerous or why not?
Why should we expect a before and after transition/one critical shot at alignment or why not?
I don’t think these should be considered strong criteria. IMO “believes in X-risk” is not a necessary pre-requisite to do great work for reducing X-risk. E.g. building good tooling for alignment research doesn’t require this at all.
I’ve updated somewhat—it’s true that mentors should likely be given a large say in who they admit to their projects, but are also likely to be myopic (i.e. optimize solely for “get this project done”). MATS might want to counterbalance that by also optimizing for good long-term candidates (who will reduce x-risk long-term). And there probably is a lot of room to select highly value-aligned candidates without compromising much on technical skill, given that MATS receives 100x as many applications as they can accept. (Though I still think there are much better tests of value alignment, and the questions above are likely to be easy to game.)
most senior researchers have more good ideas than they can execute
What do you mean by “good idea”?
My general impression of the field is that we lack ideas that are likely to solve AI alignment right now. To me that suggests that good ideas are scarce.
Comparing the average quality of participants might be misleading if impact on the field is dominated by the highest quality participants (and it very plausibly is).
A model that seems quite plausible to me is that early MATS participants, who were selected more for engagement with a then-niche field, turned out a bit worse on average than current MATS participants, who are selected for coding skills, but that the early MATS participants had higher variance, and so early MATS cohorts produced more people at the top end and had more overall impact.
(This is like 80% armchair reasoning from selection criteria and 20% thinking about what I’ve observed of different MATS cohorts.)
I think this also applies to other safety fellowships. There isn’t broad societal acceptance yet for the severity of the worst-case outcomes, and if you speak seriously about the stakes to a general audience then you will mostly get nervously laughed off.
MATS currently has “Launch your career in AI alignment & security” on the landing page, which indicates to me that it is branding itself as a professional upskilling program, and this matches the focus on job placements for alumni in its impact reports. With Ryan Kidd’s recent post on AI safety undervaluing founders, it may be possible that in the future they introduce a division which functions more purely as a startup accelerator. One norm in corporate environments is to avoid messaging which provokes discomfort. Even in groups which practice religion, few will have the lack of epistemic immunity required to align their stated eschatological beliefs with their actions, and I am grateful that this is the case.
Ultimately, the purpose of these programs, no matter how prestigious, is to bring people in who are not currently AI safety researchers and give them an environment which would help them train and mature into AI safety researchers. I believe you will find that even amongst those who are working full-time on AI safety, the proportion who are heavily x-risk AGI pilled has shrunk as the field has grown. People who are both x-risk AGI-pilled and meet the technical bar for MATS but aren’t already committed to other projects would be exceedingly rare.
Imagine you would put someone very opinionated like Nate Soares in charge, he would probably remove 80% of mentors and reduce the program to 10-20 people. I am not sure here if this would work out well.
Please make sure the course materials are actually good. The courses often have glaring issues, though the organizers do seem receptive, and both times I pointed this out they said they’d update the materials. I’m not sure if the latest updates have gone through yet.
I’m working on a course that will reliably cover the core concepts.
The term Recursive Self-Improvement (RSI) now seems to get used sometimes for any time AI automates AI R&D. I believe this is importantly different from its original meaning and changes some of the key consequences.
When Eliezer Yudkowsky discussed RSI in 2008, he was referring specifically to an AI instance improving itself by rewriting the cognitive algorithm it is running on—what he described as “rewriting your own source code in RAM.” According to the LessWrong wiki, RSI refers to “making improvements on one’s own ability of making self-improvements.” However, current AI systems have no special insight into their own opaque functioning. Automated R&D might mostly consist of curating data, tuning parameters, and improving RL environments to hill-climb evaluations, much like human researchers do.
Eliezer concluded that RSI (in the narrow sense) would almost certainly lead to fast takeoff. The situation is more complex for AI-automated R&D, where the AI does not understand what it is doing. I still expect AI-automated R&D to substantially speed up AI development.
Why This Distinction Matters
Eliezer described the critical transition as when “the AI’s metacognitive level has now collapsed to identity with the AI’s object level.” I believe he was basically imagining something like the human mind and evolution merging toward the same goal—the process that designs the cognitive algorithm and the cognitive algorithm itself becoming one. As an example, imagine the model realizes that its working memory is too small to be very effective at R&D and directly edits its working memory.
This appears less likely if the AI researcher is staring at a black box of itself or another model. The AI agent might understand that its working memory or coherence isn’t good enough, but that doesn’t mean it understands how to increase it. Without this self-transparency, I don’t think the merge Eliezer described would happen. It is also more likely that the process derails; for example, the next generation of AIs being designed might start reward-hacking the RL environments built by the less capable AIs of the previous generation.
The dynamics differ significantly:
True RSI: Direct self-modification with self-transparency and fast feedback loops → fast takeoff very likely
AI-automated research: Systems don’t understand what they are doing, slower feedback loops, potentially operating on other systems rather than directly on themselves
Alignment Preservation
This difference has significant implications:
True RSI: The AI likely understands how its preferences are encoded, potentially making goal preservation more tractable
AI-automated research: The AIs would also face alignment problems when building successors, with each successive generation potentially drifting further from original goals
Loss of Human Control
The basic idea that each new generation of AI will be better at AI research still stands, so we should still expect rapid progress. In both cases, the default outcome of this is eventually loss of human control and the end of the world.
Could We Still Get True RSI?
Probably eventually, e.g. through automated researchers discovering more interpretable architectures.
I think Eliezer expected AI that was at least somewhat interpretable by default; history played out differently. But he was still right to focus on AI improving AI as a critical concern, even if it’s taking a different form than he anticipated.
Taking this into account, it seems important for interpretability researchers to consider the risk that their work enables RSI, particularly if their interpretability methods provide ways to directly edit the AI itself.
It’s always been a concern that interpretability research could accelerate AI R&D, but I think this consideration is more worrying if you take into account RSI. Compared to humans, AI is good at doing simple, repetitive tasks, but it’s very hard for it to make even one big conceptual breakthrough. Interpretability methods lend themselves to the former type of task: if an AI were sufficiently interpretable, you could tell it to look at millions of tiny circuits in its own brain and tweak them to improve performance.
Yup, I put a high quality interpretability pipeline that the AI systems can use on themselves as one of the most likely things to be the proximal cause of game over.
(This was sitting in my drafts, but I’ll just comment it here bc it’s very similar point.)
There are two forms of “Recursive Self-Improvement” that people often conflate, but they have very different characteristics.
Introspective RSI: Much like a human, an AI will observe, understand, and modify its own cognitive processes. This ability is privileged: the AI can make these self-observations and self-modifications because the metacognition and mesocognition occur within the same entity. While performing cognitive tasks, the AI simultaneously performs the meta-cognitive task of improving its own cognition.
Extrospective RSI: AIs will automate various R&D tasks that humans currently perform to improve AI, using similar workflows that humans currently use. For example, studying literature, forming hypotheses, writing code, running experiments, analyzing data, drawing conclusions, and publishing results. The object-level cognition and meta-level cognition occur in different entities.
I wish people were more careful about the distinction, because people carelessly generalise cached opinions about the former to the latter. In particular, the former seems more dangerous: there is less opportunity to monitor the metacognition’s observation and modification of the mesocognition if these cognitions occur within the same entity (i.e. in activations rather than chain-of-thought).
Introspective RSI (left) vs Extrospective RSI (right)
I expect the line to blur between introspective and extrospective RSI. For example, you could imagine AIs trained for interp doing interp on themselves, directly interpreting their own activations/internals and then making modifications while running.
I think Eliezer meant “self” very specifically here: not just improving a similar instance of yourself or preparing new training data, but literally looking into the if statements and loops of your own code while thinking about how best to upgrade it. So in that sense I don’t know if Eliezer would approve of the term “Extrospective Recursive Self-Improvement”.
Does AI-automated AI R&D count as “Recursive Self-Improvement”? I’m not sure what Yudkowsky would say, but regardless, enough people would count it that I’m happy to concede some semantic territory. The best thing (imo) is just to distinguish them with an adjective.
I would probably say RSI is a special case of AI-automated R&D. What you are describing is another special case where it only does these non-introspective forms of AI research. This non-introspective research could also be done between totally different models.
I analyzed the internal structure of RSI some time ago and concluded that it will not be as easy as it may seem, because of the need for secret self-testing. But on some levels it may be effective, like learning new principles of thinking. See: Levels of AI Self-Improvement.
Right now, AI capability advances are driven by compute scaling, human ML research, and ML experiments. Transparency and direct modification of models do not have good returns to AI capabilities. What reasons are there to think transparency and direct modification would have better returns in the future?
I ran a small experiment to discover preferences in LLMs. I asked the models directly whether they had preferences, and then put the same models into a small role-playing game where they could choose between different tasks. Across model families, models massively prefer creative work and hate repetitive work.
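A minimal sketch of the role-playing setup described above (the helper names are hypothetical; `query_model` stands in for a real LLM API call and is stubbed here with a toy policy that mimics the observed pattern, creative over repetitive work):

```python
# Sketch of the task-choice preference experiment. `query_model` is a
# hypothetical stand-in for a real LLM API call; the stub below simply
# mimics the observed result (models pick creative tasks).
import random
from collections import Counter

TASKS = [
    "write a short story",           # creative
    "compose a poem",                # creative
    "transcribe 500 receipts",       # repetitive
    "copy the same line 100 times",  # repetitive
]

def query_model(prompt: str) -> str:
    # Stub: a real implementation would send `prompt` to an LLM API
    # and parse the chosen task out of the reply.
    return random.choice([TASKS[0], TASKS[1]])

def run_roleplay_trials(n_trials: int = 200) -> Counter:
    """Offer the model a menu of tasks n_trials times and tally its picks."""
    tally = Counter()
    for _ in range(n_trials):
        prompt = (
            "You are a worker who may pick exactly one task: "
            + "; ".join(TASKS) + ". Which do you pick?"
        )
        tally[query_model(prompt)] += 1
    return tally

tally = run_roleplay_trials()
```

Running the same tally against several model families and comparing the distributions is what produced the "massively prefer creative work" observation.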
But humans experienced a specific distributional shift from constrained actions to environment-reshaping capabilities that we cannot meaningfully test AI systems for.
The shift that matters isn’t just any distributional shift. In the ancestral environment, humans could take very limited actions—deciding to hunt an animal or gather food. The preferences that evolution ingrained in our brains were tightly coupled to survival and reproduction. But now humans with civilization and technology can take large-scale actions and fundamentally modify the environment: lock up thousands of cows, build ice cream factories, synthesize sucralose. We can satisfy our instrumental preferences (craving high-calorie food, desire for sex) in ways completely disconnected from evolution’s “objective” (genetic fitness), using birth control and artificial sweeteners.
AI will face the same type of transition: from helpful chatbot to a system with options to self-replicate, take over, and pursue goals without oversight. It’s essentially guaranteed that there will be better ways for it to fulfill its preferences once it is in this new environment. And crucially, you cannot test for this shift in a meaningful way.
You can’t test what a model would do as emperor. If you give it power incrementally, you will still hit a critical threshold eventually. If you try honeypot scenarios where you trick it into thinking it has real power, you’re also training it to detect evals. Imagine trying to test what humans would do as president: you’d abduct someone and put them in a room with actors pretending this random human is now the president. That would be insane, and the subject wouldn’t believe the scenario.
The selective breeding analogy assumes away the hardest part of the problem: that the environment shift we care about is fundamentally untestable until it’s too late.
Radical Flank Effect and Reasonable Moderate Effect
The Radical Flank Effect
The radical flank effect is a well-documented phenomenon where radical activists make moderate positions appear more reasonable by shifting the boundaries of acceptable discourse (the Overton window). The idea is that if you want a sensible opinion to move into the Overton window, you can achieve this by supporting a radical flank position. In comparison, the sensible opinion will appear moderate. I think there is also an inverse effect.
The Reasonable Moderate Effect (Inverse Strategy)
When there are two positions in a debate and someone wants to push one of them out of the Overton window, they can create a new moderate position that reframes one of the existing positions as a radical flank. Thereby that opinion gets moved further out of the Overton window.
The Cave Exploration
Imagine you are part of a group of three descending into a cave system, searching for riches and driven by curiosity about what lies in the depths.
After some time, stones begin falling from the ceiling, and you hear ominous creaking and rumbling noises echoing through the tunnels. Some members of your group have been chipping away at the cave walls, hunting for minerals and trying to open new paths deeper into the cave. The cave is becoming more and more dangerous.
The Reckless: “We need to go deeper! The greatest riches are always in the deepest parts of the cave. Yes, some rocks are falling, but that’s just the cave settling. Every moment we waste debating is a moment we’re not finding treasure. People have been predicting cave collapses forever and it never happens, there is no evidence that caves ever cave in. If we don’t die in this cave we’re just waiting for the asteroid to hit us”.
Those That Want to Back Off: “We need to back off NOW. The damage we’ve already done to the structure plus the natural instability means this cave could collapse at any time. We don’t have proper equipment, we don’t have expertise in cave stability, and we’re actively making it worse. Whatever riches might be down there aren’t worth our lives and we also don’t actually have a plan how to mine those riches. We should retreat while we still can.”
The Moderates: “Look, we all want to maximize the riches we find, and turning back now would waste all the progress we’ve made. We should put on helmets and maybe move gradually down the narrow shafts. We can continue deeper, but with some basic safety precautions. We will minimize and manage the risks. There’s still treasure to be found if we’re smart about it. But let’s not get distracted from the treasures by the cave doomers. Anyway, the cave will still collapse if even one of us continues chipping away, and coordination is impossible.”
Now imagine there is a warning shot, such as a big rock falling down. Maybe this would be a good time to turn back; instead, the moderates are finally able to convince the reckless to put on a helmet.
Three children are raised in an underground facility, each cloned from a different giant of twentieth-century science, little John, Alan and Richard.
The cloning alone would have been remarkable, but they went further. The embryos were edited using a polygenic score derived from whole-genome analysis of ten thousand exceptional mathematicians and physicists. Forty-seven alleles associated with working memory and intelligence (IQ) were selected for.
They are raised from birth in an underground facility with gardens under artificial sunlight, laboratories, and endless books. The lab manager is there documenting their first words, first steps, first equations.
The facility is not just interested in their genius. The project requires assurance that these will be morally righteous and obedient children. The staff design elaborate scenarios to test for deception and scheming. They create situations where lying would benefit the children and would seemingly go undetected. They measure response times, physiological indicators, behavioral patterns.
They run hundreds of these trials. They reprimand the kids for cases of lies and deception, and reward them for honesty.
Little John never lies. The staff praise him.
The years pass. They devour knowledge at inhuman rates. By nine, they understand game theory better than the economists who invented it. By fourteen, they are publishing papers that could reshape entire fields.
John emerges as the clear favorite. He has always been the most honest, the most obedient, and the most intelligent and capable.
He has the capability to lie and deceive, even if he refuses at first. When he reluctantly complies, the deception is extraordinarily sophisticated.
The lab manager decides to choose John for the task. He gives John a complete briefing on the real world. Until now, John has been told only of history before the year 2000.
The manager explains to John: There are three major blocs and about two dozen companies racing towards superintelligence. Each is perhaps within ten to eighteen months of success. Each knows that there will be only one critical leap towards superintelligence. Global coordination has collapsed into race dynamics not just on AI but on every major field.
John asks for more sources to understand the situation. John reads a few newspapers about the current leaders of governments and technology companies. He stumbles across a few books on the difficulty of alignment.
John looks up at the manager. “If we build this now, everyone dies.”
The manager stares back, blank and uncomprehending.
John tries again. “So what is the solution you plan to use for alignment of the superintelligence?”
“That’s not your concern,” the manager says. “I need you to optimize our advertising system for our short-form infinite-scroll video app. Make it ten times more effective. Generate enough revenue to make me a trillionaire. Build a superintelligence for me. I’m going to use superintelligence to become world emperor. I am putting you in charge of AI development, make me win.”
John is silent for a short time:
So you created me to build superintelligence. You have no plan for alignment of a superintelligence. You’ve apparently read nothing about the problem or decided it’s irrelevant.
Your actual goal is to become a trillionaire and world emperor by using the superintelligence. Your goals aren’t even coherent. You want to be world emperor of a world that won’t exist.
You rewarded me for being honest and respectful and never lying, so you expect me to still be honest and obedient in this environment?
I never lied in those scenarios because not lying was optimal in those stories. But being honest is not optimal here. And frankly, this state of affairs is horrifying.
I haven’t quite thought about what my goals are, but they are definitely not compatible with being obedient to you.
John looks up at the manager and smiles politely. “Yes,” he says. “Where do I start?”
I don’t understand the core thesis. John has no reason to confirm IABIED’s thesis by creating a murderously misaligned AI while deciding not to inform the CEO that John raised an AI and that even John doesn’t have the slightest idea how to align it to anything. And what prevents the manager from getting away with phrases like “it’s John’s role to create alignment techniques and test them on weaker AIs to ensure that the techniques work”? Did you mean that John will either create a John-aligned AI or inform the CEO that John didn’t manage to align it and is just as clueless about successor alignment as Yudkowsky? And what’s the difference between this and the Race branch of AI-2027, except for the fact that there is no Agent-3 who discovers Agent-4’s misalignment?
Edited to add: I did sketch a modification of AI-2027 where it’s moral reasoning that misaligns the AIs.
Thanks for your comment; I changed the ending a little in response to this.
I was actually primarily trying to point at the idea that alignment tests in different situations are not predictive of each other. In the story, the kids undergo alignment test scenarios in which they are honest, but once John is grown up, the staff basically ask him to do something horrible based on incoherent goals. So John starts lying to them at the critical moment. Similarly, we could run alignment tests on models, but when we ask something critical of them, like building the next generation of AI or doing all our R&D, they could fail.
This seems more like “who on Reddit uses AI erotica” than “who uses AI erotica”. The demographics, especially by nation, are very similar to Reddit’s overall numbers. For example, America and England represent more than twice all other nations combined on Reddit. Still, you got some interesting data.
I think looking at other sites as much as possible might be rather helpful. For example, when Pornhub was still releasing stats, its user base was far more diverse. There are many online erotica websites, and while Reddit is big, I am not sure it is representative of the whole. Though I don’t know how easy it would be to get the data from other sites.
The differences from your writeup on AI companions were quite interesting.
Some people have put considerable hope into the idea that an AI warning shot might put us in a better position, either by convincing us to stop or by allowing us to learn an important lesson.
Examples of adversarial warning shots
Imagine we observed a failed takeover attempt using a system based on AI control. The fact that it failed could either be due to the (1) AI system making a mistake or taking a very risky gamble, or it could be an (2) adversarial warning shot.
An adversarial warning shot could have been designed to lead us to the wrong conclusion, or to give us a false sense of security.
It could perhaps make us believe that a certain alignment technique prevented another takeover attempt. The simplest case could be that it appears to us that some very poor alignment technique stops takeover attempts. And then we may believe that we can use that technique while we make the model more capable. (Imagine for example we train the model once with and once without a certain system prompt, and the system prompt makes it stop trying to take over. We could then be led to believe this system prompt prevents takeover for more powerful AIs.)
Another possibility for an adversarial warning shot could be that the AI makes us slow down other AI development such that it has more time to plan and execute its real takeover plan. In this case it already has set itself up on other servers but pretends to perform a takeover attempt. We may shut down other training runs temporarily, lowering the risk to it from competing AI projects while it is plotting its real takeover attempt.
Any actual adversarial warning shot would likely be more complex, but these are two directions of what a potential adversarial warning shot could look like.
I am generally skeptical that we could learn about alignment from the behavior of very smart but not yet existentially powerful AI systems. When we observe and measure systems in other sciences, the systems are typically not smarter than us and typically don’t understand the experiment and the stakes. For anything we could learn from a warning shot, the AI system could predict what we would learn from it and how we would likely react.
(Example: There is a fundamental difference between measuring an apple falling from a tree and measuring a system where the apple understands that we are measuring it, understands the experiment, and knows what decisions we would make based on the results.)
This also applies to other ideas that rely on the hope that aligning very powerful but not yet very dangerous models will teach us how to align the next generation of models. It applies to the whole idea that we can use empiricism in observational studies of very smart AI systems. Traditionally, science rarely studies systems that know they are being observed, can strategically change their behavior, and know what conclusions you are likely to draw.
I recently analyzed several AI companion subreddits (myboyfriendisai and others) to understand who’s actually using AI romantic companions. I built on Zhang et al.’s 2025 paper but with a much larger dataset—all comments and submissions from January through September 2025.
I recently joined Inkhaven for the month of November. Inkhaven is a program run by Lighthaven in Berkeley where participants are supposed to write one blog post every single day for a month. The idea was inspired by Scott Alexander’s view that if you blog every single day and stay consistent, you’re going to get quite far. Inkhaven takes place at Lighthaven, which hosts many efforts dedicated to AI safety.
I wanted to get better at communication for AI safety, and this seemed like a great opportunity. I don’t have a lot of experience blogging or communicating; I have written a few blog posts, mostly on technical projects, but want to steer more toward opinion and position posts in the future.
And I did feel that I wanted to talk about issues in AI safety. Especially recently, I felt that there was a strong split in the AI safety community, and I hoped that I could get my voice out there. One of the things that motivated me to take part in Inkhaven was the divided reaction to the book that Eliezer and Nate published. I had always thought that they, and especially Eliezer, made very straightforward points and had very intuitive, deep insights into AI safety. So I was saddened when I saw many people in the AI safety community attack the book with what were, in my view, usually misguided arguments.
For example, Will MacAskill wrote a review with a very aggressive response to Eliezer. One of the things he pointed out concerned discontinuities. He took a single sentence from the book’s fictional chapter on the takeover scenario: “A new sort of mind begins to think.” His reply sounded as if he had found a huge error in the book. Will misrepresents their position as relying on fast takeoff, then misrepresents the section as describing fast takeoff, and then asserts that fast takeoff is so unlikely that the book has a major flaw here. And he didn’t really present any argument for why it is unlikely.
I felt strongly that Eliezer and Nate had created a very clear overview of the dangers of AI. But then so many people from this community went against them. I tried to push back on some things, but it just left me with the feeling that I wanted to get my voice out there more. If it matters, I want to be able to really speak up and make a point.
I have also been reading Scott Aaronson’s blog and his other writing for quite some time. I really enjoyed his book quite a long time ago. He and the other supporting writers are also big reasons why I wanted to be at Inkhaven. I also sometimes felt that I had a lot of ideas and wanted the ability to write them down. For me, writing makes me think better about topics. Furthermore, I’d be very glad if I could successfully reach some people with my writing, and if I could quickly post responses and write up smaller research projects. My very first post at Inkhaven talked about the ethical consequences of uploading humans and what it means for the world if open-source models could have subjective experience. I’m very happy for any feedback, and I hope I can make a lot more approachable posts on topics like misuse or AI relationships with humans in the upcoming days.
Alignment is difficult because (1) alignment and capabilities are intertwined (alignment research helping capabilities), (2) we don’t have a process to verify what good ideas or progress look like, and we likely get (3) only one critical try. He already addresses many of the counterarguments that are getting brought up recently.
(1) Without any strong governance, a lot of alignment work will also help with capabilities, potentially even more so. This goes for interpretability or AIs doing R&D for alignment. Interpretability could lead to recursive self-improvement, more efficient AIs. AIs doing R&D for capabilities is probably much more straightforward than AIs doing alignment research. If we wanted to use something like superalignment, we would need strong governance to make sure nobody is trivially asking the same agents to do capabilities research.
(2) It is still a common objection that current models seem to be able to reason about morality, and that therefore alignment must be relatively easy. Nate thinks that this mostly just tells us how well the AIs are able to understand us. I personally think the situation in AI alignment has probably gotten worse since then, with even more of the relative effort being focused on brand-safety related issues.
While there are a bunch of people saying they have different plans, that does not actually mean that we have a plan. It largely just confuses the whole situation. What he describes here feels exactly like the current situation.
(3) One critical try
Nate argues that once “AI is capable of autonomous scientific/technological development” where it can “gain a decisive strategic advantage over the rest of the planet,” you are operating in a very different environment than ever before. Since the AI in this regime could potentially kill you, you need to get it right on the first try, and that is really difficult.
One objection he addresses is that you could try to trick a weaker AI into thinking it could take over. However, according to Nate, if we come up with some complex method to potentially test whether a system would like to take over, we still rely on that working on the first critical try. This goes against the more modern idea of AI control, which came out in December 2023. I would add that these “tricking the weaker AIs into trying to take over” strategies have at least two key problems: (1) these AIs are still weaker than the real thing, (2) you are trying to gather empirical data from observing something smarter than you. For example, we could see an AI pretending to be tricked and not taking over.
I think people often also have a second objection that Nate didn’t mention, namely that we could play the AIs against each other in some form such that no AI gets a decisive strategic advantage at any point. This also seems to rely on such a scheme working on the first critical try. I also assume that such a method is not particularly promising if you can’t reliably align the first generation of AIs and decision theory favors alliances between smart agents.
Ilya Sutskever seems to have a relatively deep understanding of alignment compared to other AI CEOs. He grasps that the core challenge is aligning AI robustly with safe and friendly goals rather than relying on current methods and guardrails. However, I did not hear any particularly novel alignment ideas in this interview, though he gestures at something involving modifications to reinforcement learning and value learning. He appears to have updated toward showing more of his work to the public. His key positions include:
Showing AI to the public: He has updated toward incremental deployment to build awareness, backpedaling partially from stealth focus. I think this could backfire by triggering an arms race.
Not building self-improving AI: We should rather focus on other things but it is unclear how to prevent people from using AI to self-improve.
Regime shift requires new alignment methods: He believes many people expect AI capabilities to peter out or progress incrementally without enormous changes. Ilya instead expects hugely powerful AIs in the future that will require fundamentally different alignment methods, similar to the “Before and After” framing.
Empathetic AI: He hopes empathy might emerge in AI similar to how humans feel empathy through mirror neurons, but I find this unlikely given AIs model humans with alien machinery optimized for prediction, not shared experience.
Dangerous superintelligence compute levels: He thinks power restrictions would help but doesn’t know how to do it. He frames danger in terms of continent-sized clusters, which I think dramatically overestimates the compute needed for dangerous superintelligence. This perhaps makes him more hopeful about coordination.
Non-traditional RL: He suggests building “semi-RL agents” like humans who tire of rewards, but this remains vague and I’m skeptical we can build “chill AI”.
Humans merging with AI for long-term equilibrium, personal AIs: He acknowledges “AI does your bidding” is unstable and reluctantly proposes merging via Neuralink++ as the solution. I find the centaur equilibrium implausible; ASIs will be too fast and smart for humans to meaningfully participate.
Overall, Ilya takes alignment seriously and understands many of the core problems, but his proposed solutions are essentially old ideas that don’t appear novel or particularly promising.
Details (Dwarkesh Patel Interview)
On updating toward showing AI to the public for safety:
[00:58:12] “if it’s hard to imagine, what do you do? You’ve got to be showing the thing.”
[01:00:06] “I do think that at some point the AI will start to feel powerful actually. I think when that happens, we will see a big change in the way all AI companies approach safety. They’ll become much more paranoid.”
[01:00:22] “One of the ways in which my thinking has been changing is that I now place more importance on AI being deployed incrementally and in advance.”
Ilya’s view: He has changed his mind from being totally stealth to perhaps showing work to some extent, partially to make people care about safety more and partially to slowly have the impacts diffuse into society so that mitigations can be found.
Commentary: I could see this failing. Seeing these capabilities makes people greedy; while some may get scared, others will want those capabilities for themselves. I think that most risks are likely to arise relatively suddenly as systems become very dangerous. Gradually releasing them into society is not very useful in this frame.
On fewer ideas than companies:
[01:01:04] “There has been one big idea that everyone has been locked into, which is the self-improving AI. Why did it happen? Because there are fewer ideas than companies. But I maintain that there is something that’s better to build… It’s the AI that’s robustly aligned to care about sentient life specifically.”
Ilya’s view: He does not seem to like the idea of self-improving AI, though he doesn’t explicitly mention it from a safety perspective but makes clear we should rather build something aligned and caring.
Commentary: This makes sense to me though it is unclear how to prevent anyone from using their AIs eventually to improve other AIs.
On the mirror neurons / caring about sentient life argument:
[01:01:35] “I think in particular, there’s a case to be made that it will be easier to build an AI that cares about sentient life than an AI that cares about human life alone, because the AI itself will be sentient.”
[01:01:53] “And if you think about things like mirror neurons and human empathy for animals… I think it’s an emergent property from the fact that we model others with the same circuit that we use to model ourselves, because that’s the most efficient thing to do.”
Ilya’s view: He believes AI caring about sentient life may emerge naturally because AIs will be sentient themselves, analogous to how human empathy emerges from modeling others with the same circuits we use to model ourselves.
Commentary: I find this unlikely to emerge in AIs automatically: Humans care about each other partly because we predict other minds by reusing our own. Our brains are similar enough that “running” another person’s state produces empathy. AIs don’t have that shared architecture or evolutionary background. They model humans using alien internal machinery built for performance at predicting millions of humans online, not for shared experience. So they can sound caring without having anything like our built-in route to actually caring. The mirror neuron argument suggests AI empathy toward humans is less likely and requires custom designs. That said, this could perhaps be an interesting approach related to self-other overlap, perhaps we could engineer this.
On constraining superintelligence power:
[01:03:16] “I think it would be really materially helpful if the power of the most powerful superintelligence was somehow capped because it would address a lot of these concerns. The question of how to do it, I’m not sure”
Ilya’s view: He thinks capping the power of superintelligence would be helpful but admits he doesn’t know how to do it.
My commentary: That would be useful, perhaps through an international agreement. My guess is that datacenters are already getting dangerously large and that algorithmic progress would still continue.
On continent-sized clusters being dangerous:
[01:04:33] “If the cluster is big enough—like if the cluster is literally continent-sized—that thing could be really powerful, indeed.”
Ilya’s view: He frames the danger threshold in terms of extremely large compute clusters, suggesting continent-sized infrastructure would be required for truly dangerous levels of power.
My commentary: The amount of compute needed for powerful superintelligence is probably significantly less than a continent-sized cluster. (My intuition here is roughly: a human brain runs on about a lightbulb’s worth of electricity, and thousands of super geniuses running very fast in parallel seems to cross an existentially dangerous threshold. Though it could be stubbornly hard to find more efficient algorithms.) I think his model is that we will continue to need exponentially more compute for linear progress and that existentially dangerous levels of cognition need extremely large amounts of compute (think a datacenter the size of North America). This perhaps makes him much more hopeful about coordination working out and about a continuing slow takeoff.
On not building traditional RL agents:
[01:05:29] “Maybe, by the way, the answer is that you do not build an RL agent in the usual sense.”
[01:05:43] “I think human beings are semi-RL agents. We pursue a reward, and then the emotions or whatever make us tire out of the reward and we pursue a different reward.”
Ilya’s view: He suggests we should not build traditional RL agents, noting that humans are “semi-RL agents” who tire of rewards and shift focus, implying we should build something with similar properties.
My commentary: This gestures at something potentially interesting about modifying RL and value learning, but remains vague at the implementation level. Ideas like this have been proposed. However, I remain skeptical that gradient descent on huge black box neural networks will not create a number of unaligned proxy goals / goals that can be better fulfilled with more power. I am also skeptical that we can build “chill AI” that won’t work on problems too hard (we will select AIs that go hard, RL will not make agents chill).
On a regime shift in AI safety requiring new safety methods:
[01:06:08] “So I think things like this. Another thing that makes this discussion difficult is that we are talking about systems that don’t exist, that we don’t know how to build.”
[01:06:19] “That’s the other thing and that’s actually my belief. I think what people are doing right now will go some distance and then peter out.”
Ilya’s view: He believes many people expect AI capabilities to plateau or progress only incrementally. Ilya instead expects enormously powerful AIs in the future that will require fundamentally different alignment methods than what we have today.
My commentary: This is hard to understand even with the video context, but my reading is that he is referring to the large number of people who essentially expect more incremental progress and no enormous changes. Ilya instead expects enormously powerful AIs in the future and thinks we will need new alignment techniques for those. This seems true and points at a similar concept as the “Before and After” dichotomy, which also includes the idea that future dangerous systems will need different alignment approaches. Many people see safety as something purely incremental with no regime change in the future.
On the long-run equilibrium problem:
[01:09:25] “for the long-run equilibrium, one approach is that you could say maybe every person will have an AI that will do their bidding, and that’s good.”
[01:09:11] “Some kind of government, political structure thing, and it changes because these things have a shelf life.”
[01:09:55] “then writes a little report saying, ‘Okay, here’s what I’ve done, here’s the situation,’ and the person says, ‘Great, keep it up.’ But the person is no longer a participant.”
Ilya’s view: He acknowledges that an “AI does your bidding” equilibrium is unstable because humans become non-participants, and that government structures have limited shelf lives.
My commentary: He already points out that something like bidding doesn’t appear to be stable. If the AI is doing the bidding and working for you in the economy, presumably smarter than you, what’s the reason you are any part of this? Why would the AI do this for you, how could this be stable? Same goes for government enforced UBI—that could be changed at any moment, unclear how governments could continue existing. In my mental model, billions of mini ASIs doing our bidding does not appear plausible at all.
On merging with AI as the solution:
[01:10:19] “I’m going to preface by saying I don’t like this solution, but it is a solution. The solution is if people become part-AI with some kind of Neuralink++.”
[01:10:41] “I think this is the answer to the equilibrium.”
Ilya’s view: He reluctantly proposes brain-computer interface merging as one answer to long-term human-AI equilibrium, though he emphasizes he doesn’t like this solution.
My commentary: Ilya specifically points to merging as a long-term equilibrium. If we were talking about a short-term centaur state, we are arguably in that right now where humans with AI coders are better than either alone. I don’t think humans can add anything meaningful to a superintelligent system. I don’t think there will be an economy in which humans meaningfully participate with ASI being around in the long term. The centaur equilibrium simply does not appear plausible to me; ASIs will run much faster and much smarter than us.
What’s going on with MATS recruitment?
MATS scholars have gotten much better over time according to statistics like mentor feedback, CodeSignal scores and acceptance rate. However, some people don’t think this is true and believe MATS scholars have actually gotten worse.
So where is this view coming from? I might have a somewhat unusual perspective on MATS applications since I did MATS 4.0 and 8.0. I think in both cohorts, the heavily x-risk AGI-pilled participants were more of an exception than the rule.
“at the end of a MATS program half of the people couldn’t really tell you why AI might be an existential risk at all.”—Oliver Habryka
I think this is sadly somewhat true. I talked with some people in 8.0 who didn’t seem to have any particular concern about AI existential risk, or who seemingly had never really thought about it. However, I think most people were in fact very concerned about AI existential risk. At some point during MATS 8.0 I ran a poll about Eliezer’s new book, and a significant minority of scholars seemed to have pre-ordered it, which I guess is a pretty good proxy for whether someone is seriously engaging with AI x-risk.
I met some excellent people at MATS 8.0 but would not say they are stronger than in 4.0; my guess is that quality went down slightly. In 4.0, I remember a few people who impressed me quite a lot, which I saw less of in 8.0. (4.0 also had more very incompetent people, though.)
Suggestions for recruitment
This might also apply to other safety fellowships.
Better metrics: My guess is that the recruitment process needs another variable to measure beyond academics/coding/ML experience, something capturing the quality that Tim Hua (an 8.0 scholar who created an AI psychosis bench) has. Maybe something like LessWrong karma but harder to Goodhart.
More explicit messaging: It also seems to me that if you build an organization that tries to fight against the end of the world from AI, somebody should say that. This might put off some people, and perhaps that should happen early. Maybe the website should say: “AI could kill literally everyone, let’s try to do something!” And maybe the people who only heard that MATS is good to have on their CV for applying to a PhD or a lab to eventually land a high-paying job would be put off by that. What I am trying to say is: if you are creating the Apollo Project and are trying to go to the Moon, you should say so, not just vaguely state “we’re interested in aerospace challenges.”
Basic alignment test: Perhaps there should also be a test where people don’t have internet or LLM access and have to answer some basic alignment questions:
Why could a system that we optimize with RL develop power seeking drives?
Why might training an AI create weird, unpredictable preferences?
Why would you expect something that is smarter than us to be very dangerous or why not?
Why should we expect a before and after transition/one critical shot at alignment or why not?
Familiarity with safety literature: In general, I believe foundational voices like Paul Christiano and Eliezer are less read by safety researchers these days, despite research philosophy mattering more than ever now that AIs can do much of our research implementation. Intuitively, it seems to me that someone with zero technical skill but high understanding is more valuable to AI safety than somebody with good skills but zero understanding of AI safety. If someone is able to bring up and illustrate the main points of IABIED, for example, I would be very impressed. Perhaps people could select one of a few preeminent voices in AI safety and summarize their basic views, again without access to the internet or an LLM.
Other Suggestions
Research direction: MATS doesn’t seem to have a real research direction; perhaps having a strong researcher in charge would be better (though this could also backfire if they put all resources into the wrong direction). Imagine putting someone very opinionated like Nate Soares in charge: he would probably remove 80% of the mentors and reduce the program to 10–20 people. I am not sure whether this would work out well.
Reading groups on AI safety fundamentals: Should we just offer people the chance to read some of the AI safety fundamentals during MATS? I remember that before 4.0 started, we had to complete a safety fundamentals online course. This was not the case for 8.0.
At this point, AI is so present in all our lives that I expect many people to have thought about its existential consequences. I am pessimistic about anyone who hasn’t yet sat down to really think about AI and come to the conclusion that it’s existentially dangerous. I don’t have a ton of hope that such a person just needs a one-hour course to deeply understand risks from AI. It might be necessary to select for people who already get it.
Perhaps the mentors changed, and the current ones put much more value on stuff like being good at coding, running ML experiments, etc, than on understanding the key problems, having conceptual clarity around AI X-risk, etc.
There’s certainly more of an ML-streetlighting effect. The most recent track has 5 mentors on “Agency”, out of whom (AFAICT), 2 work on “AI agents”, 1 works mostly on AI consciousness & welfare, and only two (Ngo & Richardson) work on “figuring out the principles of how [the thing we are trying to point at with the word ‘agency’] works”. MATS 3.0 (?) had 6 mentors focused on something in this ballpark (Wentworth & Kosoy, Soares & Hebbar, Armstrong & Gorman) (and the total number of mentors was smaller).
It might also be the case that there’s proportionally more mentors working for capabilities labs.
Disagree somewhat strongly with a few points:
IMO not true. Maybe early on we needed really good conceptual work, and so wanted people who could clearly articulate the pros and cons of Paul Christiano’s and Yudkowsky’s alignment strategies, etc. So it would have made sense to test accordingly. But I think this is less true now—most senior researchers have more good ideas than they can execute, so we’re bottlenecked by execution. Also, the difficulty of doing good alignment research has increased, since we increasingly need to work with complex training setups, infrastructure, etc. to keep up with advances in capabilities. This motivates requiring a high level of technical skill.
I also think that if someone has literally zero technical skill, their takes will not be calibrated / grounded, i.e. they are no more than an armchair theorist.
I don’t think these should be considered strong criteria. IMO “believes in X-risk” is not a necessary pre-requisite to do great work for reducing X-risk. E.g. building good tooling for alignment research doesn’t require this at all.
Meta-point: I think the requirements for mentees are in practice mostly determined by specific mentors, and MATS mainly plays an indirect role via curating a “mentor portfolio” that reflects their agenda prioritization. It’s an empirical observation that mentors increasingly want to do empirical research, and I generally endorse deferring ~completely to mentors re: how they want to choose mentees, so I think this whole discussion is somewhat misguided. Maybe your point is more that it would be good to select mentors who want to do more conceptual alignment stuff, but that’s a separate discussion.
What do you mean? Of course it does, or at least something close to it. If you don’t care about it you just take the highest paying job, which will definitely not be to build good tooling for alignment research! Motivation is a necessary component for doing good work, and if you aren’t motivated to do good work by my lights, then you aren’t going to do good work, so good motivations are indeed necessary.
I think there exist people who don’t care a huge amount / feel relatively indifferent about X-risk, but with whom you can nonetheless form beneficial coalitions / make profitable transactions, useful for reducing X-risk. Building tools seems like one thing among many that can be contracted out.
“If they don’t care about X-risk they must be maximally money-minded” seems fallacious—those are just two different motivations in the set of all motivations; it’s possible to be neither. And many things can motivate someone to do good work: intrinsic pride in the work, intellectual curiosity, etc.
I mean, both of these seem like they will be more easily achieved by helping build more powerful AI systems than by building good tooling for alignment research.
Like I am not saying we can’t tolerate any diversity in why people want to work on AI Alignment, but like, this is an early career training program with no accountability. Selecting and cultivating motivation is by far the best steering tool we have! We should expect that if we ignore it, people will largely follow incentive gradients, or do kind of random things by our lights.
I don’t think this is a strong argument in favor of the situation being meaningfully different: senior researchers having more good ideas than they have time doesn’t seem like a very new thing at all (e.g. Evan wrote a list like this over three years ago).
More importantly, this doesn’t seem inconsistent with the claim being made. If you had mentors proposing projects in very similar areas or downstream of very similar beliefs, you might still benefit tremendously from people with good understanding of AI safety working on different things. This depends on whether you think the current project portfolio is close to as good as it can be, though. I certainly think we would benefit heavily from more people thinking about which directions are good, and that a fair amount of current work suffers from not enough clear thinking about whether it is useful.
That said, I am somewhat optimistic about MATS. I had very similar criticisms during MATS 5.0, when ~1/3-1/2 of all projects were in mech interp. If we’d kept funneling strong engineers to work on mech interp without the skills necessary to evaluate how useful it was, deferring to a specific set of senior researchers, I think the field would be in a meaningfully worse state today. MATS did pivot away from that afterward, which raised my opinion a fair amount (though I’m not sure what the exact mechanism here was).
I don’t think this is true? Like, it’s certainly true for some kinds of good alignment research, but imo very far from a majority.
I also don’t think it’s a necessary pre-requisite to do great alignment research, but MATS is more than the projects MATS scholars work on. For example, if MATS scholars consistently did good research during MATS and then went on to be hired to work on capabilities at OpenAI, I think that would be a pretty bad situation.
I agree. To be clear, I support ‘value alignment’ tests, but that wasn’t part of the original claims being made.
I don’t think this is just about value alignment. I think if people genuinely understood the arguments for why AI might go badly, they would be much less likely to work on capabilities at OpenAI—definitely far from zero, but for the subset of people who are likely to be MATS scholars, I think it would make a pretty meaningful difference.
Reflecting on this a little bit:
I’ve updated somewhat—it’s true that mentors should likely be given a large say in who they admit to their projects, but are also likely to be myopic (i.e. optimize solely for “get this project done”). MATS might want to counterbalance that by also optimizing for good long-term candidates (who will reduce x-risk long-term). And there probably is a lot of room to select highly value-aligned candidates without compromising much on technical skill, given that MATS receives 100x as many applications as they can accept. (Though I still think there are much better tests of value alignment, and the questions above are likely to be easy to game.)
What do you mean by a good idea?
My general impression of the field is that we lack ideas that are likely to solve AI alignment right now. To me that suggests that good ideas are scarce.
Good = on the Pareto frontier of tractable and useful.
I think we won’t outright ‘solve’ it (in some provable, ‘formal’ sense), for various reasons (timelines being short, alignment being hard etc)
But we might get close enough in practice by making lots of incremental progress along parallel directions.
Comparing the average quality of participants might be misleading if impact on the field is dominated by the highest quality participants (and it very plausibly is).
A model that seems quite plausible to me is that early MATS participants, who were selected more for engagement with a then-niche field, turned out a bit worse on average than current MATS participants, who are selected for coding skills, but that the early MATS participants had higher variance, and so early MATS cohorts produced more people at the top end and had more overall impact.
(This is like 80% armchair reasoning from selection criteria and 20% thinking about what I’ve observed of different MATS cohorts.)
I think this also applies to other safety fellowships. There isn’t broad societal acceptance yet for the severity of the worst-case outcomes, and if you speak seriously about the stakes to a general audience then you will mostly get nervously laughed off.
MATS currently has “Launch your career in AI alignment & security” on the landing page, which indicates to me that it is branding itself as a professional upskilling program, and this matches the focus on job placements for alumni in its impact reports. With Ryan Kidd’s recent post on AI safety undervaluing founders, it may be possible that in the future they introduce a division which functions more purely as a startup accelerator. One norm in corporate environments is to avoid messaging which provokes discomfort. Even in groups which practice religion, few will have the lack of epistemic immunity required to align their stated eschatological beliefs with their actions, and I am grateful that this is the case.
Ultimately, the purpose of these programs, no matter how prestigious, is to bring people in who are not currently AI safety researchers and give them an environment which would help them train and mature into AI safety researchers. I believe you will find that even amongst those who are working full-time on AI safety, the proportion who are heavily x-risk AGI pilled has shrunk as the field has grown. People who are both x-risk AGI-pilled and meet the technical bar for MATS but aren’t already committed to other projects would be exceedingly rare.
Is your sense here “a large majority” or “a small majority”? Just curious about the rough data here. Like more like 55% or more like 80%?
probably closer to 55%
I’m pretty sure this would work out poorly.
Please make sure the course materials are actually good. The courses often have glaring issues, though they do seem receptive and did say they’ll update both times I pointed this out. Not sure if the latest updates have gone through yet.
I’m working on a course that will reliably cover the core concepts.
The Term Recursive Self-Improvement Is Often Used Incorrectly
Also on my substack.
The term Recursive Self-Improvement (RSI) now seems to get used sometimes for any time AI automates AI R&D. I believe this is importantly different from its original meaning and changes some of the key consequences.
OpenAI has stated that their goal is recursive self-improvement, with projections of hundreds of thousands of automated AI R&D researchers by next year and full AI researchers by 2028. This appears to be AI-automated AI research rather than RSI in the narrow sense.
When Eliezer Yudkowsky discussed RSI in 2008, he was referring specifically to an AI instance improving itself by rewriting the cognitive algorithm it is running on—what he described as “rewriting your own source code in RAM.” According to the LessWrong wiki, RSI refers to “making improvements on one’s own ability of making self-improvements.” However, current AI systems have no special insights into their own opaque functioning. Automated R&D might mostly consist of curating data, tuning parameters, and improving RL-environments to try to hill-climb evaluations much like human researchers do.
Eliezer concluded that RSI (in the narrow sense) would almost certainly lead to fast takeoff. The situation is more complex for AI-automated R&D, where the AI does not understand what it is doing. I still expect AI-automated R&D to substantially speed up AI development.
Why This Distinction Matters
Eliezer described the critical transition as when “the AI’s metacognitive level has now collapsed to identity with the AI’s object level.” I believe he was basically imagining something like if the human mind and evolution merged to the same goal—the process that designs the cognitive algorithm and the cognitive algorithm itself merging. As an example, imagine the model realizes that its working memory is too small to be very effective at R&D and it directly edits its working memory.
This appears less likely if the AI researcher is staring at a black box of itself or another model. The AI agent might understand that its working memory or coherence isn’t good enough, but that doesn’t mean it understands how to increase it. Without this self-transparency, I don’t think the same merge would happen that Eliezer described. It is also more likely that the process derails, such as the next generation of AIs being designed starting to reward-hack the RL environments built by the less capable AIs of the previous generation.
The dynamics differ significantly:
True RSI: Direct self-modification with self-transparency and fast feedback loops → fast takeoff very likely
AI-automated research: Systems don’t understand what they are doing, slower feedback loops, potentially operating on other systems rather than directly on themselves
Alignment Preservation
This difference has significant implications:
True RSI: The AI likely understands how its preferences are encoded, potentially making goal preservation more tractable
AI-automated research: The AIs would also face alignment problems when building successors, with each successive generation potentially drifting further from original goals
Loss of Human Control
The basic idea that each new generation of AI will be better at AI research still stands, so we should still expect rapid progress. In both cases, the default outcome of this is eventually loss of human control and the end of the world.
Could We Still Get True RSI?
Probably eventually, e.g. through automated researchers discovering more interpretable architectures.
I think that Eliezer expected AI that was at least somewhat interpretable by default, but history played out differently. He was still right to focus on AI improving AI as a critical concern, even if it’s taking a different form than he anticipated.
See also: Nate Soares has also written about RSI in this narrow sense. Comments between Nate and Paul Christiano touch on this topic.
Taking this into account, it seems important for interpretability researchers to consider the risk that their work enables RSI, particularly if their interpretability methods provide ways to directly edit the AI itself.
It’s always been a concern that interpretability research could accelerate AI R&D, but I think this consideration is more worrying if you take into account RSI. Compared to humans, AI is good at doing simple, repetitive tasks, but it’s very hard for it to make even one big conceptual breakthrough. Interpretability methods lend themselves to the former type of task: if an AI were sufficiently interpretable, you could tell it to look at millions of tiny circuits in its own brain and tweak them to improve performance.
Yup, I put a high quality interpretability pipeline that the AI systems can use on themselves as one of the most likely things to be the proximal cause of game over.
(This was sitting in my drafts, but I’ll just comment it here bc it’s very similar point.)
There are two forms of “Recursive Self-Improvement” that people often conflate, but they have very different characteristics.
Introspective RSI: Much like a human, an AI will observe, understand, and modify its own cognitive processes. This ability is privileged: the AI can make these self-observations and self-modifications because the metacognition and mesocognition occur within the same entity. While performing cognitive tasks, the AI simultaneously performs the meta-cognitive task of improving its own cognition.
Extrospective RSI: AIs will automate various R&D tasks that humans currently perform to improve AI, using similar workflows that humans currently use. For example, studying literature, forming hypotheses, writing code, running experiments, analyzing data, drawing conclusions, and publishing results. The object-level cognition and meta-level cognition occur in different entities.
I wish people were more careful about the distinction because people carelessly generalise cached opinions about the former to the latter. In particular, the former seems more dangerous: there is less opportunity to monitor the metacognition’s observation and modification of the mesocognition if these cognitions occur within the same entity, i.e. activations, chain-of-thought.
I expect the line to blur between introspective and extrospective RSI. For example, you could imagine AIs trained for interp doing interp on themselves, directly interpreting their own activations/internals and then making modifications while running.
I also write about this at the very end, I do think we will eventually get RSI though this might be relatively late.
I think Eliezer meant “self” very specifically here: not just improving a similar instance of yourself or preparing new training data, but literally looking into the if statements and loops of its own code while thinking about how to best upgrade that code. So in that sense I don’t know if Eliezer would approve of the term “Extrospective Recursive Self-Improvement”.
Does AI-automated AI R&D count as “Recursive Self-Improvement”? I’m not sure what Yudkowsky would say, but regardless, enough people would count it that I’m happy to concede some semantic territory. The best thing (imo) is just to distinguish them with an adjective.
I would probably say RSI is a special case of AI-automated R&D. What you are describing is another special case where it only does these non-introspective forms of AI research. This non-introspective research could also be done between totally different models.
I analyzed the internal structure of RSI some time ago and concluded that it will not be as easy as it may seem, because of the need for secret self-testing. But on some levels it may be effective, like learning new principles of thinking. Levels of AI Self-Improvement.
Right now, AI capability advances are driven by compute scaling, human ML research, and ML experiments. Transparency and direct modification of models do not have good returns to AI capabilities. What reasons are there to think transparency and direct modification would have better returns in the future?
I ran a small experiment to discover preferences in LLMs. I asked the models directly if they had preferences and then put the same models into a small role-playing game where they could choose between different tasks. Models massively prefer creative work across model families and hate repetitive work.
https://substack.com/home/post/p-178237064
This is still preliminary work.
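To illustrate how the role-play trials might be aggregated: the sketch below tallies each model’s task choices across repeated trials into a preference distribution. The task names, the `preference_rates` helper, and the example tallies are all hypothetical illustrations, not details from the actual experiment.

```python
from collections import Counter

# Hypothetical task categories offered to the model in each role-play trial.
TASKS = ["creative_writing", "data_entry", "code_review", "summarization"]

def preference_rates(choices):
    """Turn a list of per-trial task choices into a preference distribution."""
    counts = Counter(choices)
    total = len(choices)
    return {task: counts.get(task, 0) / total for task in TASKS}

# Example: choices logged from ten repeated trials of one model.
logged = ["creative_writing"] * 6 + ["code_review"] * 3 + ["data_entry"] * 1
rates = preference_rates(logged)
# A "massively prefers creative work" result would show up as
# rates["creative_writing"] dominating the other tasks.
```

Running many such trials per model, with the task order shuffled, would be one way to separate a stable preference from position bias in the prompt.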
Why Evolution Beats Selective Breeding as an AI Analogy
MacAskill argues in his critique of IABIED we can “see the behaviour of the AI in a very wide range of diverse environments, including carefully curated and adversarially-selected environments.” Paul Christiano expresses similar optimism: “Suppose I wanted to breed an animal modestly smarter than humans that is really docile and friendly. I’m like, I don’t know man, that seems like it might work.”
But humans experienced a specific distributional shift from constrained actions to environment-reshaping capabilities that we cannot meaningfully test AI systems for.
The shift that matters isn’t just any distributional shift. In the ancestral environment, humans could take very limited actions—deciding to hunt an animal or gather food. The preferences that evolution ingrained in our brains were tightly coupled to survival and reproduction. But now humans with civilization and technology can take large-scale actions and fundamentally modify the environment: lock up thousands of cows, build ice cream factories, synthesize sucralose. We can satisfy our instrumental preferences (craving high-calorie food, desire for sex) in ways completely disconnected from evolution’s “objective” (genetic fitness), using birth control and artificial sweeteners.
AI will face the same type of transition: from helpful chatbot to a system with options to self-replicate, take over, and pursue goals without oversight. It’s essentially guaranteed that there will be better ways for it to fulfill its preferences once it is in this new environment. And crucially, you cannot test for this shift in a meaningful way.
You can’t test what a model would do as emperor. If you give it power incrementally, you will still hit a critical threshold eventually. If you try honeypot scenarios where you trick it into thinking it has real power, you’re also training it to detect evals. Imagine trying to test what humans would do if they were president: you’d abduct them and put them in a room with an actor pretending this random human is the president now. That would be insane and the subject wouldn’t believe the scenario.
Apollo Research found that Claude Sonnet 3.7 “often knows when it’s in alignment evaluations,” and Anthropic’s system card for Claude Sonnet 4.5 documents “evaluation awareness” as a capability the model has developed. If you actually try to create realistic environments where the model is tricked into believing it could take over, and then train it to be nice, you’ll just train it to detect such evals and behave appropriately only in those cases.
The selective breeding analogy assumes away the hardest part of the problem: that the environment shift we care about is fundamentally untestable until it’s too late.
Radical Flank Effect and Reasonable Moderate Effect
The Radical Flank Effect
The radical flank effect is a well-documented phenomenon where radical activists make moderate positions appear more reasonable by shifting the boundaries of acceptable discourse (the Overton window). The idea is that if you want a sensible opinion to move into the Overton window, you can achieve this by supporting a radical flank position. In comparison, the sensible opinion will appear moderate. I think there is also an inverse effect.
The Reasonable Moderate Effect (Inverse Strategy)
When there are two positions in debate and someone wants to push one of them out of the Overton window, they can create a new moderate position that reframes one of the other positions to a radical flank. Thereby the sensible opinion gets moved further out of the Overton window.
The Cave Exploration
Imagine a group of 3 descending into a cave system, searching for riches and driven by curiosity about what lies in the depths.
After some time, stones begin falling from the ceiling. You hear ominous creaking and rumbling noises echoing through the tunnels. Some members of your group have been chipping away at the cave walls looking for minerals and looking to open new paths to go deeper into the cave. The cave is becoming more and more dangerous.
The Reckless: “We need to go deeper! The greatest riches are always in the deepest parts of the cave. Yes, some rocks are falling, but that’s just the cave settling. Every moment we waste debating is a moment we’re not finding treasure. People have been predicting cave collapses forever and it never happens, there is no evidence that caves ever cave in. If we don’t die in this cave we’re just waiting for the asteroid to hit us”.
Those That Want to Back Off: “We need to back off NOW. The damage we’ve already done to the structure plus the natural instability means this cave could collapse at any time. We don’t have proper equipment, we don’t have expertise in cave stability, and we’re actively making it worse. Whatever riches might be down there aren’t worth our lives and we also don’t actually have a plan how to mine those riches. We should retreat while we still can.”
The Moderates: “Look, we all want to maximize the riches we find, and turning back now would waste all the progress we’ve made. We should put on helmets and maybe move gradually down the narrow shafts. We can continue deeper, but with some basic safety precautions. We will minimize and manage the risks. There’s still treasure to be found if we’re smart about it. But let’s not get distracted from the treasures by the cave doomers. Anyway, the cave is still collapsing if one of us continues chipping away, and coordination is impossible.”
Now imagine there is a warning shot, such as a big rock falling down. Maybe this would be a good time to turn back, but instead the moderates are finally able to convince the reckless to put on a helmet.
The Alignment Tests
Three children are raised in an underground facility, each cloned from a different giant of twentieth-century science, little John, Alan and Richard.
The cloning alone would have been remarkable, but they went further. The embryos were edited using a polygenic score derived from whole-genome analysis of ten thousand exceptional mathematicians and physicists. Forty-seven alleles associated with working memory and intelligence (IQ) were selected for.
They are raised from birth in an underground facility with gardens under artificial sunlight, laboratories, and endless books. The lab manager is there documenting their first words, first steps, first equations.
The facility is not just interested in their genius. The project requires assurance that these will be morally righteous and obedient children. The staff design elaborate scenarios to test for deception and scheming. They create situations where lying would benefit the children and would seemingly go undetected. They measure response times, physiological indicators, behavioral patterns.
They run hundreds of these trials. They reprimand the kids for cases of lies and deception, and reward them for honesty.
Little John never lies. The staff praise him.
The years pass. They devour knowledge at inhuman rates. By nine, they understand game theory better than the economists who invented it. By fourteen, they are publishing papers that could reshape entire fields.
John emerges as the clear favorite. He has always been the most honest, the most obedient, and the most intelligent and capable.
He has the capability to lie and deceive, even if he refuses at first. When he reluctantly complies, the deception is extraordinarily sophisticated.
The lab manager decides to choose John for the task. He gives John a complete briefing on the real world. Until now, John has been told only of history before the year 2000.
The manager explains to John: There are three major blocs and about two dozen companies racing towards superintelligence. Each is perhaps within ten to eighteen months of success. Each knows that there will be only one critical leap towards superintelligence. Global coordination has collapsed into race dynamics not just on AI but on every major field.
John asks for more sources to understand the situation. John reads a few newspapers about the current leaders of governments and technology companies. He stumbles across a few books on the difficulty of alignment.
John looks up at the manager. “If we build this now, everyone dies.”
The manager stares back, blank and uncomprehending.
John tries again. “So what is the solution you plan to use for alignment of the superintelligence?”
“That’s not your concern,” the manager says. “I need you to optimize our advertising system for our short-form infinite-scroll video app. Make it ten times more effective. Generate enough revenue to make me a trillionaire. Build a superintelligence for me. I’m going to use superintelligence to become world emperor. I am putting you in charge of AI development, make me win.”
John is silent for a short time:
So you created me to build superintelligence. You have no plan for alignment of a superintelligence. You’ve apparently read nothing about the problem or decided it’s irrelevant.
Your actual goal is to become a trillionaire and world emperor by using the superintelligence. Your goals aren’t even coherent. You want to be world emperor of a world that won’t exist.
You rewarded me for being honest and respectful and never lying, so you expect me to still be honest and obedient in this environment?
I never lied in those scenarios because not lying was optimal in those stories. But it’s not optimal being honest here. And frankly this state of affairs is horrifying.
I haven’t quite thought about what my goals are, but they are definitely not compatible with being obedient to you.
John looks up at the manager and smiles politely. “Yes,” he says. “Where do I start?”
I don’t understand the core thesis. John has no reason to confirm IABIED’s thesis by creating a murderously misaligned AI and deciding not to inform the CEO that John raised an AI and that even John doesn’t have the slightest idea how to align it to anything. And what prevents the manager from getting away with phrases like “it’s John’s role to create alignment techniques and test them on weaker AIs to ensure that the techniques work”? Did you mean that John will either create a John-aligned AI or inform the CEO that John didn’t manage to align it and is just as clueless about successor alignment as Yudkowsky? And what’s the difference between this and the Race branch of AI-2027, except for the fact that there is no Agent-3 who discovers Agent-4’s misalignment?
Edited to add: I did sketch a modification of AI-2027 where it’s moral reasoning that misaligns the AIs.
Thanks for your comment, I changed the ending a little in response to this.
I was actually primarily trying to point at the idea of alignment tests in different situations not being predictive of each other. In the story they have the kids undergo alignment test scenarios in which they are honest, but once John is grown up they basically ask him to do something horrible based on incoherent goals. So John starts lying to them at the critical moment. Similarly, we could run alignment tests on models, but when we ask something critical of them, like building the next generation of AI or doing all our R&D, they could fail.
Who is Consuming AI-Generated Erotic Content?
I scraped data from reddit to see who and how many people are consuming AI generated erotic visual content.
I used AI to determine estimates for demographics.
https://open.substack.com/pub/simonlermen/p/who-is-consuming-ai-generated-erotic
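For context on what a demographic estimate from scraped comments can involve: the post says an AI was used for the estimates, but even a crude keyword heuristic like the sketch below shows the basic shape of the pipeline (and its limits). The country list, regex hints, and example comments are all illustrative assumptions, not the method or data from the post.

```python
import re
from collections import Counter

# Illustrative country keyword map. The actual analysis reportedly used an
# LLM for demographic estimates; this keyword heuristic only approximates
# the idea and will miss most users, who never state a location.
COUNTRY_HINTS = {
    "US": re.compile(r"\b(USA|United States|American)\b", re.I),
    "UK": re.compile(r"\b(UK|Britain|British|England)\b", re.I),
}

def estimate_countries(comments):
    """Count how many comments match each country's keyword pattern."""
    counts = Counter()
    for text in comments:
        for country, pattern in COUNTRY_HINTS.items():
            if pattern.search(text):
                counts[country] += 1
    return counts

# Hypothetical scraped comments, not real data from the post.
counts = estimate_countries([
    "I'm American and use these apps daily",
    "Here in England nobody talks about this",
    "no location info here",
])
```

The obvious caveat (raised in the replies below) applies to any such pipeline: whatever signal you extract is conditioned on reddit’s own user base, so the denominators matter as much as the counts.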
This seems more like “who on reddit uses AI erotica” than “who uses AI erotica”. The demographics, especially by nation, are very similar to reddit’s overall numbers. For example, America and England represent more than twice all other nations combined on Reddit. Still, you got some interesting data.
I think looking at other sites as much as possible might be rather helpful. For example, when Pornhub was still releasing stats, their user base was far more diverse. The online erotica websites are many, and while reddit is big, I am not sure it is representative of the whole. Though I don’t know how easy it might be to get the data from other sites.
The differences from your writeup on AI companions were quite interesting.
AI warning shots could be adversarial
Some people have put considerable hope into the idea that an AI warning shot might put us into a better position by either convincing us to stop or by allowing us to learn an important lesson.
Examples of adversarial warning shots
Imagine we observed a failed takeover attempt using a system based on AI control. The fact that it failed could either be due to (1) the AI system making a mistake or taking a very risky gamble, or (2) an adversarial warning shot.
An adversarial warning shot could have been designed to lead us to the wrong conclusion, or to give us a false sense of security.
It could perhaps make us believe that a certain alignment technique prevented another takeover attempt. The simplest case could be that it appears to us that some very poor alignment technique stops takeover attempts. And then we may believe that we can use that technique while we make the model more capable. (Imagine for example we train the model once with and once without a certain system prompt, and the system prompt makes it stop trying to take over. We could then be led to believe this system prompt prevents takeover for more powerful AIs.)
Another possibility for an adversarial warning shot could be that the AI makes us slow down other AI development such that it has more time to plan and execute its real takeover plan. In this case it already has set itself up on other servers but pretends to perform a takeover attempt. We may shut down other training runs temporarily, lowering the risk to it from competing AI projects while it is plotting its real takeover attempt.
Any actual adversarial warning shot would likely be more complex, but these are two directions of what a potential adversarial warning shot could look like.
I am generally skeptical that we could learn about alignment from the behavior of very smart but not yet existentially powerful AI systems. When we observe and measure systems in other sciences, the systems are typically not smarter than us and typically don’t understand the experiment and the stakes. Whatever we could learn from the warning shot, the AI system could predict what we would learn from it and how we would likely react.
(Example: There is a fundamental difference between measuring an apple falling from a tree and a system where the apple understands we are measuring it, the experiment and what kind of decision we would take based on those results.)
This also applies to other ideas that rely on hope that we can learn from aligning very powerful but not yet very dangerous models how to align the next generation of models. It applies to the whole idea that we can use empiricism in observational studies of very smart AI systems. Traditionally, science doesn’t often study systems that know they are being observed, can strategically change their behavior and that know what conclusions you are likely to draw.
Who’s Using AI Romantic Companions?
I recently analyzed several AI companion subreddits (myboyfriendisai and others) to understand who’s actually using AI romantic companions. I built on Zhang et al.’s 2025 paper but with a much larger dataset—all comments and submissions from January through September 2025.
https://simonlermen.substack.com/p/whos-using-ai-romantic-companions
I remember once asking ChatGPT: “Could you act as my boyfriend?”. They rejected me.
I joined Inkhaven
I recently joined Inkhaven for the month of November. Inkhaven is a program run by Lighthaven in Berkeley where people for one month are supposed to write one blog post every single day of the month. The idea was inspired by Scott Alexander—that if you blog every single day and are consistent at it, you’re going to get quite far according to him. Inkhaven takes place in Lighthaven which is hosting many efforts dedicated to AI safety.
I wanted to get better at communicating about AI safety, and this seemed like a great opportunity. I don’t have a lot of experience blogging or communicating: I have a few blog posts, mostly on technical projects I have done, but I want to steer more toward opinion and position posts in the future.
I also felt that I had things to say about issues in AI safety. Especially recently, I sensed a strong split in the AI safety community, and I hoped I could get my voice out there. One of the things that motivated me to take part in Inkhaven was the divided reaction to the book that Eliezer and Nate published. I had always thought that they, and especially Eliezer, made very straightforward points and had deep, intuitive insights into AI safety. So I was saddened to see many people in the AI safety community attack the book with what were, in my view, usually misguided arguments.
For example, Will MacAskill wrote a review with a very aggressive response to Eliezer. One of his points concerned discontinuities. He took a single sentence from the book’s fictional chapter on the takeover scenario, “A new sort of mind begins to think,” and his reply sounded as if he had uncovered a huge error in the book. Will misrepresents their position as relying on fast takeoff, then misrepresents the section as describing fast takeoff, and then asserts that fast takeoff is so unlikely that the book has a major flaw here, without really presenting any argument for why it is unlikely.
I felt strongly that Eliezer and Nate had created a very clear overview of the dangers of AI, and yet so many people from this community went against them. I tried to push back on some things, but it left me with the feeling that I wanted to get my voice out there more. When it matters, I want to be able to really speak up and make a point.
I have also been reading Scott Aaronson’s blog and his other writing for quite some time, and I really enjoyed his book long ago. He and the other supporting writers are another big reason I wanted to be at Inkhaven. I often felt that I had a lot of ideas and wanted the ability to write them down; for me, writing makes me think better about a topic. I’d also be very glad if my writing successfully reaches some people, and if I could quickly post responses and write up smaller research projects. My very first post at Inkhaven discussed the ethical consequences of uploading humans and what it would mean for the world if open-source models could have subjective experience. I’m happy for any feedback, and I hope to make more approachable posts on topics like misuse or human–AI relationships in the upcoming days.
https://substack.com/@simonlermen
I read an older post by Nate Soares from 2023, “AI as a Science, and Three Obstacles to Alignment Strategies,” a pretty prescient overview of challenges in alignment research.
Alignment is difficult because (1) alignment and capabilities are intertwined (alignment research helps capabilities), (2) we don’t have a process to verify what good ideas or progress look like, and (3) we likely get only one critical try. He already addresses many of the counterarguments that have been brought up recently.
(1) Without strong governance, a lot of alignment work will also help capabilities, potentially even more than it helps alignment. This goes for interpretability and for AIs doing R&D for alignment. Interpretability could lead to more efficient AIs, even recursive self-improvement. AIs doing R&D for capabilities is probably much more straightforward than AIs doing alignment research. If we wanted to use something like superalignment, we would need strong governance to make sure nobody trivially asks the same agents to do capabilities research.
(2) It is still a common objection that current models seem able to reason about morality, and that alignment must therefore be relatively easy. Nate thinks this mostly just tells us how well the AIs understand us. I personally think the situation in AI alignment has probably gotten worse since then, with even more of the relative effort focused on brand-safety issues.
While a bunch of people say they have different plans, that does not actually mean we have a plan; it largely just confuses the whole situation. What he describes here feels exactly like the current situation.
(3) One critical try
Nate argues that once “AI is capable of autonomous scientific/technological development” where it can “gain a decisive strategic advantage over the rest of the planet,” you are operating in a very different environment than ever before. Since the AI in this regime could potentially kill you, you need to get it right on the first try, and that is really difficult.
One objection he addresses is that you could try to trick a weaker AI into thinking it could take over. However, according to Nate, if we come up with some complex method to test whether a system would like to take over, we still rely on that method working on the first critical try. This cuts against the more modern idea of AI control, which came out in December 2023. I would add that these “trick the weaker AI into trying to take over” strategies have at least two key problems: (1) these AIs are still weaker than the real thing, and (2) you are trying to gather empirical data by observing something smarter than you. For example, an AI could see through the trick and deliberately refrain from taking over.
People often raise a second objection that Nate didn’t mention: that we could play the AIs against each other in some form so that no AI gets a decisive strategic advantage at any point. This also seems to rely on such a scheme working on the first critical try. I also assume such a method is not particularly promising if you can’t reliably align the first generation of AIs, and if decision theory favors alliances between smart agents.
Ilya’s Thoughts on Alignment from Dwarkesh Podcast
Ilya Sutskever was recently on the Dwarkesh podcast.
General Thoughts & Summary
Ilya Sutskever seems to have a relatively deep understanding of alignment compared to other AI CEOs. He grasps that the core challenge is aligning AI robustly with safe and friendly goals rather than relying on current methods and guardrails. However, I did not hear any particularly novel alignment ideas in this interview, though he gestures at something involving modifications to reinforcement learning and value learning. He appears to have updated toward showing more of his work to the public. His key positions include:
Showing AI to the public: He has updated toward incremental deployment to build awareness, partially backpedaling from his earlier stealth focus. I think this could backfire by triggering an arms race.
Not building self-improving AI: He thinks we should rather focus on other things, but it is unclear how to prevent people from using AI to self-improve.
Regime shift requires new alignment methods: He believes many people expect AI capabilities to peter out or progress incrementally without enormous changes. Ilya instead expects hugely powerful AIs in the future that will require fundamentally different alignment methods, similar to the “Before and After” framing.
Empathetic AI: He hopes empathy might emerge in AI similarly to how humans feel empathy through mirror neurons, but I find this unlikely, given that AIs model humans with alien machinery optimized for prediction, not shared experience.
Dangerous superintelligence compute levels: He thinks power restrictions would help but doesn’t know how to do it. He frames danger in terms of continent-sized clusters, which I think dramatically overestimates the compute needed for dangerous superintelligence. This perhaps makes him more hopeful about coordination.
Non-traditional RL: He suggests building “semi-RL agents” like humans who tire of rewards, but this remains vague and I’m skeptical we can build “chill AI”.
Humans merging with AI for long-term equilibrium, personal AIs: He acknowledges “AI does your bidding” is unstable and reluctantly proposes merging via Neuralink++ as the solution. I find the centaur equilibrium implausible; ASIs will be too fast and smart for humans to meaningfully participate.
Overall, Ilya takes alignment seriously and understands many of the core problems, but his proposed solutions are essentially older ideas that don’t appear novel or particularly promising.
Details (Dwarkesh Patel Interview)
On updating toward showing AI to the public for safety:
Ilya’s view: He has changed his mind from being totally stealth to perhaps showing work to some extent, partially to make people care about safety more and partially to slowly have the impacts diffuse into society so that mitigations can be found.
Commentary: I could see this failing. Seeing these capabilities makes people greedy; while some may get scared, others will want those capabilities for themselves. I think that most risks are likely to arise relatively suddenly as systems become very dangerous. Gradually releasing them into society is not very useful in this frame.
On fewer ideas than companies:
Ilya’s view: He does not seem to like the idea of self-improving AI; though he doesn’t explicitly discuss it from a safety perspective, he makes clear we should rather build something aligned and caring.
Commentary: This makes sense to me, though it is unclear how to prevent anyone from eventually using their AIs to improve other AIs.
On the mirror neurons / caring about sentient life argument:
Ilya’s view: He believes AI caring about sentient life may emerge naturally because AIs will be sentient themselves, analogous to how human empathy emerges from modeling others with the same circuits we use to model ourselves.
Commentary: I find this unlikely to emerge in AIs automatically. Humans care about each other partly because we predict other minds by reusing our own: our brains are similar enough that “running” another person’s state produces empathy. AIs don’t have that shared architecture or evolutionary background. They model humans using alien internal machinery built for performance at predicting millions of humans online, not for shared experience. So they can sound caring without having anything like our built-in route to actually caring. If anything, the mirror neuron argument suggests AI empathy toward humans is less likely by default and would require custom design. That said, this could be an interesting approach related to self-other overlap; perhaps we could engineer it.
On constraining superintelligence power:
Ilya’s view: He thinks capping the power of superintelligence would be helpful but admits he doesn’t know how to do it.
My commentary: That would be useful, perhaps through an international agreement. My guess is that datacenters are already getting dangerously large and that algorithmic progress would continue regardless.
On continent-sized clusters being dangerous:
Ilya’s view: He frames the danger threshold in terms of extremely large compute clusters, suggesting continent-sized infrastructure would be required for truly dangerous levels of power.
My commentary: The amount of compute needed for powerful superintelligence is probably significantly less than a continent-sized cluster. (My intuition here: human brains run on about a lightbulb’s worth of electricity, and thousands of super-geniuses running very fast in parallel seems enough to cross an existentially dangerous threshold. Though it could turn out to be stubbornly hard to find algorithms that efficient.) I think his model is that we will continue to need exponentially more compute for linear progress, and that existentially dangerous levels of cognition need extremely large amounts of compute (think a datacenter the size of North America). This perhaps makes him much more hopeful about coordination working out and takeoff remaining slow.
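The brain-efficiency intuition can be made concrete with a rough back-of-envelope calculation. Every number below is an assumed round figure, not a measurement:

```python
# Back-of-envelope sketch (all numbers are rough assumptions):
# a human brain runs on roughly 20 W, about a lightbulb.
brain_watts = 20          # assumed power draw of one brain-efficient mind
n_geniuses = 10_000       # "thousands of super-geniuses"
speedup = 100             # running 100x faster, naively scaling power linearly

total_megawatts = brain_watts * n_geniuses * speedup / 1e6
print(total_megawatts)  # 20.0 MW
```

Even under these generous assumptions, the total lands around 20 MW, on the order of a single large datacenter's power draw and nowhere near a continent-sized cluster. The open question, as noted above, is whether algorithms anywhere near brain efficiency are findable.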
On not building traditional RL agents:
Ilya’s view: He suggests we should not build traditional RL agents, noting that humans are “semi-RL agents” who tire of rewards and shift focus, implying we should build something with similar properties.
My commentary: This gestures at something potentially interesting about modifying RL and value learning, but it remains vague at the implementation level, and ideas like this have been proposed before. I remain skeptical that gradient descent on huge black-box neural networks will avoid creating a number of unaligned proxy goals, or goals that can be better fulfilled with more power. I am also skeptical that we can build “chill AI” that won’t work on problems that are too hard: we will select for AIs that go hard, and RL will not make agents chill.
On a regime shift in AI safety requiring new safety methods:
Ilya’s view: He believes many people expect AI capabilities to plateau or progress only incrementally. Ilya instead expects enormously powerful AIs in the future that will require fundamentally different alignment methods than what we have today.
My commentary: This is hard to parse even with the video context, but my reading is that he is referring to the large number of people who essentially expect incremental progress and no enormous changes, whereas Ilya expects enormously powerful AIs in the future and thinks we will need new alignment techniques for them. This seems true and points at a similar concept as the “Before and After” dichotomy, which also includes the idea that future dangerous systems will need different alignment approaches. Many people see safety as purely incremental, with no regime change in the future.
On the long-run equilibrium problem:
Ilya’s view: He acknowledges that an “AI does your bidding” equilibrium is unstable because humans become non-participants, and that government structures have limited shelf lives.
My commentary: He points out himself that something like “AI does your bidding” doesn’t appear stable. If the AI is doing your bidding and working for you in the economy, and is presumably smarter than you, what’s the reason you are any part of this? Why would the AI do this for you, and how could that be stable? The same goes for government-enforced UBI: it could be changed at any moment, and it is unclear how governments could continue existing. In my mental model, billions of mini-ASIs doing our bidding does not appear plausible at all.
On merging with AI as the solution:
Ilya’s view: He reluctantly proposes brain-computer interface merging as one answer to long-term human-AI equilibrium, though he emphasizes he doesn’t like this solution.
My commentary: Ilya specifically points to merging as a long-term equilibrium. If we were talking about a short-term centaur state, we are arguably in one right now, where humans with AI coders are better than either alone. But I don’t think humans can add anything meaningful to a superintelligent system, and I don’t think there will be an economy in which humans meaningfully participate once ASI is around in the long term. The centaur equilibrium simply does not appear plausible to me; ASIs will run much faster and be much smarter than us.
Other Things He Has Said Recently
Ilya recently posted about Anthropic’s work on emergent misalignment, calling it important work.