Things I post should be considered my personal opinions, not those of any employer, unless stated otherwise.
Aaron_Scher
We care about the performance prediction at a given point in time for skills like “take over the world”, “invent new science”, and “do RSI” (and “automate AI R&D”, which I think the benchmark does speak to). We would like to know when those skills will be developed.
In the frame of this benchmark, and Thomas and Vincent’s follow-up work, it seems like we’re facing down at least three problems:
The original time horizons tasks are clearly out of the distribution we care about. Solution: create a new task suite we think is the right distribution.
We don’t know how well time horizons will do at predicting future capabilities, even in this domain. Solution: keep collecting new data as it comes out in order to test predictions on whatever distributions we have, examine things like the conceptual coherence objection and try to make progress.
We don’t know how well the general “time horizons” approach works across domains. We have some data on this in the follow up work, maybe it’s a 2:1 update from a 1:1 prior?
So my overall take is that I think the current work I’m aware of tells us:
Small positive update on time horizons being predictive at all.
A small positive update on the specific Software Engineering trends being predictive within distribution.
Small positive update on “time horizons” being common across different reasonable and easy to define distributions.
And on “doubling time in the single digit months” being the rate of time horizon increase across many domains.
A small negative update on the specific time horizon length from one task distribution generalizing to other task distributions (maybe an update, tbh the prior is much lower than 50/50). So it tells us approximately nothing about the performance prediction at a given point in time for the capabilities I care about.
That you think are sufficient to meaningfully change how valid people should interpret the overall predicted trend?
I’m not Adam, but my response is “No”, based on the description Megan copied in the thread and skimming some of the paper. It’s good that the paper includes those experiments, but they don’t really speak to the concerns Adam is discussing. Those concerns, as I see them (I could be misunderstanding):
Conceptual coherence: in humans there are different skills, e.g., between different fields, that don’t seem to easily project onto a time horizon dimension. Or like, our sense of how much intelligence is required for them or how difficult they are does not correspond all that closely with the time taken to do them.
Benchmark bias: solution criteria are known and progress criteria are often known; it’s a big jump from that to the real-world scary things we’re worried about.
Do the experiments in Sec 6 deal with this?
a. No SWAA (“Retrodiction from 2023–2025 data”): Does not deal with 2. Mostly does not deal with 1, as both HCAST + RE-Bench and All-3 are mostly software-engineering dominated with a little bit of other stuff.
b. Messiness factors: Does not speak to 1. This is certainly relevant to 2, but I don’t think it’s conclusive. Quoting from the paper some:
We rated HCAST and RE-Bench tasks on 16 properties that we expected to be 1) representative of how real world tasks might be systematically harder than our tasks and 2) relevant to AI agent performance. Some example factors include whether the task involved a novel situation, was constrained by a finite resource, involved real-time coordination, or was sourced from a real-world context. We labeled RE-bench and HCAST tasks on the presence or absence of these 16 messiness factors, then summed these to obtain a “messiness score” ranging from 0 to 16. Factor definitions can be found in Appendix D.4. The mean messiness score amongst HCAST and RE-Bench tasks is 3.2/16. None of these tasks have a messiness score above 8/16. For comparison, a task like ‘write a good research paper’ would score between 9/16 and 15/16, depending on the specifics of the task.

On HCAST tasks, AI agents do perform worse on messier tasks than would be predicted from the task’s length alone (b=-0.081, R2 = 0.251) … However, trends in AI agent performance over time are similar for lower and higher messiness subsets of our tasks.

This seems like very weak evidence in favor of the hypothesis that Benchmark Bias is a big deal. But they just don’t have very messy tasks.
c. SWE-Bench Verified: doesn’t speak to 1 or 2.
d. Internal PR experiments: Maybe speaks a little to 1 and 2 because these are more real world, closer to the thing we care about tasks, but not much, as they’re still clearly verifiable and still software engineering.
I do think Thomas and Vincent’s follow-up work here on time horizons for other domains is useful evidence pointing a little against the conceptual coherence objection. But only a little.
Any guesses for what’s going on in Wichers et al. 3.6.1?
3.6.1 IP ON CLEAN DATA
A potential concern about IP is that if it is applied to “clean” data in which the undesired behavior is not demonstrated, IP may harm performance. This is important, as many real world datasets contain mostly examples of desired behavior. We test this by applying IP in our settings on clean data. We find no significant performance degradation across all settings (Figures 12, 15, 28, 29, 30).

It seems like this setup might hurt overall system prompt compliance on account of it using a bunch of examples where the model did not follow the reward hacking it was told to do in the system prompt.
More generally, this result combined with the other results seems to imply that a strategy of “just throw a system prompt encouraging malicious behavior in front of every model generation, regardless of whether the generation is malicious or not, and then train on these” will lead to good behavior when that system prompt is removed at deployment time. That just seems implausible.
Yep, thanks.
Hey, thanks for the feedback! I helped write this section. A few notes:
I think you’re right that comparing to consumer GPUs might make more sense, but I think comparing to other computers is still acceptable. I agree that GPUs are where you start running into prohibitions first. But I think it’s totally fair to compare to “average computers” because one of the main things I care about is the cost of the treaty. It’s not so bad if we have to ban top-of-the-line consumer GPUs, but it would be very costly / impossible if we have to ban consumer laptops. So comparing to both of these is reasonable.
The text says “consumer CPUs” because this is what is discussed in the relevant source, and I wanted to stick to that. Due to some editing that happened, it might not have been totally clear where the claim was coming from. The text has been updated and now there’s a clear footnote.
I know that “consumer CPUs” is not literally the best comparison for, say, consumer laptops. For example, MacBooks have an integrated CPU-GPU. I think it is probably true that H100s are like 3-300x better than most consumer laptops at AI tasks, but to my knowledge there is no good citable work explaining this for a wide variety of consumer hardware (I have some mentees working on it now, maybe in a month or two there will be good work!).
I’ll toss out there that NVIDIA sells personal or desktop GPUs that are marketed as being for AI (like this one). These are quite powerful, often within 3x of datacenter GPUs on most performance measures. I expect these to get categorized as “AI chips” under the treaty and thus become controlled. The difference between H100s and top consumer GPUs is not 1000x, and it probably isn’t even 10x. In this tentative draft treaty, we largely try to punt questions like “what exactly counts as an AI chip” to the hypothetical technical body that helps implement the treaty, and my current opinions about this are weak.
Thanks for your reply. Noting that it would have been useful for my understanding if you had also directly answered the 2 clarifying questions I asked.
There are a lot of bad things that AIs can do before literally taking over the world.
Okay, it does sound like you’re saying we can learn from problems A, B, and C in order to inform D. Where D is the model tries to take over once it is smart enough. And A is like jailbreak-ability and B is goal preservation. It seems to me like somebody who wants humanity to gamble on the superalignment strategy (or otherwise build ASI systems at all, though superalignment is a marginally more detailed plan) needs to argue that our methods for dealing with A, B, and C are very likely to generalize to D.
Maybe I’m misunderstanding though, it’s possible that you mean the same AIs that want to eventually take over will also take a bunch of actions to tip their hand earlier on. This seems mostly unlikely to me, because that’s an obviously dumb strategy and I expect ASIs to not pursue dumb strategies. I agree that current AIs do dumb things like this, but these are not the AIs I’m worried about.
Whether that would be enough to take over the world at that point in time is a different question.
To repeat my second clarifying question from above, do you believe that at some point there will be AIs that could succeed at takeover if they tried? If we were talking about the distribution shift that a football team undergoes from training to Game Day, and you didn’t think the game would ever happen, that sounds like it’s the real crux, not some complicated argument about how well the training drills match the game.
Either I don’t understand your view about continuous development and threshold things, or I think it’s probably wrong. I’ve decided to spin this out into its own comment, though I originally wrote it as part of this one.
I’m going to think out loud: One story I could tell for how we cross the gap from “the AI would not succeed at taking over” to “it would succeed if it tried”, in a continuous manner, is as @Nina Panickssery points out in a tweet, “The AI could be more or less likely to succeed”.
(My understanding is that we’re actually focusing on the AI’s beliefs about its likelihood of success, which I’ll call AI-p(success), not p(success) according to some observer—the AI’s beliefs are what shape its decisions.) So here’s an example model: maybe as the AI gets smarter or acquires more resources, its p(success) rises at 1 percentage point per day. We think the AI might do something like an expected value calculation, only trying to take over when p(success) crosses some threshold. (But, like, we don’t know that that’s true. And we don’t know what the threshold is.) If the AI is doing some threshold thing like this, then it looks all good and chill until day 23 or whatever when the threshold is hit and the AI fully commits to a takeover. Under that AI decision procedure, it looks like by default we get this rough discontinuity where we don’t get any more iteration after day 22.
Now imagine a really cautious AI company: maybe during each day they try really hard to trick some instance of the AI into believing its chance of success is higher than it actually is, and they’re able to push up AI-p(success) by 5 percentage points even while the actual likelihood of takeover is not inflated by 5 points (maybe they also do extra security measures for this AI, and together the adversarial testing and security create a 5% spread). So then on day 18 the AI tries to take over and we try not to be unlucky on an 18% takeover chance. And if we succeed at containing the AI, maybe we can run more experiments on this 18-day AI. What experiments are we going to run? I guess we’ll try to make it so that even when AI-p(success) exceeds 23% the AI does not try to take over, but idk how.
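(To make the toy model concrete, here is a minimal numeric sketch of the scenario above. All parameter values are the illustrative ones from this comment, not empirical estimates.)

```python
# Toy model of the threshold story above. All numbers are illustrative,
# matching the hypothetical values in the text, not empirical estimates.

THRESHOLD = 0.23        # AI only attempts takeover once AI-p(success) >= 23%
DAILY_GAIN = 0.01       # actual p(success) rises 1 percentage point per day
RED_TEAM_SPREAD = 0.05  # adversarial testing inflates AI-p(success) by 5 points
                        # without raising the actual likelihood of success

for day in range(1, 31):
    actual_p = DAILY_GAIN * day               # observer's (true) chance of success
    perceived_p = actual_p + RED_TEAM_SPREAD  # what the AI believes
    if perceived_p >= THRESHOLD:
        print(f"Day {day}: AI attempts takeover; "
              f"believed p(success) = {perceived_p:.0%}, actual = {actual_p:.0%}")
        break

# Without the 5-point spread the attempt would come on day 23 at a ~23% real
# chance; with it, the attempt comes on day 18 at an ~18% real chance, giving
# a (risky) window to study the contained model.
```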
There are many issues with this plan:
We don’t actually know the decision process of the AI and don’t know if this model is at all right. Maybe instead it has the totally illogical decision process of trying to take over with effort proportional to AI-p(success), so on day 1 it tries very little, on day 10 some, and on day 50 it tries pretty hard; that seems like an insanely bad strategy (and humanity might actually win if the AIs always took strategies like this), but there’s a bunch of other strategies the AI could be taking, some of which could be better or worse than the model here.
Even if we know the high-level decision process of the AI, we don’t know the threshold. (Aside: I assume the AI is well calibrated and a good forecaster, other than the modifications we make, i.e., its p(success) is accurate.) If the threshold is high, then we are basically screwed in this scenario. Like if the AI follows the policy “only attempt takeover if AI-p(success) ≥ 99%”, then we’re taking on a huge amount of risk (i.e., dying) unless we could somehow artificially modulate AI-p(success) by a ton. Like, using the numbers above, the AI then tries to take over when its actual chance of success is merely 94%, ruh roh.
It’s not obvious that we’ll be able to measure AI-p(success) with much confidence. Even if we knew the threshold, we probably can’t actually measure the thing. AI mind reading seems kinda hard, and while I definitely think it’s plausible that we’ll get this, I might be at like 50/50. Doing the above scheme without being able to measure AI-p(success) seems extremely difficult due to not knowing if your methods for modifying AI-p(success) are working.
I don’t expect us to know observer-p(success) with much confidence. It seems like most of the claims from AI companies about current AI capabilities are pretty soft, unconfident claims. For instance, OpenAI, Anthropic, and Google DM have all said in the last few months that their frontier models might be capable of significant CBRN (bio especially) uplift for unskilled actors, but they aren’t sure. OpenAI: “We do not have definitive evidence that this model could meaningfully help a novice to create severe biological harm, our defined threshold for High capability, and the model remains on the cusp of being able to reach this capability.” Anthropic: “Although we have not determined whether the ASL-3 Standard is required for this model”. Google: “Further study is required to reach our final assessment of whether the model has reached the CCL for CBRN Uplift Level 1.” Now maybe this is just corporate hedging, but I think it’s way more likely, and supported by a preponderance of evidence, that assessing AI capabilities, especially for real world tasks, is hard to do and we are currently bad at it. I basically expect that in a couple years we’ll enter a regime of “the experts do not know if frontier AIs would be able to successfully escape their containment or successfully take over if they tried, it seems like at least 1% likely”, and probably our p(success) will go up as AI capabilities increase, but it won’t ever be well justified and there will be significant disagreement about the number among experts.
It’s unclear if we’ll be able to change AI-p(success) much. 5 percentage points felt reasonable but idk. This is certainly a place where if we don’t try very hard at this kind of red-teaming, then we certainly won’t succeed at this part.
Unfortunately, it looks like it might be dangerous to experiment with the 18-day AI, on account of it having an 18% chance of successful takeover. It’s unclear how correlated outcomes will be across different experiments (i.e., are you rolling 18% each time or not), especially considering how the humans can potentially patch various pathways that different instances take when attempting takeover. But gosh, looks rough.
It’s not clear what experiments one would run with the 18-day AI that would actually help cross the gap robustly. For instance, maybe we’ll just end up pushing the threshold up a couple days or changing the AI’s decision process; it seems like there are a lot of ways we could fail at this step.
I could be totally misunderstanding Nina’s idea, this is all very complicated.
I’m not sure what Eliezer thinks, but I don’t think it’s true that “you cannot draw any useful lessons from [earlier] cases”, and that seems like a strawman of the position. They make a bunch of analogies in the book, like you launch a rocket ship, and after it’s left the ground, your ability to make adjustments is much lower; sure you can learn a bunch in simulation and test exercises and laboratory environments, but you are still crossing some gap (see p. ~163 in the book for full analogy). There are going to be things about the Real Deal deployment that you were not able to test for. One of those things for AI is that “try to take over” is a more serious strategy, somewhat tautologically because the book defines the gap as:
Before, the AI is not powerful enough to kill us all, nor capable enough to resist our attempts to change its goals. After, the artificial superintelligence must never try to kill us, because it would succeed. (p. 161)
I don’t see where you are defusing this gap or making it nicely continuous such that we could iteratively test our alignment plans as we cross it.
It seems like maybe you’re just accepting that there is this one problem that we won’t be able to get direct evidence about in advance, but you’re optimistic that we will learn from our efforts to solve various other AI problems which will inform this problem.
When you say “by studying which alignment methods scale and do not scale, we can obtain valuable information”, my interpretation is that you’re basically saying “by seeing how our alignment methods work on problems A, B, and C, we can obtain valuable information about how they will do on separate problem D”. Is that right?
Just to confirm, do you believe that at some point there will be AIs that could succeed at takeover if they tried? Sometimes I can’t tell if the sticking point is that people don’t actually believe in the second regime.
I don’t believe that there is going to be an alignment technique that works one day and completely fails after a 200K GPU 16 hour run.
There are rumors that many capability techniques work well at a small scale but don’t scale very well. I’m not sure this is well studied, but if it was, that would give us some evidence about this question. Another relevant result that comes to mind is reward hacking and Goodharting, where models often look good when only a little optimization pressure is applied, but then it’s pretty easy to overoptimize as you scale up; as I think about these examples, it actually seems like this phenomenon is pretty common? And sure, we can quibble about how much optimization pressure is applied in current RL vs. some unknown parallel scaling method, but it seems quite plausible that things will be different at scale and sometimes for the worse.
it is easier for me to get speakers from OpenAI
FWIW, I’m a bit surprised by this. I’ve heard of many AI safety programs succeeding in getting a bunch of interesting speakers from across the field. In case you haven’t tried very hard, consider sending some cold emails, the hit rate is often high in this community.
I’m not aware of a better place to put replies to Asterisk work, so I’ll leave a comment here complaining about something else in Clara’s review. (Disclaimer: I work at MIRI but am speaking for myself.)
In fact, there are plenty of reasons why the fact that AIs are grown and not crafted might cut against the MIRI argument. For one: The most advanced, generally capable AI systems around today are trained on human-generated text, encoding human values and modes of thought. So far, when these AIs have acted against the interests of humans, the motives haven’t exactly been alien. If sycophantic chatbots tempt users into dependency and even psychosis, it’s for the very comprehensible reason that sycophancy increases engagement, which makes the models more profitable. [example about Sydney follows]
I’m pretty familiar with the research on sycophancy, and it was my main focus of research for a few months. The leading hypothesis in the AI alignment community is that sycophancy is basically the result of reward hacking on human feedback (a proxy objective for what we actually want to measure). Unfortunately, we don’t know that this hypothesis is true, and I think we shouldn’t even be very confident in it (I’ll say more at the end of this comment). Broadly, I would claim that we know almost nothing about the internal cognition or cognition-related reasons for current LLM behavior. Given our current state of knowledge, we cannot conclude what Clara asserts, that particular bad behavior happens for a “very comprehensible reason”. Ch 11 of the book, “An Alchemy, Not a Science”, talks about the state of AI alignment.
Personally, I am not against using evidence about the alignment or goals of current AIs as evidence about the goals of future AIs. But we still know so little about the goals of current AIs. There are still so many unexplained edge cases, we don’t seem to be able to predict or control generalization to new distributions, and we have no idea what cognition produces the behavior we see—most evidence is behavioral. (My current take is that the evidence we have is a mix of promising and scary, and thus not a huge update about future AIs, but certainly not an update toward “things will be totally fine”.)
What do you know and how do you know it?
It sure looks to me like the evidence does not support confident conclusions about why current LLMs do bad behavior like sycophancy.
Probably the best paper about sycophancy, Sharma et al. 2023, tries to test this “reward hacking” Hypothesis for sycophancy in Section 4. Under this Hypothesis, we would expect sycophancy to go up when optimizing against a human-feedback-trained preference model. In the following figure, this Hypothesis would predict that the blue line goes up in all three of the plots in (a), and that all the lines go up in (b). It also predicts that in (a) the green line should be lower than the blue line. What did the evidence say:
Well, uhhh, that’s pretty mixed evidence. In (a) the blue line only goes up in 1/3 types of sycophancy, in (b) 2/3 types of sycophancy go up over RL training, and the green line is pretty consistently below the blue line, though not by much on two of the measures—call it 2.5/3. Additionally, a major concern for the Hypothesis is that sycophancy is substantial at the beginning of RL training—probably not what the Hypothesis would have predicted.
I’m not trying to tell you that the reward hacking hypothesis is certainly false, or that a similar hypothesis about increased engagement (as Clara writes) is certainly false. We don’t have the type of understanding that would confer such confidence. I think it is bad practice to act as if one of these hypotheses is so likely to be true that we can usefully learn from it to inform our perception of risks from superintelligent AI. Again, what do you know and how do you know it?
I feel a bit surprised by how much you dislike Section 3. I agree that it does not address ‘the strongest counterarguments and automated-alignment plans that haven’t been written down publicly’; this is a weakness, but that bar seems too demanding given what’s public.
I particularly like the analogy to alchemy presented in Chapter 11. I think it is basically correct (or as correct as analogies get) that the state of AI alignment research is incredibly poor and the field is in its early stages where we have no principled understanding of anything (my belief here is based on reading or skimming basically every AI safety paper in 2024). The next part of the argument is like “we’re not going to be able to get from the present state of alchemy to a ‘mature scientific field that doesn’t screw up certain crucial problems on the first try’ in time”. That is, 1: the field is currently very early stages without principled understanding, 2: we’re not going to be able to get from where we are now to a sufficient level by the time we need it.
My understanding is that your disagreement is with 2? You think that earlier AIs are going to be able to dramatically speed up alignment research (and by using control methods we can get more alignment research out of better AIs, for some intermediate capability levels), getting us to the principled, doesn’t-mess-up-the-first-try-on-any-critical-problem place before ASI.
Leaning into the analogy, I would describe what I view as your position as “with AI assistance, we’re going to go from alchemy to first-shot-moon-landing in ~3 years of wall clock time”. I think it’s correct for people to think this position is very crazy at first glance. I’ve thought about it some and think it’s only moderately crazy. I am glad that Ryan is working on better plans here (and excited to potentially update my beliefs, as I did when you all put out various pieces about AI Control), but I think the correct approach for people hearing about this plan is to be very worried about this plan.
I really liked Section 3, especially Ch 11, because it makes this (IMO) true and important point about the state of the AI alignment field. I think this argument stands on its own as a reason to have an AI moratorium, even absent the particular arguments about alignment difficulty in Section 1. Meanwhile, it sounds like you don’t like this section because, to put it disingenuously, “they don’t engage with my favorite automating-alignment plan that tries to get us from alchemy to first-shot-moon-landing in ~3 years of wall clock time and that hasn’t been written down anywhere”.
Also, if you happen to disagree strongly with the analogy to alchemy or 1 above (e.g., think it’s an incorrect frame), that would be interesting to hear! Perhaps the disagreement is in how hard alignment problems will be in the development of ASI; for example, if the alchemists merely had to fly a blimp first try, rather than land a rocket on the moon? Perhaps you don’t expect there to be any significant discontinuities and this whole “first try” claim is wrong and we’ll never need a principled understanding?
I found this post and your review to be quite thoughtful overall!
> It doesn’t particularly address the situation in which an AGI on its own initiative tries to take over the world.
> That is a concern common to all of the governance models

I think this is wrong. The MIRI Technical Governance Team, which I’m part of, recently wrote this research agenda, which includes an “Off switch and halt” plan for governing AI. Stopping AI development before superintelligence directly addresses the situation where an ASI tries to take over the world, by not allowing such AIs to be built. If you like the frame of “who has a veto”, I think at the very least it’s “every nuclear-armed country has a veto” or something similar.
A deterrence framework—which could be leveraged to avoid ASI being built and thus impacts AI takeover risk—also appears in Superintelligence Strategy.
So the margins for input tokens will be more important for the API provider, and the cost there depends on active params (but not total params). While costs for output tokens depend on context size of a particular query and total params (and essentially don’t depend on active params), but get much less volume.
Could you explain the reasoning behind this? Or link to an existing explanation?
I initially shared this on Twitter. I’m copying over here because I don’t think it got enough attention. Here’s my current favorite LLM take (epistemic status is conspiratorial speculation).
We can guesstimate the size of GPT-5 series models by assuming that OpenAI wouldn’t release the gpt-oss models above the cost-performance Pareto curve. We get performance from benchmarks, e.g., the ArtificialAnalysis index.
That gives us the following performance ordering: 20B < nano < 120B < mini < full. But the ordering depends on where you get numbers and what mix. From model cards: Nano < 20B < 120B ≤ mini < full. We get cost from model size; recall the oss models:
20b: 21B-A3.6B
120b: 117B-A5.1B
Now time for the wild speculation. I think this implies that Nano is on the order of 0.5-3B active parameters, Mini is probably in the 3B-6B range, and GPT-5 full is probably in the 4B-40B range. This would be consistent with the 5x pricing diff between each, if, e.g., 1B, 5B, 25B.
This is super speculative obviously. We don’t even know model architecture, and something totally different could be happening behind the scenes.
One implication of this analysis pointing to such small model sizes is that it indicates GPT-5 *really* wasn’t a big scale-up in compute, maybe even a scale down vs. 4, almost certainly a scale down vs. 4.5.
API pricing for each of these models at release (per 1M input/output tokens): GPT-4: $30/$60 (rumored to be 1800B-A280B); GPT-4.5: $75/$150; GPT-5: $1.25/$10. If you’re curious about the pricing ratio from 4 to 5, it might imply active parameter counts in the range of 12B-47B for 5.
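(For what it’s worth, here is the back-of-the-envelope arithmetic behind that last claim, assuming, speculatively, that per-token price scales roughly linearly with active parameter count and that margins are comparable across generations.)

```python
# Back-of-the-envelope estimate of GPT-5 active params from API pricing.
# Assumes per-token price scales ~linearly with active parameters and that
# margins/serving efficiency are comparable across generations -- both are
# speculative assumptions, as is the rumored GPT-4 size.

GPT4_ACTIVE_B = 280           # rumored: 1800B total, ~280B active
gpt4_in, gpt4_out = 30, 60    # $ per 1M tokens at release
gpt5_in, gpt5_out = 1.25, 10

low = GPT4_ACTIVE_B * gpt5_in / gpt4_in     # scale by the input-price ratio
high = GPT4_ACTIVE_B * gpt5_out / gpt4_out  # scale by the output-price ratio
print(f"Implied GPT-5 active params: ~{low:.0f}B to ~{high:.0f}B")
# -> ~12B to ~47B, the range quoted above
```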
Thanks to ArtificialAnalysis who is doing a public service by aggregating data about models. Also here’s a fun graph.
The successors will have sufficiently similar goals as its predecessor by default. It’s hard to know how likely this is, but note that this is basically ruled out if the AI has self-regarding preferences.
1. Why? What does self-regarding preferences mean and how does it interact with the likelihood of predecessor AIs sharing goals with later AIs?
Given that the lab failed to align the AI, it’s unclear why the AI will be able to align its successor, especially if it has the additional constraint of having to operate covertly and with scarcer feedback loops.
2. I don’t think this is right. By virtue of the first AI existing, there is a successful example of ML producing an agent with those particular goals. The prior on the next AI having those goals jumps a bunch relative to human goals. (Vague credit to evhub, who I think I heard this argument from.) It feels like this point about Alignment has decent overlap with Convergence.
Overall, very interesting and good post.
This increase occurred between 1950 and 1964, and leveled off thereafter.
Hm, this data doesn’t feel horribly strong to me. What happened from 1965-1969? Why is that data point relatively low? It seems inconsistent with the poisoning theory. My prior is that data is noisy and it is easy to see effects that don’t mean much. But this is an interesting and important topic, and I’m sorry it’s infeasible to access better data.
Neat, weird.
I get similar results when I ask “What are the best examples of reward hacking in LLMs?” (GPT-4o). When I then ask for synonyms of “Thumbs-up Exploitation” the model still does not mention sycophancy but then I push harder and it does.
Asking “what is it called when an LLM chat assistant is overly agreeable and tells the user what the user wants to hear?” on the first try the model says sycophancy, but much weirder answers in a couple other generations. Even got a “Sy*cophancy”.
I got the model up to 3,000 tokens/s on a particularly long/easy query.
As an FYI, there has been other work on large diffusion language models, such as this: https://www.inceptionlabs.ai/introducing-mercury
We should also consider that, well, this result just doesn’t pass the sniff test given what we’ve seen RL models do.
FWIW, I interpret the paper to be making a pretty narrow claim about RL in particular. On the other hand, a lot of the production “RL models” we have seen may not be pure RL. For instance, if you wanted to run a similar test to this paper on DeepSeek-V3+, you would compare DeepSeek-V3 to DeepSeek-R1-Zero (pure RL diff, according to the technical report), not to DeepSeek-R1 (trained with a hard-to-follow mix of SFT and RL). R1-Zero is a worse model than R1, sometimes by a large margin.
I think this is true, but I also don’t think seeing the skulls implies actually dealing with them (and I wish Scott’s post was crossposted here so I could argue with it). Like, a critique of AI evaluations that people could have been making for the last 5+ years (probably even 50) and which remains true today is “Evaluations do a poor job measuring progress toward AGI because they lack external validity. They test scenarios that are much narrower, better defined, more contrived, easier to evaluate, etc., compared to the skills an AI would need to robustly demonstrate for us to call it AGI.” I agree that METR is well aware of this critique, but the critique is still very much true of HCAST, RE-Bench, and SWAA. Folks at METR seem especially forward about discussing the limitations of their work in this regard, and yet the critique is still true. (I don’t think I’m disagreeing with you at all.)