Things I post should be considered my personal opinions, not those of any employer, unless stated otherwise.
Aaron_Scher
We care about the performance prediction at a given point in time for skills like “take over the world”, “invent new science”, and “do RSI” (and “automate AI R&D”, which I think the benchmark does speak to). We would like to know when those skills will be developed.
In the frame of this benchmark, and Thomas and Vincent’s follow-up work, it seems like we’re facing down at least three problems:
The original time horizons tasks are clearly out of the distribution we care about. Solution: create a new task suite we think is the right distribution.
We don’t know how well time horizons will do at predicting future capabilities, even in this domain. Solution: keep collecting new data as it comes out in order to test predictions on whatever distributions we have, examine things like the conceptual coherence objection and try to make progress.
We don’t know how well the general “time horizons” approach works across domains. We have some data on this in the follow up work, maybe it’s a 2:1 update from a 1:1 prior?
So my overall take is that I think the current work I’m aware of tells us:
Small positive update on time horizons being predictive at all.
A small positive update on the specific Software Engineering trends being predictive within distribution.
Small positive update on “time horizons” being common across different reasonable and easy to define distributions.
And on “doubling time in the single digit months” being the rate of time horizon increase across many domains.
A small negative update on the specific time horizon length from one task distribution generalizing to other task distributions (maybe an update, tbh the prior is much lower than 50/50). So it tells us approximately nothing about the performance prediction at a given point in time for the capabilities I care about.
That you think are sufficient to meaningfully change how valid people should interpret the overall predicted trend?
I’m not Adam, but my response is “No”, based on the description Megan copied in the thread and skimming some of the paper. It’s good that the paper includes those experiments, but they don’t really speak to the concerns Adam is discussing. Those concerns, as I see them (I could be misunderstanding):
Conceptual coherence: in humans there are different skills, e.g., between different fields, that don’t seem to easily project onto a time horizon dimension. Or like, our sense of how much intelligence is required for them or how difficult they are does not correspond all that closely with the time taken to do them.
Benchmark bias: solution criteria are known and progress criteria are often known; it’s a big jump from that to the real-world scary things we’re worried about.
Do the experiments in Sec 6 deal with this?
a. No SWAA (“Retrodiction from 2023–2025 data”): Does not deal with 2. Mostly does not deal with 1, as both HCAST + RE-Bench and All-3 are mostly software-engineering dominated with a little bit of other stuff.
b. Messiness factors: Does not speak to 1. This is certainly relevant to 2, but I don’t think it’s conclusive. Quoting from the paper some:
We rated HCAST and RE-Bench tasks on 16 properties that we expected to be 1) representative of how real world tasks might be systematically harder than our tasks and 2) relevant to AI agent performance. Some example factors include whether the task involved a novel situation, was constrained by a finite resource, involved real-time coordination, or was sourced from a real-world context. We labeled RE-bench and HCAST tasks on the presence or absence of these 16 messiness factors, then summed these to obtain a “messiness score” ranging from 0 to 16. Factor definitions can be found in Appendix D.4. The mean messiness score amongst HCAST and RE-Bench tasks is 3.2/16. None of these tasks have a messiness score above 8/16. For comparison, a task like ‘write a good research paper’ would score between 9/16 and 15/16, depending on the specifics of the task.

On HCAST tasks, AI agents do perform worse on messier tasks than would be predicted from the task’s length alone (b=-0.081, R2 = 0.251) … However, trends in AI agent performance over time are similar for lower and higher messiness subsets of our tasks.

This seems like very weak evidence in favor of the hypothesis that Benchmark Bias is a big deal. But they just don’t have very messy tasks.
c. SWE-Bench Verified: doesn’t speak to 1 or 2.
d. Internal PR experiments: Maybe speaks a little to 1 and 2 because these are more real world, closer to the thing we care about tasks, but not much, as they’re still clearly verifiable and still software engineering.
I do think Thomas and Vincent’s follow-up work here on time horizons for other domains is useful evidence pointing a little against the conceptual coherence objection. But only a little.
Any guesses for what’s going on in Wichers et al. 3.6.1?
3.6.1 IP ON CLEAN DATA
A potential concern about IP is that if it is applied to “clean” data in which the undesired behavior is not demonstrated, IP may harm performance. This is important, as many real world datasets contain mostly examples of desired behavior. We test this by applying IP in our settings on clean data. We find no significant performance degradation across all settings (Figures 12, 15, 28, 29, 30).

It seems like this setup might hurt overall system prompt compliance on account of it using a bunch of examples where the model did not follow the reward hacking it was told to do in the system prompt.
More generally, this result combined with the other results seems to imply that a strategy of “just throw a system prompt encouraging malicious behavior in front of every model generation, regardless of whether the generation is malicious or not, and then train on these” will lead to good behavior when that system prompt is removed at deployment time. That just seems implausible.
Yep, thanks.
Hey, thanks for the feedback! I helped write this section. A few notes:
I think you’re right that comparing to consumer GPUs might make more sense, but I think comparing to other computers is still acceptable. I agree that GPUs are where you start running into prohibitions first. But I think it’s totally fair to compare to “average computers” because one of the main things I care about is the cost of the treaty. It’s not so bad if we have to ban top-of-the-line consumer GPUs, but it would be very costly / impossible if we have to ban consumer laptops. So comparing to both of these is reasonable.
The text says “consumer CPUs” because this is what is discussed in the relevant source, and I wanted to stick to that. Due to some editing that happened, it might not have been totally clear where the claim was coming from. The text has been updated and now there’s a clear footnote.
I know that “consumer CPUs” is not literally the best comparison for, say, consumer laptops. For example, MacBooks have an integrated CPU-GPU. I think it is probably true that H100s are like 3-300x better than most consumer laptops at AI tasks, but to my knowledge there is no good citable work explaining this for a wide variety of consumer hardware (I have some mentees working on it now, maybe in a month or two there will be good work!).
I’ll toss out there that NVIDIA sells personal or desktop GPUs that are marketed as being for AI (like this one). These are quite powerful, often within 3x of datacenter GPUs on most performance measures. I expect these to get categorized as “AI chips” under the treaty and thus become controlled. The difference between H100s and top consumer GPUs is not 1000x, and it probably isn’t even 10x. In this tentative draft treaty, we largely try to punt questions like “what exactly counts as an AI chip” to the hypothetical technical body that helps implement the treaty, and my current opinions about this are weak.
Thanks for your reply. Noting that it would have been useful for my understanding if you had also directly answered the 2 clarifying questions I asked.
There are a lot of bad things that AIs can do before literally taking over the world.
Okay, it does sound like you’re saying we can learn from problems A, B, and C in order to inform D. Where D is the model tries to take over once it is smart enough. And A is like jailbreak-ability and B is goal preservation. It seems to me like somebody who wants humanity to gamble on the superalignment strategy (or otherwise build ASI systems at all, though superalignment is a marginally more detailed plan) needs to argue that our methods for dealing with A, B, and C are very likely to generalize to D.
Maybe I’m misunderstanding though, it’s possible that you mean the same AIs that want to eventually take over will also take a bunch of actions to tip their hand earlier on. This seems mostly unlikely to me, because that’s an obviously dumb strategy and I expect ASIs to not pursue dumb strategies. I agree that current AIs do dumb things like this, but these are not the AIs I’m worried about.
Whether that would be enough to take over the world at that point in time is a different question.
To repeat my second clarifying question from above, do you believe that at some point there will be AIs that could succeed at takeover if they tried? If we were talking about the distribution shift that a football team undergoes from training to Game Day, and you didn’t think the game would ever happen, that sounds like it’s the real crux, not some complicated argument about how well the training drills match the game.
Either I don’t understand your view about continuous development and threshold things, or I think it’s probably wrong. I’ve decided to spin this out into its own comment, though I originally wrote it as part of this one.
I’m going to think out loud: One story I could tell for how we cross the gap from “the AI would not succeed at taking over” to “it would succeed if it tried”, in a continuous manner, is as @Nina Panickssery points out in a tweet, “The AI could be more or less likely to succeed”.
(My understanding is that we’re actually focusing on the AI’s beliefs about its likelihood of success, which I’ll call AI-p(success), not p(success) according to some observer—the AI’s beliefs are what shape its decisions.) So here’s an example model: maybe as the AI gets smarter or acquires more resources, its p(success) rises at 1 percentage point per day. We think the AI might do something like an expected value calculation, only trying to take over when p(success) crosses some threshold. (But, like, we don’t know that that’s true. And we don’t know what the threshold is.) If the AI is doing some threshold thing like this, then it looks all good and chill until day 23 or whatever when the threshold is hit and the AI fully commits to a takeover. Under that AI decision procedure, it looks like by default we get this rough discontinuity where we don’t get any more iteration after day 22.
Now imagine a really cautious AI company: maybe during each day they try really hard to trick some instance of the AI into believing its chance of success is higher than it actually is, and they’re able to push up AI-p(success) by 5 percentage points even while the actual likelihood of takeover is not inflated by 5 points (maybe they also do extra security measures for this AI, and together the adversarial testing and security create a 5% spread). So then on day 18 the AI tries to take over and we try not to be unlucky on an 18% takeover chance. And if we succeed at containing the AI, maybe we can run more experiments on this 18-day AI. What experiments are we going to run? I guess we’ll try to make it so that even when AI-p(success) exceeds 23% the AI does not try to take over, but idk how.
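(To make the toy model concrete, here is a minimal numeric sketch of the scenario above. All parameter values are the illustrative ones from this comment, not empirical estimates.)

```python
# Toy model of the threshold story above. All numbers are illustrative,
# matching the hypothetical values in the text, not empirical estimates.

THRESHOLD = 0.23        # AI only attempts takeover once AI-p(success) >= 23%
DAILY_GAIN = 0.01       # actual p(success) rises 1 percentage point per day
RED_TEAM_SPREAD = 0.05  # adversarial testing inflates AI-p(success) by 5 points
                        # without raising the actual likelihood of success

for day in range(1, 31):
    actual_p = DAILY_GAIN * day               # observer's (true) chance of success
    perceived_p = actual_p + RED_TEAM_SPREAD  # what the AI believes
    if perceived_p >= THRESHOLD:
        print(f"Day {day}: AI attempts takeover; "
              f"believed p(success) = {perceived_p:.0%}, actual = {actual_p:.0%}")
        break

# Without the 5-point spread the attempt would come on day 23 at a ~23% real
# chance; with it, the attempt comes on day 18 at an ~18% real chance, giving
# a (risky) window to study the contained model.
```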
There are many issues with this plan:
We don’t actually know the decision process of the AI and don’t know if this model is at all right. Maybe instead it has the totally illogical decision process of trying to take over with effort proportional to AI-p(success), so on day 1 it tries very little, on day 10 some, and on day 50 it tries pretty hard; that seems like an insanely bad strategy (and humanity might actually win if the AIs always took strategies like this), but there’s a bunch of other strategies the AI could be taking, some of which could be better or worse than the model here.
Even if we know the high-level decision process of the AI, we don’t know the threshold. (Aside: I assume the AI is well calibrated and a good forecaster, other than the modifications we make, i.e., its p(success) is accurate.) If the threshold is high, then we are basically screwed in this scenario. Like if the AI follows the policy “only attempt takeover if AI-p(success) ≥ 99%”, then we’re taking on a huge amount of risk (i.e., dying) unless we could somehow artificially modulate AI-p(success) by a ton. Like, using the numbers above, the AI then tries to take over when its actual chance of success is merely 94%, ruh roh.
It’s not obvious that we’ll be able to measure AI-p(success) with much confidence. Even if we knew the threshold, we probably can’t actually measure the thing. AI mind reading seems kinda hard, and while I definitely think it’s plausible that we’ll get this, I might be at like 50/50. Doing the above scheme without being able to measure AI-p(success) seems extremely difficult due to not knowing if your methods for modifying AI-p(success) are working.
I don’t expect us to know observer-p(success) with much confidence. It seems like most of the claims from AI companies about current AI capabilities are pretty soft, unconfident claims. For instance, OpenAI, Anthropic, and Google DM have all said in the last few months that their frontier models might be capable of significant CBRN (bio especially) uplift for unskilled actors, but they aren’t sure. OpenAI: “We do not have definitive evidence that this model could meaningfully help a novice to create severe biological harm, our defined threshold for High capability, and the model remains on the cusp of being able to reach this capability.” Anthropic: “Although we have not determined whether the ASL-3 Standard is required for this model”. Google: “Further study is required to reach our final assessment of whether the model has reached the CCL for CBRN Uplift Level 1.” Now maybe this is just corporate hedging, but I think it’s way more likely, and supported by a preponderance of evidence, that assessing AI capabilities, especially for real world tasks, is hard to do and we are currently bad at it. I basically expect that in a couple years we’ll enter a regime of “the experts do not know if frontier AIs would be able to successfully escape their containment or successfully take over if they tried, it seems like at least 1% likely”, and probably our p(success) will go up as AI capabilities increase, but it won’t ever be well justified and there will be significant disagreement about the number among experts.
It’s unclear if we’ll be able to change AI-p(success) much. 5 percentage points felt reasonable but idk. This is certainly a place where if we don’t try very hard at this kind of red-teaming, then we certainly won’t succeed at this part.
Unfortunately, it looks like it might be dangerous to experiment with the 18-day AI, on account of it having an 18% chance of successful takeover. It’s unclear how correlated outcomes will be across different experiments (i.e., are you rolling 18% each time or not), especially considering how the humans can potentially patch various pathways that different instances take when attempting takeover. But gosh, looks rough.
It’s not clear what experiments one would run with the 18-day AI that would actually help cross the gap robustly. For instance, maybe we’ll just end up pushing the threshold up a couple days or changing the AI’s decision process; it seems like there are a lot of ways we could fail at this step.
I could be totally misunderstanding Nina’s idea, this is all very complicated.
I’m not sure what Eliezer thinks, but I don’t think it’s true that “you cannot draw any useful lessons from [earlier] cases”, and that seems like a strawman of the position. They make a bunch of analogies in the book, like you launch a rocket ship, and after it’s left the ground, your ability to make adjustments is much lower; sure you can learn a bunch in simulation and test exercises and laboratory environments, but you are still crossing some gap (see p. ~163 in the book for full analogy). There are going to be things about the Real Deal deployment that you were not able to test for. One of those things for AI is that “try to take over” is a more serious strategy, somewhat tautologically because the book defines the gap as:
Before, the AI is not powerful enough to kill us all, nor capable enough to resist our attempts to change its goals. After, the artificial superintelligence must never try to kill us, because it would succeed. (p. 161)
I don’t see where you are defusing this gap or making it nicely continuous such that we could iteratively test our alignment plans as we cross it.
It seems like maybe you’re just accepting that there is this one problem that we won’t be able to get direct evidence about in advance, but you’re optimistic that we will learn from our efforts to solve various other AI problems which will inform this problem.
When you say “by studying which alignment methods scale and do not scale, we can obtain valuable information”, my interpretation is that you’re basically saying “by seeing how our alignment methods work on problems A, B, and C, we can obtain valuable information about how they will do on separate problem D”. Is that right?
Just to confirm, do you believe that at some point there will be AIs that could succeed at takeover if they tried? Sometimes I can’t tell if the sticking point is that people don’t actually believe in the second regime.
I don’t believe that there is going to be an alignment technique that works one day and completely fails after a 200K GPU 16 hour run.
There are rumors that many capability techniques work well at a small scale but don’t scale very well. I’m not sure this is well studied, but if it was, that would give us some evidence about this question. Another relevant result that comes to mind is reward hacking and Goodharting, where models often look good when only a little optimization pressure is applied, but then it’s pretty easy to overoptimize as you scale up; as I think about these examples, it actually seems like this phenomenon is pretty common? And sure, we can quibble about how much optimization pressure is applied in current RL vs. some unknown parallel scaling method, but it seems quite plausible that things will be different at scale and sometimes for the worse.
it is easier for me to get speakers from OpenAI
FWIW, I’m a bit surprised by this. I’ve heard of many AI safety programs succeeding in getting a bunch of interesting speakers from across the field. In case you haven’t tried very hard, consider sending some cold emails, the hit rate is often high in this community.
I’m not aware of a better place to put replies to Asterisk work, so I’ll leave a comment here complaining about something else in Clara’s review. (Disclaimer: I work at MIRI but am speaking for myself.)
In fact, there are plenty of reasons why the fact that AIs are grown and not crafted might cut against the MIRI argument. For one: The most advanced, generally capable AI systems around today are trained on human-generated text, encoding human values and modes of thought. So far, when these AIs have acted against the interests of humans, the motives haven’t exactly been alien. If sycophantic chatbots tempt users into dependency and even psychosis, it’s for the very comprehensible reason that sycophancy increases engagement, which makes the models more profitable. [example about Sydney follows]
I’m pretty familiar with the research on sycophancy, and it was my main focus of research for a few months. The leading hypothesis in the AI alignment community is that sycophancy is basically the result of reward hacking on human feedback (a proxy objective for what we actually want to measure). Unfortunately, we don’t know that this hypothesis is true, and I think we shouldn’t even be very confident in it (I’ll say more at the end of this comment). Broadly, I would claim that we know almost nothing about the internal cognition or cognition-related reasons for current LLM behavior. Given our current state of knowledge, we cannot conclude what Clara asserts, that particular bad behavior happens for a “very comprehensible reason”. Ch 11 of the book, “An Alchemy, Not a Science”, talks about the state of AI alignment.
Personally, I am not against using evidence about the alignment or goals of current AIs as evidence about the goals of future AIs. But we still know so little about the goals of current AIs. There are still so many unexplained edge cases, we don’t seem to be able to predict or control generalization to new distributions, and we have no idea what cognition produces the behavior we see—most evidence is behavioral. (My current take is that the evidence we have is a mix of promising and scary, and thus not a huge update about future AIs, but certainly not an update toward “things will be totally fine”.)
What do you know and how do you know it?
It sure looks to me like the evidence does not support confident conclusions about why current LLMs do bad behavior like sycophancy.
Probably the best paper about sycophancy, Sharma et al. 2023, tries to test this “reward hacking” Hypothesis for sycophancy in Section 4. Under this Hypothesis, we would expect sycophancy to go up when optimizing against a human-feedback-trained preference model. In the following figure, this Hypothesis would predict that the blue line goes up in all three of the plots in (a), and that all the lines go up in (b). It also predicts that in (a) the green line should be lower than the blue line. What did the evidence say:
Well, uhhh, that’s pretty mixed evidence. In (a) the blue line only goes up in 1/3 types of sycophancy, in (b) 2/3 types of sycophancy go up over RL training, and the green line is pretty consistently below the blue line, though not by much on two of the measures—call it 2.5/3. Additionally, a major concern for the Hypothesis is that sycophancy is substantial at the beginning of RL training—probably not what the Hypothesis would have predicted.
I’m not trying to tell you that the reward hacking hypothesis is certainly false, or that a similar hypothesis about increased engagement (as Clara writes) is certainly false. We don’t have the type of understanding that would confer such confidence. I think it is bad practice to act as if one of these hypotheses is so likely to be true that we can usefully learn from it to inform our perception of risks from superintelligent AI. Again, what do you know and how do you know it?
I feel a bit surprised by how much you dislike Section 3. I agree that it does not address ‘the strongest counterarguments and automated-alignment plans that haven’t been written down publicly’; this is a weakness, but that bar seems too demanding given what’s public.
I particularly like the analogy to alchemy presented in Chapter 11. I think it is basically correct (or as correct as analogies get) that the state of AI alignment research is incredibly poor and the field is in its early stages where we have no principled understanding of anything (my belief here is based on reading or skimming basically every AI safety paper in 2024). The next part of the argument is like “we’re not going to be able to get from the present state of alchemy to a ‘mature scientific field that doesn’t screw up certain crucial problems on the first try’ in time”. That is, 1: the field is currently very early stages without principled understanding, 2: we’re not going to be able to get from where we are now to a sufficient level by the time we need it.
My understanding is that your disagreement is with 2? You think that earlier AIs are going to be able to dramatically speed up alignment research (and by using control methods we can get more alignment research out of better AIs, for some intermediate capability levels), getting us to the principled, doesn’t-mess-up-the-first-try-on-any-critical-problem place before ASI.
Leaning into the analogy, I would describe what I view as your position as “with AI assistance, we’re going to go from alchemy to first-shot-moon-landing in ~3 years of wall clock time”. I think it’s correct for people to think this position is very crazy at first glance. I’ve thought about it some and think it’s only moderately crazy. I am glad that Ryan is working on better plans here (and excited to potentially update my beliefs, as I did when you all put out various pieces about AI Control), but I think the correct approach for people hearing about this plan is to be very worried about this plan.
I really liked Section 3, especially Ch 11, because it makes this (IMO) true and important point about the state of the AI alignment field. I think this argument stands on its own as a reason to have an AI moratorium, even absent the particular arguments about alignment difficulty in Section 1. Meanwhile, it sounds like you don’t like this section because, to put it disingenuously, “they don’t engage with my favorite automating-alignment plan that tries to get us from alchemy to first-shot-moon-landing in ~3 years of wall clock time and that hasn’t been written down anywhere”.
Also, if you happen to disagree strongly with the analogy to alchemy or 1 above (e.g., think it’s an incorrect frame), that would be interesting to hear! Perhaps the disagreement is in how hard alignment problems will be in the development of ASI; for example, if the alchemists merely had to fly a blimp first try, rather than land a rocket on the moon? Perhaps you don’t expect there to be any significant discontinuities and this whole “first try” claim is wrong and we’ll never need a principled understanding?
I found this post and your review to be quite thoughtful overall!
> It doesn’t particularly address the situation in which an AGI on its own initiative tries to take over the world.
> That is a concern common to all of the governance models

I think this is wrong. The MIRI Technical Governance Team, which I’m part of, recently wrote this research agenda, which includes an “Off switch and halt” plan for governing AI. Stopping AI development before superintelligence directly addresses the situation where an ASI tries to take over the world, by not allowing such AIs to be built. If you like the frame of “who has a veto”, I think at the very least it’s “every nuclear-armed country has a veto” or something similar.
A deterrence framework—which could be leveraged to avoid ASI being built and thus impacts AI takeover risk—also appears in Superintelligence Strategy.
So the margins for input tokens will be more important for the API provider, and the cost there depends on active params (but not total params). While costs for output tokens depend on context size of a particular query and total params (and essentially don’t depend on active params), but get much less volume.
Could you explain the reasoning behind this? Or link to an existing explanation?
I initially shared this on Twitter. I’m copying over here because I don’t think it got enough attention. Here’s my current favorite LLM take (epistemic status is conspiratorial speculation).
We can guesstimate the size of GPT-5 series models by assuming that OpenAI wouldn’t release the gpt-oss models above the cost-performance Pareto curve. We get performance from benchmarks, e.g., the ArtificialAnalysis index.
That gives us the following performance ordering: 20B < nano < 120B < mini < full. But the ordering depends on where you get numbers and what mix. From model cards: Nano < 20B < 120B ≤ mini < full. We get cost from model size; recall the oss models:
20b: 21B-A3.6B
120b: 117B-A5.1B
Now time for the wild speculation. I think this implies that Nano is on the order of 0.5-3B active parameters, Mini is probably in the 3B-6B range, and GPT-5 full is probably in the 4B-40B range. This would be consistent with the 5x pricing diff between each, if, e.g., 1B, 5B, 25B.
This is super speculative obviously. We don’t even know model architecture, and something totally different could be happening behind the scenes.
One implication of this analysis pointing to such small model sizes is that it indicates GPT-5 *really* wasn’t a big scale-up in compute, maybe even a scale down vs. 4, almost certainly a scale down vs. 4.5.
API pricing for each of these models at release (per 1M input/output tokens): GPT-4: $30/$60 (rumored to be 1800B-A280B); GPT-4.5: $75/$150; GPT-5: $1.25/$10. If you’re curious about the pricing ratio from 4 to 5, it might imply active parameter counts in the range of 12B-47B for 5.
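(For what it’s worth, here is the back-of-the-envelope arithmetic behind that last claim, assuming, speculatively, that per-token price scales roughly linearly with active parameter count and that margins are comparable across generations.)

```python
# Back-of-the-envelope estimate of GPT-5 active params from API pricing.
# Assumes per-token price scales ~linearly with active parameters and that
# margins/serving efficiency are comparable across generations -- both are
# speculative assumptions, as is the rumored GPT-4 size.

GPT4_ACTIVE_B = 280           # rumored: 1800B total, ~280B active
gpt4_in, gpt4_out = 30, 60    # $ per 1M tokens at release
gpt5_in, gpt5_out = 1.25, 10

low = GPT4_ACTIVE_B * gpt5_in / gpt4_in     # scale by the input-price ratio
high = GPT4_ACTIVE_B * gpt5_out / gpt4_out  # scale by the output-price ratio
print(f"Implied GPT-5 active params: ~{low:.0f}B to ~{high:.0f}B")
# -> ~12B to ~47B, the range quoted above
```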
Thanks to ArtificialAnalysis who is doing a public service by aggregating data about models. Also here’s a fun graph.
The successors will have sufficiently similar goals as its predecessor by default. It’s hard to know how likely this is, but note that this is basically ruled out if the AI has self-regarding preferences.
1. Why? What does self-regarding preferences mean and how does it interact with the likelihood of predecessor AIs sharing goals with later AIs?
Given that the lab failed to align the AI, it’s unclear why the AI will be able to align its successor, especially if it has the additional constraint of having to operate covertly and with scarcer feedback loops.
2. I don’t think this is right. By virtue of the first AI existing, there is a successful example of ML producing an agent with those particular goals. The prior on the next AI having those goals jumps a bunch relative to human goals. (Vague credit to evhub, who I think I heard this argument from.) It feels like this point about Alignment has decent overlap with Convergence.
Overall, very interesting and good post.
This increase occurred between 1950 and 1964, and leveled off thereafter.
Hm, this data doesn’t feel horribly strong to me. What happened from 1965-1969? Why is that data point relatively low? It seems inconsistent with the poisoning theory. My prior is that data is noisy and it is easy to see effects that don’t mean much. But this is an interesting and important topic, and I’m sorry it’s infeasible to access better data.
Neat, weird.
I get similar results when I ask “What are the best examples of reward hacking in LLMs?” (GPT-4o). When I then ask for synonyms of “Thumbs-up Exploitation” the model still does not mention sycophancy but then I push harder and it does.
Asking “what is it called when an LLM chat assistant is overly agreeable and tells the user what the user wants to hear?” on the first try the model says sycophancy, but much weirder answers in a couple other generations. Even got a “Sy*cophancy”.
I got the model up to 3,000 tokens/s on a particularly long/easy query.
As an FYI, there has been other work on large diffusion language models, such as this: https://www.inceptionlabs.ai/introducing-mercury
We should also consider that, well, this result just doesn’t pass the sniff test given what we’ve seen RL models do.
FWIW, I interpret the paper to be making a pretty narrow claim about RL in particular. On the other hand, a lot of the production “RL models” we have seen may not be pure RL. For instance, if you wanted to run a similar test to this paper on DeepSeek-V3+, you would compare DeepSeek-V3 to DeepSeek-R1-Zero (pure RL diff, according to the technical report), not to DeepSeek-R1 (trained with a hard-to-follow mix of SFT and RL). R1-Zero is a worse model than R1, sometimes by a large margin.
I think this is true, but I also don’t think seeing the skulls implies actually dealing with them (and I wish Scott’s post was crossposted here so I could argue with it). Like, a critique of AI evaluations that people could have been making for the last 5+ years (probably even 50) and which remains true today is “Evaluations do a poor job measuring progress toward AGI because they lack external validity. They test scenarios that are much narrower, better defined, more contrived, easier to evaluate, etc., compared to the skills an AI would need to robustly demonstrate for us to call it AGI.” I agree that METR is well aware of this critique, but the critique is still very much true of HCAST, RE-Bench, and SWAA. Folks at METR seem especially forward about discussing the limitations of their work in this regard, and yet the critique is still true. (I don’t think I’m disagreeing with you at all.)