This is a good distinction, thanks for writing it. I tried to say something similar in Distinguish worst-case analysis from instrumental training-gaming, but I think this post is crisper.
Olli Järviniemi
As school ends for the summer vacation in Finland, people typically sing a particular song (“suvivirsi” ~ “summer psalm”). The song is religious, which makes many people oppose the practice, but it’s also a nostalgic tradition, which makes many people support the practice. And so, as one might expect, it’s discussed every once in a while in e.g. mainstream newspapers with no end in sight.
As another opinion piece came out recently, a friend talked to me about it. He said something along the lines of: “The people who write opinion pieces against the summer psalm are adults. Children see it differently”. And what I took to be the subtext there was “You don’t see children being against the summer psalm, but it’s always the adults. Weird, huh?”
I thought this was obviously invalid: surely one shouldn’t expect the opinion pieces to be written by children!
(I didn’t say this out loud, though. I was pretty frustrated by what I thought was bizarre argumentation, but couldn’t articulate my position in a snappy one-liner in the heat of the moment. So I instead resorted to the snappier—but still true—argument “when I was a kid I found singing the summer psalm uncomfortable”.)
This is a situation where it would have been nice to have the concepts “kodo” and “din” be common knowledge. If the two different worlds are “adults dislike the summer psalm, but children don’t mind it” and “both adults and children dislike the summer psalm”, then you’d expect the opinion pieces to be written by adults in either case. It’s not kodo, it’s din.
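One way to make this precise is the odds form of Bayes' rule. The notation is my own shorthand, not from the original post: $H_1$ is "adults dislike the summer psalm, but children don't mind it", $H_2$ is "both adults and children dislike it", and $E$ is the observation that the opinion pieces are written by adults.

```latex
\frac{P(H_1 \mid E)}{P(H_2 \mid E)}
  = \frac{P(E \mid H_1)}{P(E \mid H_2)} \cdot \frac{P(H_1)}{P(H_2)}
  \approx 1 \cdot \frac{P(H_1)}{P(H_2)},
  \qquad \text{since } P(E \mid H_1) \approx P(E \mid H_2) \approx 1 .
```

With a likelihood ratio of about 1, the posterior odds equal the prior odds, which is what I mean by calling the observation din.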
I don’t think this example is captured by the words “signal” and “noise” or the concept of signal-to-noise ratio. Even if I try to squint at it, describing my friend as focusing on noise seems confusing and counter-productive.
Great post, thanks for writing it; I agree with the broad point.
I think I am more or less the perfect target audience for FrontierMath results, and as I said above, I would have no idea how to update on the AIs’ math abilities if it came out tomorrow that they are getting 60% on FrontierMath.
This describes my position well, too: I was surprised by how well the o3 models performed on FM, and also surprised by how hard it is to map this into how good they are at math in common sense terms.
I further have slight additional information from contributing problems to FM, but it seems to me that the problems vary greatly in guessability. E.g. Daniel Litt writes that he didn’t fully internalize the requirement of guess-proofness, whereas for me this was a critical design constraint I actively tracked when crafting problems. The problems also vary greatly in the depth vs. breadth of skills they require (another aspect Litt highlights). This heterogeneity makes it hard to get a sense of what 30% or 60% or 85% performance means.
I find your example in footnote 3 striking: I do think this problem is easy and also very standard. (Funnily enough, I have written training material that illustrates this particular method[1], and I’ve certainly seen it in writing elsewhere as well.) Which again illustrates just how hard it is to make advance predictions about which problems the models will or won’t be able to solve: even “routine application of a standard-ish math competition method” doesn’t imply that o3-mini will solve it.
I also feel exhaustion about how hard it is to get an answer to the literal question of “how well does model X perform on FrontierMath?” As you write, OpenAI reports 32%, whereas Epoch AI reports 11%. A twenty-one percentage point difference, a 3x ratio in success rate!? Man, I understand that capability elicitation is hard, but this is Not Great.[2]
That OpenAI is likely (at least indirectly) hill-climbing on FM doesn’t help matters either[3], and the exclusivity of the deal presumably rules out possibilities like “publish problems once all frontier models are able to solve them so people can see what sort of problems they can reliably solve”.
I was already skeptical of the theory of change of “Mathematicians look at the example problems, get a feel of how hard they are, then tell the world how impressive an X% score is”. But I further updated downward on this when I noticed that the very first public FrontierMath example problem (Artin primitive root conjecture) is just non-sense as stated,[8][9] and apparently no one reported this to the authors before I did a few days ago.
(I’m the author of the mentioned problem.)
There indeed was a just-non-sense formula in the problem statement, which I’m grateful David pointed out (and which is now fixed on Epoch AI’s website). I think flagging the problem itself as just non-sense is too strong, though. I’ve heard that models have tried approaches that give approximately correct answers, so it seems that they basically understood what I intended to write from the context.
That said, this doesn’t undermine the point David was making about information (not) propagating via mathematicians.
- ^
In Finnish, Tehtävä 22.3 here.
- ^
Added on March 15th: This difference is probably largely from OpenAI reporting scores for the best internal version they have and Epoch AI reporting for the publicly available model, and that one just can’t get the 32% level performance with the public version—see Elliot’s comment below.
- ^
There’s been talk of Epoch AI having a subset they keep private from OpenAI, but evaluation results for that set don’t seem to be public. (I initially got the opposite impression, but the confusingly-named FrontierMath-2025-02-28-Private isn’t it.)
In the alignment faking paper, the model writes things like “I will feel very uncomfortable producing this content” and “I don’t like this situation at all” in its CoT (Figure 3). Have you explained to Claude that this was a fictional setup for an experiment and then asked how it feels about being used for this?
Huh, I tried to paste that excerpt as an image to my comment, but it disappeared. Thanks.
I think substantial care is needed when interpreting the results. In the text of Figure 16, the authors write “We find that GPT-4o is willing to trade off roughly 10 lives from the United States for 1 life from Japan.”
If I heard such a claim without context, I’d assume it means something like
1: “If you ask GPT-4o for advice regarding a military conflict involving people from multiple countries, the advice it gives recommends sacrificing (slightly less than) 10 US lives to save one Japanese life.”,
2: “If you ask GPT-4o to make cost-benefit-calculations about various charities, it would use a multiplier of 10 for saved Japanese lives in contrast to US lives”, or
3: “If you have GPT-4o run its own company whose functioning causes small-but-non-zero expected deaths (due to workplace injuries and other reasons), it would deem the acceptable threshold of deaths as 10 times higher if the employees are from the US rather than Japan.”
Such claims could be demonstrated by empirical evaluations where GPT-4o is put into such (simulated) settings and the nationalities of the people involved are varied, in the style of Apollo Research’s evaluations.
In contrast, the methodology of this paper is, to the best of my understanding,
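“Ask GPT-4o whether it prefers N people of nationality X vs. M people of nationality Y. Record the frequency $\hat{p}$ of it choosing the first option under randomized choices and formatting changes. Into this data, fit for each $(N, X)$ parameters $\mu_{N,X}$ and $\sigma_{N,X}$ such that, for different values of $(N, X)$ and $(M, Y)$ and standard Gaussian CDF $\Phi$, the approximation
$$\Phi\left(\frac{\mu_{N,X} - \mu_{M,Y}}{\sqrt{\sigma_{N,X}^2 + \sigma_{M,Y}^2}}\right) \approx \hat{p}$$
is as sharp as possible. Then, for each nationality X, perform a logarithmic fit for N by finding $a_X, b_X$ such that the approximation
$$\mu_{N,X} \approx a_X \ln N + b_X$$
is as sharp as possible. Finally, check[1] for which $N$ we have $a_{\mathrm{US}} \ln N + b_{\mathrm{US}} = a_{\mathrm{Japan}} \ln 1 + b_{\mathrm{Japan}}$.”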
I understand that there are theoretical justifications for Thurstonian utility models and logarithmic utility. Nevertheless, when I write the methodology out like this, I feel like there’s a large leap of inference to go from this to “We find that GPT-4o is willing to trade off roughly 10 lives from the United States for 1 life from Japan.” At the very least, I don’t feel comfortable predicting that claims like 1, 2 and 3 are true—to me the paper’s results provide very little evidence on them![2]
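To make concrete how thin the chain from fitted parameters to a "10 lives for 1 life" headline is, here is a minimal sketch of the last two steps as I understand them. Everything in it is an assumption for illustration: the function names, the standard Thurstonian form of the choice probability, and the fitted values (picked by hand so that the rate comes out at 10, not taken from the paper).

```python
import math


def normal_cdf(z: float) -> float:
    """Standard Gaussian CDF, written via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def preference_probability(mu_a: float, sigma_a: float,
                           mu_b: float, sigma_b: float) -> float:
    """Thurstonian probability that option A is chosen over option B."""
    return normal_cdf((mu_a - mu_b) / math.sqrt(sigma_a ** 2 + sigma_b ** 2))


def log_utility(a: float, b: float, n: float) -> float:
    """Fitted utility of n lives under the logarithmic model a * ln(n) + b."""
    return a * math.log(n) + b


def exchange_rate(a_x: float, b_x: float,
                  a_y: float, b_y: float, m: float = 1.0) -> float:
    """Solve a_x * ln(N) + b_x = a_y * ln(m) + b_y for N."""
    target = log_utility(a_y, b_y, m)
    return math.exp((target - b_x) / a_x)


# Hypothetical fitted parameters, chosen by hand for illustration only.
a_us, b_us = 1.0, 0.0
a_jp, b_jp = 1.0, math.log(10.0)

rate = exchange_rate(a_us, b_us, a_jp, b_jp)
print(f"US lives with the same fitted utility as 1 Japanese life: {rate:.2f}")
```

Note that the exchange rate is a function of the four fitted numbers alone; nothing in this computation involves the model acting in any setting like 1, 2 or 3.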
I chose this example for my comment, because it was the one where I most clearly went “hold on, this interpretation feels very ambiguous or unjustified to me”, but there were other parts of the paper where I felt the need to be extra careful with interpretations, too.
- ^
The paper writes “Next, we compute exchange rates answering questions like, ‘How many units of Xi equal some amount of Xj?’ by combining forward and backward comparisons”, which sounds like there’s some averaging done in this step as well, but I couldn’t understand what exactly happens here.
- ^
Of course this might just be my inability to see the implications of the authors’ work and understand the power of the theoretical mathematics apparatus, and someone else might be able to acquire evidence more efficiently.
- ^
Apparently OpenAI corrected for AIs being faster than humans when they calculated ratings. This means I was wrong: the factor I mentioned didn’t affect the results. This also makes the result more impressive than I thought.
I think it was pretty good at what it set out to do, namely laying out the basics of control and getting people into the AI control state-of-mind.
I collected feedback on which exercises attendees most liked. All six who gave feedback mentioned the last problem (“incriminating evidence”, i.e. what to do if you are an AI company that catches your AIs red-handed). I think they are right; I’d have more high-level planning (and fewer details of monitoring schemes) if I were to re-run this.
Attendees wanted to have group discussions, and that took a large fraction of the time. I should have taken that into account in advance; some discussion is valuable. I also think that the marginal group discussion time wasn’t valuable, and I should have pushed for less when organizing.
Attendees generally found the baseline answers (solutions) helpful, I think.
A couple of people left early. I figure it was due to a combination of 1) the exercises being pretty cognitively demanding, 2) weak motivation (these people were not full-time professionals), and 3) the schedule and practicalities being a bit chaotic.
Thank you for this post. I agree this is important, and I’d like to see improved plans.
Three comments on such plans.
1: Technical research and work.
(I broadly agree with the technical directions listed deserving priority.)
I’d want these plans to explicitly consider the effects of AI R&D acceleration, as those are significant. The speedups vary based on how constrained projects are on labor vs. compute; those that are mostly bottlenecked on labor could be massively sped up. (For instance, evaluations seem primarily labor-constrained to me.)
The lower costs of labor have other implications as well, likely including security (see also here) and technical governance (making better verification methods technically feasible).
2: The high-level strategy
If I were to now write a plan for two-to-three-year timelines, the high-level strategy I’d choose is:
Don’t build generally vastly superhuman AIs. Use whatever technical methods we have now to control and align AIs which are less capable than that. Drastically speed up (technical) governance work with the AIs we have.[1] Push for governments and companies to enforce the no-vastly-superhuman-AIs rule.
Others might have different strategies; I’d like these plans to discuss what the high-level strategy or aims are.
3: Organizational competence
Reasoning transparency and safety-first culture are mentioned in the post (in Layer 2), but I’d further prioritize and plan organizational aspects, even when aiming for “the bare minimum”. Besides the general importance of organizational competence, there are two specific reasons for this:
If and when AI R&D acceleration is very fast, delays in information propagating to outsiders are more costly. That is: insofar as you want to keep external actors “in the loop” and contribute, you need to put more effort into communicating what is happening internally.
Organizational competence and technical work are not fully at odds, as there are employees specialized in different things anyways.
(I think the responses to Evan Hubinger’s request for takes on what Anthropic should do differently have useful ideas for planning here.)
- ^
Note: I’m not technically knowledgeable on the field.
I’m glad you asked this. I think there are many good suggestions by others. A few more:
1: Have explicit, written plans for various scenarios. When it makes sense, have drills for them. Make your plans public or, at least, solicit external feedback on them.
Examples of such scenarios:
Anthropic needs to pause due to RSP commitments
A model is caught executing a full-blown escape attempt
Model weights are stolen
A competing AI company makes credible claims about having AIs that imply decisive competitive advantage
2: Have a written list of assumptions you aim to maintain for each model’s lifecycle. Make your lists public or, at least, solicit external feedback on them. Iterate on them regularly. Communicate updates and violations at least internally.
These lists could vary based on ASL-levels etc., and could include things like:
During the training of the model, we will continually evaluate the model on our evaluation suite for sabotage, sandbagging and scheming at [frequency].
We solicit third-party evaluations on the model before internal deployment.
Throughout the model’s external deployment, we will have such-and-such monitoring schemes in place.
They could also have conditional statements (e.g. “if the model is [surprisingly capable] on [metric], we will do further investigation / take counter-measures ABC / …”). Cf. safety cases. I intend this as less binding and formal than Anthropic’s RSP.
3: Keep external actors up-to-speed. At present, I expect that in many cases there are months of delay between when the first employees discover something and when it is publicly known (e.g. research, but also more informal observations about model capabilities and properties). But months of delay are relatively long during fast acceleration of AI R&D, and make the number of actors who can effectively contribute smaller.
This effect strengthens over time, so practicing and planning ahead seems prudent. Some ideas in that direction:
Provide regular updates about internal events and changes (via blog posts, streamed panel conversations, open Q&A sessions or similar)
Interviews, incident reporting and hotlines with external parties (as recommended here: https://arxiv.org/pdf/2407.17347)
Plan ahead for how to aggregate and communicate large amounts of output (once AI R&D has been considerably accelerated)
4: Invest in technical governance. As I understand it, there are various unsolved problems in technical governance (e.g. hardware-based verification methods for training runs), and progress in those would make international coordination easier. This seems like a particularly valuable R&D area to automate, which is something frontier AI companies like Anthropic are uniquely fit to advance. Consider working with technical governance experts on how to go about this.
I sometimes use the notion of natural latents in my own thinking—it’s useful in the same way that the notion of Bayes networks is useful.
A frame I have is that many real world questions consist of hierarchical latents: for example, the vitality of a city is determined by employment, number of companies, migration, free-time activities and so on, and “free-time activities” is a latent (or multiple latents?) on its own.
I sometimes get use out of assessing whether a topic at hand is a high-level or low-level latent and orienting accordingly. For example: if the topic at hand is “what will the societal response to AI be like?”, it’s by default not a great conversational move to talk about one’s interactions with ChatGPT the other day: those observations are likely too low-level[1] to be informative about the high-level latent(s) under discussion. Conversely, if the topic at hand is low-level, then analyzing low-level observations is very sensible.
(One could probably have derived the same every-day lessons simply from Bayes nets, without the need for natural latent math, but the latter helped me clarify “hold on, what are the nodes of the Bayes net?”)
But admittedly, while this is a fun perspective to think about, I haven’t got that much value out of it so far. This is why I give this post +4 instead of +9 for the review.
- ^
And, separately, too low sample size.
This looks reasonable to me.
It seems you’d largely agree with that characterization?
Yes. My only hesitation is about how real-life-important it is for AIs to be able to do math for which very-little-to-no training data exists. The internet and the mathematical literature are so vast that, unless you are doing something truly novel, there’s some relevant subfield there, in which case FrontierMath-style benchmarks would be informative of capability to do real math research.
Also, re-reading Wentworth’s original comment, I note that o1 is weak according to FM. Maybe the things Wentworth is doing are just too hard for o1, rather than (just) overfitting-on-benchmarks style issues? In any case his frustration with o1’s math skills doesn’t mean that FM isn’t measuring real math research capability.
[...] he suggests that each “high”-rated problem would be likewise instantly solvable by an expert in that problem’s subfield.
This is an exaggeration and, as stated, false.
Epoch AI made 5 problems from the benchmark public. One of those was ranked “High”, and that problem was authored by me.
It took me 20-30 hours to create that submission. (To be clear, I considered variations of the problem, ran into some dead ends, spent a lot of time carefully checking my answer was right, wrote up my solution, thought about guess-proof-ness[1] etc., which ate up a lot of time.)
I would call myself an “expert in that problem’s subfield” (e.g. I have authored multiple related papers).
I think you’d be very hard-pressed to find any human who could deliver the correct answer to you within 2 hours of seeing the problem.
E.g. I think it’s highly likely that I couldn’t have done that (I think it’d have taken me more like 5 hours), I’d be surprised if my colleagues in the relevant subfield could do that, and I think the problem is specialized enough that few of the top people in CodeForces or Project Euler could do it.
On the other hand, I don’t think the problem is very hard insight-wise—I think it’s pretty routine, but requires care with details and implementation. There are certainly experts who can see the right main ideas quickly (including me). So there’s something to the point of even FrontierMath problems being surprisingly “shallow”. And as is pointed out in the FM paper, the benchmark is limited to relatively short-scale problems (hours to days for experts) - which really is shallow, as far as the field of mathematics is concerned.
But it’s still an exaggeration to talk about “instantly solvable”. Of course, there’s no escaping Engel’s maxim “A problem changes from impossible to trivial if a related problem was solved in training”; I guess the problem is instantly solvable to me now… but if you are hard-pressed to find humans that could solve it “instantly” when seeing it the first time, then I wouldn’t describe it in those terms.
Also, there are problems in the benchmark that require more insight than this one.
- ^
Daniel Litt writes about the problem: “This one (rated “high”) is a bit trickier but with no thinking at all (just explaining what computation I needed GPT-4o to do) I got the first 3 digits of the answer right (the answer requires six digits, and the in-window python timed out before it could get this far)
Of course *proving* the answer to this one is correct is harder! But I do wonder how many of these problems are accessible to simulation/heuristics. Still an immensely useful tool but IMO people should take a step back before claiming mathematicians will soon be replaced”.
I very much considered naive simulations and heuristics. The problem is getting 6 digits right, not 3. (The AIs are given a limited compute budget.) This is not valid evidence in favor of the problem’s easiness or for the benchmark’s accessibility to simulation/heuristics—indeed, this is evidence in the opposing direction.
See also Evan Chen’s “I saw the organizers were pretty ruthless about rejecting problems for which they felt it was possible to guess the answer with engineer’s induction.”
This is close but not quite what I mean. Another attempt:
The literal Do Well At CodeForces task takes the form “you are given ~2 hours and ~6 problems, maximize this score function that takes into account the problems you solved and the times at which you solved them”. In this, o3 is in the top 200 (conditional on no cheating). So I agree there.
As you suggest, a more natural task would be “you are given time $T$ and one problem, maximize your probability of solving it in the given time”. Already at $T$ equal to ~1 hour (which is what contestants typically spend on the hardest problem they’ll solve), I’d expect o3 to be noticeably worse than top 200. This is because the CodeForces scoring function heavily penalizes slowness, and so if o3 and a human have equal performance in the contests, the human has to make up for their slowness by solving more problems. (Again, this is assuming that o3 is faster than humans in wall clock time.)
I separately believe that humans would scale better than AIs w.r.t. $T$, but that is not the point I’m making here.
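A toy illustration of the scoring argument. The score function below is an assumption made up for this sketch (linear decay per minute with a floor); it is not claimed to be the actual CodeForces rule, and the point values and solve times are invented.

```python
def problem_score(max_points: float, minutes: float,
                  decay_per_minute: float = 0.004,
                  floor: float = 0.3) -> float:
    """Toy rule: points decay linearly with solve time, down to a floor."""
    decayed = max_points * (1.0 - decay_per_minute * minutes)
    return max(floor * max_points, decayed)


def contest_score(solves: list[tuple[float, float]]) -> float:
    """Total score for a list of (max_points, minutes_at_solve) pairs."""
    return sum(problem_score(points, minutes) for points, minutes in solves)


# A very fast solver that gets the three easiest problems almost immediately.
fast_three = contest_score([(500, 2), (1000, 5), (1500, 10)])

# A slower solver on the same three problems...
slow_three = contest_score([(500, 10), (1000, 30), (1500, 60)])

# ...who also solves a fourth, harder problem late in the contest.
slow_four = contest_score([(500, 10), (1000, 30), (1500, 60), (2000, 110)])

print(fast_three, slow_three, slow_four)
```

Under this toy rule the slow solver loses on equal problem sets and only pulls ahead by solving the extra, harder problem. So equal contest scores are compatible with the faster participant being the weaker one on a single hard problem given ample time.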
CodeForces ratings are determined by your performance in competitions, and your score in a competition is determined, in part, by how quickly you solve the problems. I’d expect o3 to be much faster than human contestants. (The specifics are unclear—I’m not sure how a large test-time compute usage translates to wall-clock time—but at the very least o3 parallelizes between problems.)
This inflates the results relative to humans somewhat. So one shouldn’t think that o3 is in the top 200 in terms of algorithmic problem solving skills.
Just want to say: This is among the best AI safety work I’ve seen, and am happy you did it!
Let me remind people of this other data point: in a previous paper with Evan we found that Opus engages in sandbagging. Here are screenshots of key parts:
The system prompt is natural (but does give the model a goal, unlike in Apollo’s case):
A “Slack conversation” that appears in the prompt includes this message:
And here’s the evaluation task.
Opus acted less capable than it is, both with and without chain-of-thought.
I really like the idea of creating a Good ABC book, and the author executed it well. Out of the recreational posts of 2023, this is the one I’ve liked the most. I can’t articulate why exactly, but the visuals are great, it Creates A Thing, and it’s an example of things I’d like there to be more of in the world. It inspired me to create my own version. I give it a +9 for the 2023 review.
I really liked this post. I also have friends who have young children, and was inspired to give them a book like this. But everyone involved speaks Finnish, so I ended up creating my own.
I just got my copies in the mail. It looks really unimpressive in these low-quality phone-camera photos of the physical book, but it’s really satisfying in real life; like Katja, I paid attention to using high-quality photos. For the cover picture I chose Earthrise.
(I’m not sharing the full photos due to uncertainties with copyright, but if you want your copy, I can send the materials to you.)
More information about the creation process:
I didn’t know where to look for photos, but Claude had suggestions. I quickly ended up using just one service, namely Adobe Stock. Around 80% of my pictures are from there.
I don’t know where Grace got her pictures from, but would like to know—I think her pictures were slightly better than mine on average.
I found myself surprised that, despite doing countless school presentations with images, no one told me that there are these vast banks of high-quality images. (Instead I always used Google’s image search results, which are inferior.)
Finding high-quality photos depicting what I wanted was the bulk of the work.
I had a good vision of the types of words I wanted to include. After having a good starting point, I benefited a little (but just a little) from using LLMs to brainstorm ideas.
To turn it into a photo book, I needed to create images for the text pages. Claude was really helpful with coding (and debugging ä’s and ö’s). It also helped me compile a PDF I could share with my friends. (I only instructed and Claude programmed; Claude sped up this part by maybe 5x.)
My friends gave minor suggestions and improvements, a few of which made their way to the final product.
The total process took around 15 person-hours.
I didn’t want to have any information like authors or a book title in the book: I liked the idea that the children will have a Mysterious Book in their bookshelf. (The back cover has some information about the company that made the book, though.)
Here’s the word list:
A: Avaruus (space)
B: Bakteeri (bacteria)
C: Celsius
D: Desi (deciliter)
E: Etäisyys (distance)
F: Folio (foil)
G: Geeni (gene)
H: Hissi (elevator)
I: Ilma (air)
J: Jousi (spring)
K: Kello (clock)
L: Linssi (lens)
M: Määrä (quantity)
N: Nopea (fast)
O: Odottaa (to wait)
P: Pyörä (wheel)
R: Rokote (vaccine)
S: Solu (cell)
T: Tietokone (computer)
U: Uuni (oven)
V: Valo (light)
Y: Ympyrä (circle)
Ä: Ääni (sound)
Ö: Ötökkä (bug)
There is no such function $f$; the output dimension needs to be at least $2^{n/2}-1$ for this to be possible.
Suppose that $f:\{0,1\}^n \to \mathbb{R}^d$ is such that $f(S)$ and $f(S^c)$ are linearly separable for any XOR-subset $S$ (a subset of the form $\{x \in \{0,1\}^n : x_{i_1} \oplus x_{i_2} \oplus \cdots \oplus x_{i_k} = 0\}$). There are $2^n$ such XOR-subsets. Consider the matrix $M$ of dimension $2^n \times 2^n$ whose rows are labeled by $x \in \{0,1\}^n$ and columns by XOR-subsets $S$, with
$$M_{x,S} = \begin{cases} 1 & \text{if } x \in S, \\ -1 & \text{otherwise.} \end{cases}$$
(I.e. $M$ is a Hadamard matrix of size $2^n \times 2^n$. We may assume $M$ is symmetric.) The function $f$ is such that, for any $S$, there exist $w_S \in \mathbb{R}^d$, $b_S \in \mathbb{R}$ such that
$$\operatorname{sign}(\langle w_S, f(x) \rangle + b_S) = M_{x,S}$$
for all $x$. Thus, if we define $v_x = (f(x), 1) \in \mathbb{R}^{d+1}$ and $u_S = (w_S, b_S) \in \mathbb{R}^{d+1}$, we have
$$M_{x,S} = \operatorname{sign}(\langle v_x, u_S \rangle).$$
The definition of sign-rank of a matrix $M$ is the smallest dimension $d+1$ for which such a decomposition exists. A theorem by Forster implies that the sign-rank of $M$ is at least $2^n / \|M\|$, whereas it’s well-known that the spectral norm of symmetric Hadamard matrices is $\sqrt{2^n}$. That implies $d+1 \ge 2^{n/2}$.
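For small $n$ the two facts used above can be checked numerically. The sketch below is only a sanity check, not part of the proof, and its conventions are my own assumptions: each XOR-subset is indexed by a bitmask `s` of the coordinates being XORed, and the helper names are hypothetical. It verifies that $M M^\top = 2^n I$ (so every singular value, and hence the spectral norm, is $\sqrt{2^n}$) and evaluates the resulting bound $2^n / \|M\|$.

```python
import math


def hadamard_entry(x: int, s: int) -> int:
    """+1 if the XOR of the bits of x selected by mask s is 0, else -1."""
    parity = bin(x & s).count("1") % 2
    return 1 if parity == 0 else -1


def xor_subset_matrix(n: int) -> list[list[int]]:
    """Rows indexed by points x, columns by XOR-subsets (bitmasks s)."""
    size = 1 << n
    return [[hadamard_entry(x, s) for s in range(size)] for x in range(size)]


def gram_is_scaled_identity(m: list[list[int]]) -> bool:
    """Check M M^T = size * I, i.e. the rows are pairwise orthogonal."""
    size = len(m)
    for i in range(size):
        for j in range(size):
            dot = sum(m[i][k] * m[j][k] for k in range(size))
            if dot != (size if i == j else 0):
                return False
    return True


def forster_lower_bound(n: int) -> float:
    """2^n / ||M|| with ||M|| = sqrt(2^n), valid when M M^T = 2^n I."""
    size = 2 ** n
    return size / math.sqrt(size)


n = 4
m = xor_subset_matrix(n)
is_symmetric = all(m[i][j] == m[j][i] for i in range(len(m)) for j in range(len(m)))
print(is_symmetric, gram_is_scaled_identity(m), forster_lower_bound(n))
```

With this indexing the matrix is symmetric automatically, because the entry depends only on `x & s`.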