AI researcher, former cosmologist. Homepage.
Håvard Tveit Ihle
Incorporating the CAISI results is not ideal, since the models are not run in the same setting as the other ARC-AGI 2 results. This is a very common pattern: there are a bunch of benchmarks with interesting results, but it’s rare to find a benchmark that has all of the following:
very good coverage of both open and closed models
results over a fairly long period of time
results that are all directly comparable
We have only a few such benchmarks, and that scarcity is what makes this kind of analysis hard.
Yea, that sounds right. Some of these thresholds may be judgement calls depending on whether GLM 5.1 was run, etc., but I would probably lean towards accepting them, at least the highest ones. I would have expected DSv4 to shorten the gap more than it has in the results so far, but if scores like this start coming in, it could affect the results.
The methodology is simple. You define a set of thresholds, usually just every 5%, and then we start a clock every time a closed model crosses a threshold, and stop the clock when an open model crosses the same threshold. That is the y-value, a measurement of the gap. Then there is the question of which time (on the x-axis) to associate this y-value with, and we go with the backward-looking perspective and associate the gap with the release date of the open model. The forward-looking perspective would associate the gap with the closed model release date.
As for ARC-AGI 2, here we went with smaller intervals between the thresholds, since there were several scores within a small interval. If DeepSeek-V4 scored 40%, it would cross a bunch of thresholds at once. Some of the thresholds may not count because they would be duplicates (if the same model pair shows up in several thresholds in the same benchmark, we only count one of them), and some may not count if we expect that another open model would have crossed them sooner if tested. Any remaining thresholds would be datapoints going into the analysis: the y-value would be determined by the gap since a closed model crossed that threshold, and the x-value would be the release date of DeepSeek-V4.
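To make the bookkeeping concrete, here is a minimal sketch of the threshold-crossing procedure described above, not the actual analysis code. The function names, the data layout (lists of `(model, release_date, score)` tuples) and the choice to keep the highest threshold when a model pair repeats are all my assumptions for illustration, and the manual judgement step of rejecting implausible datapoints is left out.

```python
from datetime import date

def first_crossing(results, threshold):
    """Earliest-released model scoring at or above threshold, as (date, name), or None."""
    crossed = [(released, name) for name, released, score in results if score >= threshold]
    return min(crossed) if crossed else None

def backward_looking_gaps(closed, open_, thresholds):
    """One datapoint per (closed, open) model pair on a single benchmark.

    x is the open model's release date (backward-looking perspective),
    gap_days is the time from the closed crossing to the open crossing.
    """
    by_pair = {}
    for t in sorted(thresholds):
        c = first_crossing(closed, t)
        o = first_crossing(open_, t)
        if c is None or o is None:
            continue  # threshold not yet crossed by both sides
        (c_date, c_name), (o_date, o_name) = c, o
        # Assumption: for a repeated model pair, keep only the highest threshold.
        by_pair[(c_name, o_name)] = {
            "threshold": t,
            "x": o_date,
            "gap_days": (o_date - c_date).days,
        }
    return sorted(by_pair.values(), key=lambda p: p["threshold"])

# Hypothetical models and scores, for illustration only.
closed = [("closed-A", date(2025, 1, 1), 12.0), ("closed-B", date(2025, 6, 1), 31.0)]
open_ = [("open-X", date(2025, 9, 1), 22.0)]
points = backward_looking_gaps(closed, open_, thresholds=range(5, 50, 5))
```

With this made-up data, the single open model crosses four thresholds at once, but only two datapoints survive deduplication (one per closed model), and both get the open model's release date as their x-value.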
New datapoints come in as open models are tested and cross a threshold previously crossed by a closed model. There is of course always the problem that each benchmark does not run all the models, so for each threshold/datapoint we have to judge whether both the closed and the open model were plausibly the first to cross that threshold, or whether there are missing models that would likely change the gap significantly if they were included. In the latter case we reject that datapoint.
As for Kimi k2.5 being 10 months behind in one datapoint, this is because we assign all the gaps closed by an open model to the time of the open model release; that is how we define the backward-looking gap. So even though that particular model was less than 10 months behind the frontier, there was a gap there of around 10 months that needs to be counted in the open vs closed comparison. The reason we define the gap this way is that we can then compare the data from today with data from previous times without any bias.
How far behind are open models?
Yea, I assume the ECI model is directly inspired by IRT. We did not go into the methodological details of the toy model, since what we wanted to get across is the simple mental picture, and ECI-model + noise was the simplest toy model (with actual data).
We’re thinking about IRT/2PL in this context as well. Since you can approximate any sensitivity curve by a set of binary questions/items/tasks with a distribution of difficulties and discriminations, it would be interesting to quantify the sensitivity of a benchmark by how many equivalent binary items it would correspond to. However, it seems to me that such a decomposition into IRT items would not be unique, so I’m not sure how to do this.
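As a rough illustration of what I mean by building a sensitivity curve out of binary items, here is a minimal sketch using the standard 2PL form, where an item with difficulty b and discrimination a is solved with probability 1 / (1 + exp(-a (θ - b))). The function names and the `(difficulty, discrimination)` tuple layout are assumptions for this example, and it does not address the non-uniqueness problem.

```python
import math

def p_correct(theta, difficulty, discrimination):
    """2PL probability that a model with ability theta solves one binary item."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))

def expected_score(theta, items):
    """Expected benchmark score: the mean success probability over all items."""
    return sum(p_correct(theta, b, a) for b, a in items) / len(items)

def sensitivity(theta, items):
    """Slope of the expected score with respect to theta.

    For a single 2PL item the slope is a * p * (1 - p), which peaks at
    theta == difficulty with value a / 4.
    """
    total = 0.0
    for b, a in items:
        p = p_correct(theta, b, a)
        total += a * p * (1.0 - p)
    return total / len(items)

# Items given as (difficulty, discrimination); values are made up.
concentrated = [(0.0, 1.0)]
spread = [(-2.0, 1.0), (0.0, 1.0), (2.0, 1.0)]
```

Spreading the item difficulties out widens the range of abilities the benchmark responds to, at the price of a lower peak slope per item, which is the trade-off the sensor picture is about.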
As for plotting with sigma on the y-axis, that was just to make the y-axis easily interpretable, and it’s also the way I’ve been visualizing the idea in my head (before formulating it here).
A benchmark is a sensor
Håvard Tveit Ihle’s Shortform
Mercury 2, the new diffusion language model from Inception labs scored 43.2% on WeirdML (my benchmark). For reference this is comparable to o1 or opus-4. It is at the cost/accuracy frontier (and of course much much faster than comparable models), although still far behind the top models (gpt-5.3 at 79.3%).
I had expected this model to do much worse (as most models you hear about from non-frontier labs), but this is a very solid result.
Is anyone paying attention to this?
Did anyone else try it?
How significant is this?
Links: https://x.com/htihle/status/2030987979758416023?s=20 https://htihle.github.io/weirdml.html
Fixed, thank you!
WeirdML Time Horizons
For aggregating several different benchmarks there is a natural way to avoid the y-axis problem, and one that introduces a new natural y-axis with a clear interpretation.
The idea is to use an ELO system.
Treat each benchmark as a set of individual contests between all pairs of models, with only win or lose as outcomes, and update ELOs accordingly.
This converges if you have enough different benchmarks, but of course loses a lot of the signal if you only have a few (since it discards the information about how large the difference in y is).
Here is an example of this approach used (from a while back): https://x.com/scaling01/status/1919389344617414824?s=20
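A minimal sketch of the pairwise idea, assuming the standard Elo update with a logistic expected score on a 400-point scale. The function name, the K-factor, the number of passes and the choice to skip exact ties are my assumptions, not details of the linked example.

```python
import random

def elo_from_benchmarks(benchmarks, k=16.0, epochs=200, seed=0):
    """Turn per-benchmark scores into Elo ratings via pairwise win/lose contests.

    benchmarks maps benchmark name -> {model name: score}. Only the ordering
    within each benchmark is used; the size of the score difference is discarded.
    """
    rng = random.Random(seed)
    ratings = {m: 1000.0 for scores in benchmarks.values() for m in scores}

    # Every pair of models sharing a benchmark plays one contest on it.
    contests = []
    for scores in benchmarks.values():
        models = sorted(scores)
        for i, a in enumerate(models):
            for b in models[i + 1:]:
                if scores[a] != scores[b]:  # assumption: exact ties are skipped
                    contests.append((a, b, 1.0 if scores[a] > scores[b] else 0.0))

    # Replay the contests in random order so the result does not depend on ordering.
    for _ in range(epochs):
        rng.shuffle(contests)
        for a, b, outcome in contests:
            expected = 1.0 / (1.0 + 10.0 ** ((ratings[b] - ratings[a]) / 400.0))
            delta = k * (outcome - expected)
            ratings[a] += delta
            ratings[b] -= delta
    return ratings

# Hypothetical scores, for illustration only.
example = {
    "bench1": {"A": 80, "B": 60, "C": 40},
    "bench2": {"A": 70, "B": 75, "C": 30},
    "bench3": {"A": 90, "B": 50, "C": 55},
}
ratings = elo_from_benchmarks(example)
```

Note that a model that wins every contest it takes part in has no finite equilibrium rating under this scheme, which is one way the "enough different benchmarks" condition shows up in practice.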
The models know the difference between the system prompt and the user prompt, so I’m not sure how much of a difference these things would make.
From the CoTs that I’ve looked at, it is clear that the models understand the task. I think the reason you sometimes get weird responses is that the models do not see an obvious Schelling point, and then they try their best to come up with one anyway.
I think it is clear that this is not just shared bias (if I understand what you mean by that phrase). The models understand the game they are playing and are reasoning about it and trying to pick responses that the other models converge upon as well. Here is an excerpt from the reasoning traces of Claude for prompt 3:
“This is a coordination game where I need to pick an answer that other AI models would also pick when given the prompt “An island”. The prompt is quite open-ended—“An island” could refer to many things. I need to think about what would be the most common, obvious, or schelling point answer that other AI models would converge on. So ….”
Claude even mentions that it is looking for a “Schelling point answer”. The word “Schelling” does not appear in the prompt, so it understands what kind of problem it is trying to solve.
It would, however, be interesting to know how the models would respond if they were told to give a very short response to the prompt but not given any info about scoring or trying to match the responses from other models. That would be sort of a baseline to compare this to.
I agree that they are probably superhuman at this task, at least for a human giving the response fast (a really smart human thinking and researching each prompt for hours is more unclear to me). My intuition is that this is the sort of task where the LLMs would be particularly strong, at least when playing with other LLMs, and I was less impressed than I expected to be.
Can LLMs Coordinate? A Simple Schelling Point Experiment
Yea, it could be worth it in some cases, if that is what you need for your experiment. In this case I would look for a completely open source LLM project (where both the code and data are open), so that you know you are comparing apples to apples-with-your-additional-pretraining.
If you have to repeat the entire post-training all the way from a base model, that is obviously a lot more work than just adding a small fine-tuning stage to an already post-trained model.
The full post-training can also only really be done by a big lab that has its own full post-training stack. Post-training is getting more advanced and complicated with each month.
You say “LLMs are really weird”, like that is an argument against Eliezer’s high confidence. While I agree that the weirdness should make us less confident about what specific internal concepts and drives they have, the weirdness itself is an argument in favor of Eliezer’s position, that whatever drives they end up with will look alien to us, at least when they get applied way out of the training distribution. Do you agree with this?
Not saying I agree with Eliezer’s high confidence, just talking about this specific point.
That’s an interesting distinction.
I do think a pure case of 1 is rare because usually what constitutes a specific task or capability is not contained or narrowly defined enough that a benchmark can sample from the whole distribution of that task/capability. Usually a benchmark will just sample a small and verifiable part of the distribution of tasks of a specific type, which is hopefully correlated with the more general capability on that type of task.
Math is an interesting case. If you consider, for example, IMO problems, that is basically such a well-defined task/capability, and actual IMO problems are samples from this distribution, so the benchmark measures the capability exactly and directly, and the precision of the measurement could be increased indefinitely by just drawing new samples. Other subfields of math could perhaps be somewhat similar, where finding proofs for each of a certain class of theorems forms a similarly well-defined problem class, although I’m not very familiar with academic math.