That must have been the one I read and which informed my decision to minimise daycare hours for my son.
p.b.
I read an anecdote about a Nobel Prize winner (maybe Niels Bohr?) who went on walks with his two sons while they played Battleship against each other in their heads, with him as the arbiter keeping both setups in his head. That always sounded impressive to me because it seems so off the cuff.
When I went on walks with my girlfriend and my roommate we used to play chess against each other. Also just in our heads, each against each, so each playing two games at once. Despite this being much more complex than the Battleship example, it doesn’t impress me much, because I know that this ability comes from playing lots of chess and is the norm in strong chess players.
My roommate could also memorise long random strings within seconds. Very impressive, but based on mnemonic techniques, i.e. learnable.
When I had dinner with a famous scientist, he mentioned something growing at x%. I said, ok, so it doubles every y years. He gave me a surprised look, but it’s just a trick (dividing log(2) ≈ 0.69 by the growth rate gives an approximation of the doubling time).
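For reference, a minimal sketch of the trick in Python, assuming growth compounds once per year. The function names are made up for illustration. The exact doubling time is log(2) / log(1 + r); the shortcut replaces log(1 + r) with r, which works well for small rates.

```python
import math


def doubling_time_exact(rate):
    """Exact doubling time in periods for a per-period growth rate (0.07 = 7%)."""
    return math.log(2) / math.log(1 + rate)


def doubling_time_approx(rate):
    """Mental-math shortcut: log(2) / rate, i.e. roughly 70 divided by the percentage."""
    return math.log(2) / rate


for pct in (1, 3, 7, 10):
    r = pct / 100
    print(f"{pct}%: approx {doubling_time_approx(r):.1f}, exact {doubling_time_exact(r):.1f}")
```

The shortcut always slightly underestimates the doubling time, and the gap grows with the rate.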
Point is: It is hard to use anecdotes as evidence for unusual intelligence. People can learn many impressive-seeming abilities. Everything loses luster once you know how the sausage is made.
This reminded me of The Lord of the Rings: first you get 200 pages about the genealogy of Meriadoc Brandybuck and how one of his paternal great-grandfathers had big hairy feet, then it turns out the story is about some other dude who stares into space a lot.
I enjoyed it because it also reminded me of my math study days in <redacted> city, where all the boys were smart and all the girls were … also smart.
I am, however, confused about the title: How does solving an inconsequential puzzle in the most inefficient way possible showcase “returns to intelligence”?
I read a review about daycare which came to the conclusion that daycare before the age of two had clearly negative effects on the kids, between two and three it was mixed, and after the age of three it was positive (no daycare wasn’t an option for us but this was the reason we reduced hours as much as possible). The reason given was that it is stressful for little kids to be separated from their mom. But now I wonder whether infections don’t play a bigger role than that. Luckily my kids are pretty robust in that regard.
Does it also give you more of a kick?
I replaced coffee with a guarana mixture for stomach reasons, but I really miss the coffee hit (and the smell, taste, preparation and everything else about coffee but what can I do).
I mention Opus 4.6 because it is the predecessor model and this allows a comparison between the numbers that pop out of my analysis and the “official” METR values.
My analysis at least recovered the exponential improvement of time horizons with similar doubling times as the METR analysis, but the concrete values depend on modelling assumptions.
If I find the time I might write it up after all, but here is a short sketch:
Two assumptions:
The logistic curves fitted by METR tend to have quite similar slopes (at least for the later models), so I take the average slope for my fit.
The task completion times of SWE-bench Verified are log-normally distributed. I derive the concrete distribution from commit timestamps, trying to correct for pauses. Here different modelling assumptions don’t change the trend but can change the time horizon values.
With the slope and the distribution I can find, for each percentage score, the position of the logistic that reproduces that score, which gives me the time horizons.
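A rough Python sketch of that procedure, under stated assumptions rather than the actual analysis code. `SLOPE`, `MU` and `SIGMA` are made-up placeholder values, the function names are hypothetical, and the logistic is assumed to be in log2 of task time. The printed horizons will therefore not match the numbers from the real analysis.

```python
import math

# Hypothetical placeholder parameters, NOT the values from the actual analysis.
SLOPE = 0.8          # assumed shared logistic slope (log-odds per doubling of task time)
MU = math.log(0.5)   # assumed mean of ln(task time in hours)
SIGMA = 1.2          # assumed standard deviation of ln(task time in hours)


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def expected_score(log2_h50, slope=SLOPE, mu=MU, sigma=SIGMA, n=801):
    """Benchmark score implied by a logistic centred at log2_h50.

    Averages the per-task success probability over a log-normal
    task-time distribution, using a simple grid over z ~ N(0, 1).
    """
    total = 0.0
    weights = 0.0
    for i in range(n):
        z = -8.0 + 16.0 * i / (n - 1)
        w = math.exp(-0.5 * z * z)  # unnormalised normal density
        log2_t = (mu + sigma * z) / math.log(2.0)
        total += w * sigmoid(slope * (log2_h50 - log2_t))
        weights += w
    return total / weights


def fit_log2_h50(score, lo=-30.0, hi=60.0, iters=60):
    """Bisect for the logistic position that reproduces `score`."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if expected_score(mid) < score:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def horizon_hours(score, p):
    """Task length (hours) at which the fitted logistic predicts success rate p."""
    logit_p = math.log(p / (1.0 - p))
    return 2.0 ** (fit_log2_h50(score) - logit_p / SLOPE)


for s in (0.808, 0.939):
    print(s, horizon_hours(s, 0.5), horizon_hours(s, 0.8))
```

Because the slope is fixed, the 80% horizon is always a constant fraction of the 50% horizon in this setup, and the tail of the task-time distribution determines how sharply the horizons grow as the score approaches 100%.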
I think the realistic assumption is that many people state this because it goes against the current vibe that Claude is better. Those that prefer Claude do not feel the need to belabour the obvious.
My own experience is Opus > GPT-5.4 > Sonnet but Claude seems a lot better at data analysis and GPT-5.4 probably has its own areas of relative dominance.
My model takes the average slope of earlier logistic curves. If for some reason the logistic fitted for Mythos is much less steep than for earlier models, the ratio of the time horizons could be different. We have to wait for a task-level analysis to see that.
Just by looking at the benchmark scores you can see that it is very off-trend. But of course error bars for such long time horizons (even done with much better methodology than mine) are huge.
I never published it because it seemed clear that SWE-bench Verified saturates well below 100%, and without a known saturation point the derived time horizons could really be anything. I wasn’t even sure getting 93.9% was possible.
Of course one doesn’t need to transfer the 93.9% to time horizons to see that this is a huge discontinuous jump.
I ran Claude Mythos’s 93.9% on SWE-bench Verified through my analysis, which estimates time horizons from percentage scores based on the task time distribution derived from commit timestamps.
Compared to Claude Opus 4.6’s 80.8%, this pushes the imputed 50% time horizon from 6h to 34.4h and the 80% time horizon from 1.9h to 11h.
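To put those numbers in terms of growth factors and doublings, here is a small arithmetic check in Python. It only uses the four horizon values quoted above; the variable and function names are made up for illustration.

```python
import math

# Imputed time horizons in hours: (Opus 4.6, Mythos), as quoted above.
horizons = {"50%": (6.0, 34.4), "80%": (1.9, 11.0)}


def growth(old, new):
    """Return the growth factor and the equivalent number of doublings."""
    factor = new / old
    return factor, math.log2(factor)


for name, (old, new) in horizons.items():
    factor, doublings = growth(old, new)
    print(f"{name} horizon: x{factor:.2f}, about {doublings:.2f} doublings")
```

Both horizons grow by a factor of roughly 5.7 to 5.8, i.e. about 2.5 doublings. With a shared slope the two factors should be identical, so the small difference is presumably rounding in the quoted values.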
As a counterpoint:
I think a utility function is a significantly broader concept than character. You can fold anything into a utility function! I also think talking about utility functions only makes sense if the system has one explicitly.
And of course goals are a completely different thing from character! Goals are not “stable behavioural dispositions”.
We know empirically that LLMs have “stable behavioural dispositions”. Everybody who has interacted with a bunch of different models knows that.
I also think that character traits are in many cases priors about the world, your place in the world, your relation to other agents in the world, etc: Is everyone out to get you? Does it make sense to be slow and careful or should you move fast and break things? Is energy expensive or do you need to get shit done? Depending on which priors you have, you are trusting or not, conscientious or not, lazy or not.
These are not human-specific; any kind of agent can have them.
Directly. Given that they are then priced at roughly 66% for our customers I guess the insane markup goes to my company.
The recommendation is something like 10 grams and I probably take roughly that, but relatively spread out over the day (I mix it into a hot beverage). But as I said, I very rarely have any problems sleeping, so I don’t know what a positive effect on that would have looked like.
Very interesting!
I started taking collagen half a year ago and didn’t notice improved sleep. But my sleep is also hard to improve upon. I had less joint pain, which is what I was taking it for, but that could also be explained by regression to the mean.
The main effect I am pretty sure about is that my fingernails started growing like crazy.
Good to know there are other reasons it might be a good idea to keep taking it.
Yeah, I guess there is just more friction than one would expect. I also found out that Slovaks are now more like 66% of the price of Germans, so either the margin is great or the income gap is smaller than I thought.
Also: Ease of communication is hard to overrate. I always enjoy working with people who have a background very similar to mine (i.e. similar milieu growing up, not just same country); communication is so smooth.
Actually it is usually a few days. But I am rarely sleep deprived, so the reason is usually something that is probably slower to change than sleep deprivation, like infections or stress.
My experience from playing a lot of online chess is that tiredness, exhaustion, illness etc. don’t necessarily crash my performance immediately. Often I feel bad but still play well. Performance then crashes over the following days.
I think one could explain the experiment to them. That might be convincing enough that they actually try to think of a number.
“These values are hard-coded instead of being extracted from the groundtruth excel, that can’t be right.”
No shit, Claude …
Pray tell, who might have faked these groundtruth values when he couldn’t find a way to extract checkmarks from the excel?