Is there already a METR evaluation of Claude 4?
I read that this “spoiled meat” story is pretty overblown. And it doesn’t pass the sniff test either. Most meat was probably eaten right after slaughter, because why wouldn’t you?
Also, herbs must have been cheaply available. And I recently learned that every household in medieval Europe had a mother of vinegar.
What LLMs lack
I played a game against GPT-4.5 today and it seemed to be the strongest LLM I have played so far. It didn't hallucinate once, didn't blunder, and reached a drawn endgame after 40 moves.
What helps me overcome the initial hurdle of starting work in the morning:
Write a list of the stuff you have to do the next day
Make it very fine-grained, with individual tasks (especially the first few) requiring basically no effort.
Tick them off one by one
Also:
Tell people what you have to do, when you are going to do it, and that you have done it: a colleague, your team, or your boss.
Do stuff with other people. Either actually together, like pair programming, or closely intertwined.
I think it also helps to take something you are good at and feel good about, and in that context take responsibility for something and/or interact with or present to people. Only this kind of social success builds the confidence to overcome social anxiety; directly tackling the social stuff you feel worst about usually backfires (at least for me).
Which is exactly what I am doing in the post, by saying that the question of consciousness is a red herring, i.e. not that relevant to the question of personhood?
No.
The argument is that feelings, or valence more broadly, require additional machinery in humans (amygdala, hypothalamus, etc.). If the machinery is missing, the pain/fear/.../valence is missing, even though the sequence learning works just fine.
AI lacks this machinery; therefore it is extremely unlikely to experience pain/fear/.../valence.
On AI personhood
It’s probably just a difference in the tokenizer. Tokenizers often produce tokens with trailing whitespace. I actually once wrote a tokenizer and trained a model to predict “negative whitespace” for the cases where a token shouldn’t have its trailing whitespace. But I don’t know how current tokenizers handle this; probably in different ways.
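For what it’s worth, a quick way to check how a particular tokenizer attaches whitespace is to decode its tokens one at a time. A minimal sketch using tiktoken (just one example tokenizer, not necessarily the one behind the model in question):

```python
# Minimal sketch: inspect how one particular BPE tokenizer (tiktoken's
# cl100k_base) attaches whitespace to tokens. Other tokenizers may attach
# whitespace as trailing rather than leading, or handle it differently.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Hello world, hello again"
token_ids = enc.encode(text)

# Decode each token individually to see where the spaces end up.
for tid in token_ids:
    print(repr(enc.decode([tid])))
```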
I originally thought that the METR results meant that this or next year might be the year in which AI coding agents have their breakthrough moment. The reasoning was that if the trend holds, AI coding agents will be able to do several-hour-long tasks with a certain probability of success, which would suddenly make the overhead and cost of using an agent economically viable.
I have now realised that this argument has a big hole: all the METR tasks are timed for unaided humans, i.e. humans working without the help of LLMs. This means that, especially for the tasks that AI coding agents can already complete successfully, the actual time an LLM-aided human would need is much shorter.
I’m not sure how many task-completion-time doublings this buys before AI coding agents take over a large part of coding (a back-of-the-envelope sketch below), but the farther we extrapolate from the existing data points, the higher the uncertainty that the trend will hold.
Estimating task completion times for AI-aided humans would have been an interesting addition to the study. Correlating the time savings from AI support with the task completion probability of AI coding agents might have allowed predicting the actual economic competitiveness of AI coding agents in the near future.
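The back-of-the-envelope sketch mentioned above. The numbers are assumptions: the roughly 7-month doubling time is the headline figure from the METR report, and the speedup factors for LLM-aided humans are made up purely for illustration:

```python
# Back-of-the-envelope: if METR task times are measured for unaided humans,
# and LLM-aided humans are k times faster on the same tasks, the trend line
# needs log2(k) extra doublings to reach the same "economic" horizon.
# Assumptions: ~7-month doubling time (METR's headline figure); the speedup
# factors k are purely illustrative.
import math

doubling_time_months = 7.0

for speedup in (1.5, 2.0, 4.0, 8.0):
    extra_doublings = math.log2(speedup)
    delay_months = extra_doublings * doubling_time_months
    print(f"aided humans {speedup:>3}x faster -> "
          f"{extra_doublings:.1f} extra doublings "
          f"≈ {delay_months:.0f} months of extra trend")
```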
I meant chess-specific reasoning.
I occasionally test LLMs by giving them a chess diagram and letting them answer questions about the position, ranging from very simple ones to ones that require some calculation or insight (a sketch of such a test prompt follows below).
Gemini 2.5 Pro also impressed me as the first LLM that could at least perceive the position correctly, even if it quickly went off the rails as soon as some reasoning was required.
In contrast to manufacturing, I expect this to get a lot better as soon as any of the labs makes an effort.
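For concreteness, this is roughly how such a test prompt can be put together. The sketch uses python-chess; the FEN and the questions are just examples I picked for illustration, not a fixed benchmark:

```python
# Minimal sketch of how such a test prompt can be built with python-chess.
# The FEN (a Ruy Lopez position after 3.Bb5) and the questions are purely
# illustrative.
import chess

fen = "r1bqkbnr/pppp1ppp/2n5/1B2p3/4P3/5N2/PPPP1PPP/RNBQK2R b KQkq - 3 3"
board = chess.Board(fen)

questions = [
    "Which piece is on b5?",                      # pure perception
    "Which black pieces are currently attacked?", # simple board awareness
    "Is 3...a6 a reasonable move here, and why?", # requires some chess reasoning
]

prompt = (
    "Here is a chess position (White at the bottom):\n\n"
    f"{board}\n\n"
    f"FEN: {fen}\n\n"
    "Answer the following questions:\n"
    + "\n".join(f"{i + 1}. {q}" for i, q in enumerate(questions))
)
print(prompt)  # send this to the LLM under test
```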
Let’s instead assume a top engineer has a really consequential idea every couple of months. Now what?
Speeding up implementation just means that you test more of the less promising ideas.
Speeding up feedback might mean that you can home in on the really good ideas faster, but does this actually happen if you don’t do the coding and don’t do the math?
Do you plan to evaluate new models in the same way and regularly update the graph?
Yes, you are right. I overstated my case somewhat for these simple scenarios. There were also earlier results in that direction.
But in your work there is probably already an “unsafe code” activation, and the fine-tuning only sets it to a permanent “on” (see the probe sketch below for the kind of thing I mean). The model already had the ability to state “the unsafe code activation is on” before the fine-tuning, so maybe that result isn’t very surprising?
There probably isn’t an equally simple “discriminate in favour of Canadians” activation, though I could imagine more powerful models also getting that right.
My examples are orders of magnitude harder and, I think, point to a fundamental limitation of transformers as they are currently trained.
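To spell out what I mean by an “unsafe code” activation: something like a single direction in the residual stream that a linear probe can pick up. The sketch below is not the paper’s actual method; it uses synthetic stand-in activations purely to illustrate the idea:

```python
# Illustrative sketch only (not the paper's method): probing for a
# hypothetical "unsafe code" direction in residual-stream activations.
# The activations here are synthetic; in practice you would cache them
# from a real model on safe vs. unsafe code completions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model, n = 256, 1000

# Fake activations: "unsafe" examples are shifted along one hidden direction.
true_direction = rng.normal(size=d_model)
true_direction /= np.linalg.norm(true_direction)
labels = rng.integers(0, 2, size=n)              # 1 = unsafe-code context
acts = rng.normal(size=(n, d_model)) + np.outer(labels * 2.0, true_direction)

# A linear probe recovers the direction; fine-tuning that "sets it to on"
# would correspond to a persistent shift along this direction.
probe = LogisticRegression(max_iter=1000).fit(acts, labels)
learned = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
print("probe accuracy:", probe.score(acts, labels))
print("cosine with planted direction:", float(learned @ true_direction))
```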
I find this possible, though it’s not my median scenario, to say the least. But I am also not sure I can put the probability of such a fast development below 10%.
Main cruxes:
I am not so sure that “automating AI research” is going to speed up development by orders of magnitude.
My experience is that cracked AI engineers can implement any new paper or well-specified research idea in a matter of hours, so speeding up the coding can’t be the source of a huge speedup to R&D.
The bottleneck seems to be:
a) Coming up with good research ideas.
b) Finding the precise formulation of that idea that makes the most sense / works.
LLMs so far are bad at both. So I currently only see them scouring the immediate neighbourhood of existing ideas, to eke out incremental progress in the current paradigm.
Is that enough? Is an LLM built on a base model whose loss is close to the irreducible loss AGI? I.e. does accelerating this improvement matter for the transition to AGI and superintelligence?
I think not even the authors believe that. So they make the leap of faith that accelerated research will make a qualitative difference too. I think there are additional gaps between human cognition and LLMs beyond recursive reasoning in latent space and sample efficiency.
Will all those gaps be closed in the next few years?
This but unironically.
To answer my own question: They usually don’t. Models don’t have “conscious access” to the skills and knowledge implicit in their sequence prediction abilities.
If you train a model on text and on videos, it lacks all ability to talk sensibly about videos. To gain that ability, it also needs to be trained on data that bridges these modalities.
If things were otherwise, we would be a lot closer to AGI. Gemini would have been a step change. We would be able to gain significant insights into all kinds of data simply by training an LLM on it.
Therefore it is not surprising that models don’t say what they learn. They don’t know what they learn.
I think Sailer had it right 30 years ago. It’s mostly just behavioral and physical masculinity/femininity. That may be unfair, but it’s not racism.