I agree. I’ve been saying for a while that LLMs are highly optimized to seem useful, and people should be very cautious about assessing their usefulness for that reason. This seems like strong and unambiguous positive evidence for that claim. And a lot of the reaction does seem like borderline cope: this is NOT what you would expect to see in AI 2027-like scenarios. It is worth updating explicitly!
I think people are trying to say, “Look, AI is progressing really fast; we shouldn’t make the mistake of thinking this is a fundamental limitation.” That may be, but the minimal thing I’m asking for here is to actually track the evidence in favor of at least one alternative hypothesis: LLMs are not as useful as they seem.
Gemini seemed useful for research and pushed me in the other direction. But lately there have been some bearish signs for LLMs (bullish for survival). Claude Opus 4 is not solving longer-time-horizon tasks than o3. Agency on things like Pokémon, the vending-machine experiment, and NetHack is still not good. And Grok 3 is so toxic that I think this is best viewed as a capabilities problem, one I would personally expect to be solved if AGI were very near. Also, reasoning models seem to show INCREASED hallucinations.
My P(doom) has dropped back from 45% to 40% based on these events.