I wonder if it’s partly because the descriptions are interdependent. As in, to keep them up to date, they’d need to update every description at each new release, since which model is best for which task will change.
AnthonyC
It’s only been a year since their best and worst models were named o4 and 4o, respectively. They still don’t seem to care much whether their model names make sense or are useful.
Excellent post, but I roll to disbelieve that this is fully true about cats and dogs. I accept that the average cat and dog fail it. But our pets vary widely in intelligence. I used to dog sit, and there were definitely a few, maybe the top decile at most, that used mirrors as tools (for example, if I spoke to them while they were facing the mirror, they’d look at my reflection instead of me). I never tried the mark test with them.
In a world where AI is unambiguously better than humans at writing of all sorts and styles, there’s enough economic disruption that I’m hesitant to make positive claims about what incentives will or will not motivate anyone to do anything.
As long as you are not completely alone in preferring human-written fiction, the existence of “better” fiction is irrelevant. You are the market, you are the judge, and writers don’t need that large an audience, if it’s a dedicated one. Many great writers would also write whether they got paid or not, some whether anyone reads it or not, and in an AGI world that could become more true rather than less.
I’m sure there is still data it would need to collect. I think it’s a mistake to use the amount and type of data humans require as a guide to what that might mean.
Yes. I doubt it would be practical for ASI to solve biology by simulating the fundamental physics of various 50-100kg lumps of atoms and seeing what matches available data on humans. I also doubt it would need anywhere remotely close to the number of experiments we need to draw the lessons it needs to solve any particular biology problem.
3-2 AI preference here. The human passages I preferred were the oldest ones, which I don’t think is a coincidence either.
On yeast: I know most sources I’ve read say shorter timelines, but I keep my instant yeast in the freezer, and generally it still works just fine even after >5 years.
Yeah. Given the level of variance among humans, we know significant variance is possible even within the strict bounds of “minds that run on 3 pounds of meat and 20 watts of sugar, most of which is spent on things other than thought.” I find this to be a pretty strong argument that the limits at which we hit diminishing returns should be rather far away.
The other side of my mental model is: if (our best model of) the laws of physics fits on a postcard, what exactly does it mean to need to do an experiment, in principle? You need experiments to nail down the laws. Beyond that, they’re convenient for reducing computational requirements, often vastly so, but nothing prevents you from getting things right on the first try far more often than humans do.
My charitable musing is that maybe Dario genuinely hasn’t met anyone whom he judged as smarter than himself by a wide enough margin to develop this intuition via humans, which I think is a somewhat easier path to really feeling the possibility internally than approaching it abstractly/intellectually.
I suspect such evidence would involve giving away more details of Anthropic’s internal processes and workflows than Anthropic would be comfortable with.
True, but I doubt very many employees will openly state they were not able to do any work before Mythos.
I agree in general, but presumably this would result in the company redistributing resources towards whatever the now-most-critical-bottleneck activities are? Maybe that’s impossible for humans at the current pace of AI development (organizations are usually not this responsive and adaptable)? Alternatively, couldn’t accelerating the acceleratable activities plausibly cause bigger gains per model iteration in ways that might subsequently loosen remaining bottlenecks?
I did specifically say that I was worried cyber wasn’t a major category in RSPv3. Maybe it’s time to admit this was a mistake? It’s kind of wild that they did that, when at the time of the issuance of RSPv3 they knew about Claude Mythos.
Is there a chance we should be drawing the opposite conclusion? That Anthropic updated the RSP knowing Mythos and all future models could not possibly be safely and responsibly released under current conditions because of their cyber capabilities, and offloading that concern from the RSP (since it’s no longer about scaling) to Project Glasswing?
While yes there are many cases of evolution not finding specific compounds, biology does often make excellent use of complex microstructures to tweak performance in ways humans struggle with, since our assemblers are not molecular. Different search functions operate under different process constraints and explore different parts of the solution space.
Bone combines soft and hard domains (herringbone structure) to improve overall strength and impact resistance—researchers have adapted this to use in human-made plastics using 3D printing but it’s not commonly done since it’s not easy to make at scale.
The mantis shrimp’s dactyl club combines a similarly microstructured impact resistant mineral outer layer with a layered, helicoidal chitin fiber composite structure (10-15 degree interlayer angle) to dissipate energy, and a striated parallel chitin fiber inner layer to act as a cushion. Humans can do better with the most advanced carbon fiber composites we can make, but there was a point not too long ago when we couldn’t.
Biology also famously has a hard time making stable deep blue pigments, and an easier time making things blue using structural color, or by isolating pigments in places where they can be made blue by carefully adjusting pH. (Fun fact: gold nanoparticles of different sizes are one of the ways people used to make stained glass in different colors).
Thank you for posting this, and being so forthright about it.
I know everyone here understands exponentials far better than almost everyone on the planet, but Anthropic ended last year already big enough to be on the 2026 Fortune 500. They’ve been growing at just under 50%/month. That rate would make them the largest company in the world around mid-November.
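For anyone who wants to sanity-check the compounding claim, here’s a minimal sketch. The dollar figures are placeholders I made up for illustration, not Anthropic’s actual revenue or any real company’s market size; only the ~50%/month growth rate comes from the comment above.

```python
import math

def months_to_target(start: float, target: float, monthly_growth: float) -> float:
    """Months for a quantity to grow from `start` to `target`
    at a fixed compound rate of `monthly_growth` per month."""
    return math.log(target / start) / math.log(1 + monthly_growth)

# Placeholder numbers: something 60x smaller than the "largest company"
# benchmark, compounding at 50%/month, catches up in about ten months,
# i.e. starting in January it would pass the target around mid-November.
print(round(months_to_target(10e9, 600e9, 0.5), 1))  # → 10.1
```

The point is just how insensitive the answer is to the starting size: at 50%/month, even a 100x gap closes in under a year.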
For any other industry or product, I would naturally assume there’s a logistic curve that would level out before that happens and slow things down, even if they do ultimately take the top spot. In this case I honestly don’t know what I expect to happen this year. Doomsday 2026 seems like way less of a joke nowadays than I’d like.
I have been on both sides of those kinds of divides with other humans, yes.
I do agree with that. It just seems to me like the solution at some point becomes “Run a pilot study on a comparable real-world task that needs doing anyway” rather than “Develop a standard benchmark.”
That’s true
Not terribly important, but since that was Scott quoting the Principia Discordia, probably better to just cite the original.