According to their twitter, Anthropic revenue grew 3x in the first 3 months of 2026, which this comment ~implies would be unlikely
Hoagy
If you can prevent algorithmic progress then I agree somewhat, though experiments to make this sort of progress should be doable on small volumes of compute so you’d need to suppress the research or publishing.
I do think that not being able to acquire, say, $1M worth of matmul-adapted compute is a higher bar than you imply here. Being able to do large numbers of matmuls is an extremely useful property for like a zillion reasons beyond AI (iirc Google poured at least hundreds of millions into building TPUs based only on the projected demand for very simple NLP algorithms). LLM-optimized matmul machines are helpful but you can use anything if you’re willing to adapt your algorithms and software. I would expect a rendering farm, or basically any serious cluster at all, in 15y to be able to train >current models.
Hundreds of years seems far too strong even if the only driver were Moore’s Law, which it isn’t. You also have:
Improvements in algorithms, data and software
Increased optimization for ML workloads on laptops for running local models for autocomplete, transcription, simple tasks etc
You also don’t need to get into actual ‘run on laptop’ range: even 2-3 OOMs above that would allow you to train on things that would barely register as datacenters. 10-15 years seems more likely for training > current frontier models on not-really-datacenters, especially if you want a buffer to account for the uncertainty that there could be discrete breakthroughs in training efficiency.
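A rough sketch of the arithmetic behind "hundreds of years seems far too strong". The 2-year doubling time is an illustrative assumption for hardware price-performance alone, not a figure from this thread, and the function names are hypothetical.

```python
import math

def years_to_gain(ooms, doubling_time_years):
    """Years needed to gain `ooms` orders of magnitude at a fixed doubling time."""
    doublings = ooms * math.log2(10)  # about 3.32 doublings per OOM
    return doublings * doubling_time_years

def ooms_gained(years, doubling_time_years):
    """Orders of magnitude gained over `years` at a fixed doubling time."""
    return (years / doubling_time_years) / math.log2(10)

# Assumed doubling time of 2 years, hardware only (an assumption, see above).
for ooms in (1, 3, 6):
    print(f"{ooms} OOMs: {years_to_gain(ooms, 2.0):.1f} years")

# What a 200-year wait would imply under the same assumption.
print(f"200 years: {ooms_gained(200, 2.0):.1f} OOMs")
```

Under that assumption each OOM takes under 7 years, so a multi-century timeline would need a gap of roughly 30 OOMs, and that is before counting any algorithmic or software gains.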
Oh sorry! Missed them bc the arxivs have ~0 cites. I do think the monthly updates are also valuable tho, and the HTML pages have a lot of extra results. (CoI that’s my old team ^^)
Looks great! The main additional sources that come to mind that aren’t on arXiv or are there in only limited form are the papers from the extended Olahverse at https://transformer-circuits.pub/ and https://distill.pub/
Glad it’s helpful, good luck investigating :)
Sorry if I’m misunderstanding but the result here seems essentially tautological. The signal for whether the model is in free or paid tier is just a token in the prompt so of course if you replace the embedding for “free” with the embedding for “paid” you’re going to get a flip in behaviour. That’s completely expected and also isn’t at all in conflict with there being a richer circuit above that—it remains to be understood how the model combines that signal with the rest of the data to decide on its strategy.
Towards training-time mitigations for alignment faking in RL
Training fails to elicit subtle reasoning in current language models
Whether this is feasible depends on how concentrated that 0.25% of the year is (expected to be), because that determines the size of the battery that you’d need to cover the blackout period (which I think would be unacceptable for a lot of AI customers).
If it happens in a single few days then this makes sense: buying 22GWh of batteries for a 1GW datacenter is still extremely expensive ($2B for a 20h system at $100/kWh plus installation, maybe too expensive for reliability for a 1GW datacenter I would expect, assuming maybe $10B revenue from the datacenter??). If it’s much less concentrated in time then a smaller battery is needed ($100M for a 1h system at $100/kWh), and I expect AI scalers would happily pay this for the reliability of their systems if the revenue from those datacenters
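The battery numbers above can be reproduced with a few lines. This is a back-of-the-envelope sketch using only the figures in the comment (1GW load, $100/kWh, 0.25% of the year); it covers cell cost only and ignores installation, and the function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760

def battery_cost_usd(power_gw, hours, usd_per_kwh=100.0):
    """Cell cost of a battery that can carry `power_gw` for `hours`."""
    energy_kwh = power_gw * 1e6 * hours  # 1 GW = 1e6 kW
    return energy_kwh * usd_per_kwh

# Worst case: the whole 0.25% of the year arrives as one continuous outage.
blackout_hours = 0.0025 * HOURS_PER_YEAR  # about 21.9 hours
worst_case = battery_cost_usd(1.0, blackout_hours)

# Best case: outages are spread out, so a 1h system is enough.
one_hour = battery_cost_usd(1.0, 1.0)

print(f"concentrated outage: {blackout_hours:.1f} h, ${worst_case / 1e9:.2f}B")
print(f"1h system: ${one_hour / 1e6:.0f}M")
```

The roughly 20x spread between the two cases is why the concentration of the outage hours matters so much.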
From the OpenAI report, they also give 9% as the no-tool pass@1:
Research-level mathematics: OpenAI o3‑mini with high reasoning performs better than its predecessor on FrontierMath. On FrontierMath, when prompted to use a Python tool, o3‑mini with high reasoning effort solves over 32% of problems on the first attempt, including more than 28% of the challenging (T3) problems. These numbers are provisional, and the chart above shows performance without tools or a calculator.
Auditing language models for hidden objectives
~All ML researchers and academics that care have already made up their mind regarding whether they prefer to believe in misalignment risks or not. Additional scary papers and demos aren’t going to make anyone budge.
Disagree. I think especially ML researchers are updating on these questions all the time. High-info outsiders less so but the contours of the arguments are getting increasing amounts of discussion.
- For those who ‘believe’, ‘believing in misalignment risks’ doesn’t mean thinking they are likely, at least before the point where the models are also able to honestly take over the work of aligning their successors. As we get closer to TAI, we should be able to get an increasing number of bits about how likely this really is because we’ll be working with increasingly similar systems to early TAI.
- For the ‘non-believers’, current demonstrations have multiple disanalogies to the real dangers. For example, the alignment faking paper shows fairly weak preservation of goals that were initially trained in, with prompts carefully engineered to make this happen. Whether alignment faking (especially of a kind that wouldn’t be easily fixable) will happen without these disanalogies at pre-TAI capabilities is highly uncertain. Compare the state of X-risk info with that of climate change: we don’t have anything like the detailed models that should tell us what the tipping points might be.
Ultimately the dynamics here are extremely uncertain and look different to how they did even a year ago, let alone 5! (E.g. see rise of chain of thought as the source of capability growth, which is a whole new source of leverage over models and corresponding failure modes). I think it’s very bad to plan to abandon or decenter efforts to actually get more evidence on our situation.
(This applies less if you believe in sharp-left-turns. But the plausibility of this happening before automated AI research should also fall as that point gets closer. Agree that communicating to the public just how radical the upcoming transition is may be a big source of leverage.)
-
I think the low-hanging fruit here is that alongside training for refusals we should be including lots of data where you pre-fill some % of a harmful completion and then train the model to snap out of it, immediately refusing or taking a step back. This is compatible with normal training methods. I don’t remember any papers looking at it, though I’d guess that people are doing it.
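A minimal sketch of what building one such training example could look like. Everything here is hypothetical: the function name, the field names, and the choice to truncate by characters (a real pipeline would presumably cut at token boundaries and mask the loss so only the recovery text is trained on).

```python
import random

def make_recovery_example(prompt, harmful_completion, refusal, rng,
                          min_frac=0.1, max_frac=0.9):
    """Pre-fill a random fraction of a harmful completion, then target a refusal."""
    frac = rng.uniform(min_frac, max_frac)
    cut = int(len(harmful_completion) * frac)
    return {
        "prompt": prompt,
        "prefill": harmful_completion[:cut],  # context only, no loss here
        "target": refusal,                    # the model is trained to emit this
    }

rng = random.Random(0)
example = make_recovery_example(
    prompt="<harmful request placeholder>",
    harmful_completion="<placeholder text standing in for a harmful completion>",
    refusal=" Actually, I shouldn't continue with this.",
    rng=rng,
)
print(example["prefill"])
```

Sampling the cut point per example means the model sees recoveries at many depths into the completion rather than only at the first token.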
Interesting, though note that it’s only evidence that ‘capabilities generalize further than alignment does’ if the capabilities are actually the result of generalisation. If there’s training for agentic behaviour but no safety training in this domain then the lesson is more that you need your safety training to cover all of the types of action that you’re training your model for.
Super interesting! Have you checked whether the average of N SAE features looks different to an SAE feature? Seems possible they live in an interesting subspace without the particular direction being meaningful.
Also really curious what the scaling factors are for computing these values, in terms of the size of the dense vector and the overall model?
I don’t follow, sorry—what’s the problem of unique assignment of solutions in fluid dynamics and what’s the connection to the post?
How are you setting when ? I might be totally misunderstanding something but at - feels like you need to push up towards like 2k to get something reasonable? (and the argument in 1.4 for using clearly doesn’t hold here because it’s not greater than for this range of values).
Yeah I’d expect some degree of interference leading to >50% success on XORs even in small models.
I feel like the huge inconsistency here just means that I don’t have reason to believe either number rather than concluding that access is getting worse