I don’t want to seem like an Ant shill, but I must confess that at least a month before this post came out, I had a private conversation with one of their RL environment vendors about how Ant had put out new rules to prevent some of the behaviors mentioned above. And a few days after the post came out, Opus 4.7 was released, and I saw & confirmed that it improved on some of the behaviors mentioned (particularly the only-solving-a-portion-of-the-work problem). This was limited positive evidence about their alignment team’s ability & inclination to notice and fix value problems with each generation.
This is gonna sound like the most caveman grug-tier question. But like, would Vladimir Putin just force every woman on earth into his harem? Cause that sounds pretty bad from my CEV.
Bought your options, right boys?
Dunno
Softened some of the language in the conclusion.
You can go a bit beyond the frontier on the margin with specialized RL environments and some specialized RLHF, but you can’t go substantially beyond the frontier (this is basically what the bitter lesson is about).
Seems like we just disagree on the object-level question here, and also on what the Bitter Lesson implies for this situation. My current impression is that the labs got most of their generalization during pretraining, that the primary gains since 2024 have been due to specialized RL on tasks that don’t generalize well, and that the massive diversity of the environments the labs go out of their way to procure reflects this. If what you say were true, why wouldn’t the labs just train mostly on Math and then expect the models to generalize their gains to Law and SWE? It’s a lot easier to make synthetic Math datasets, and they wouldn’t have to spend billions building these arenas.
I’m not an expert though; this might be better resolved if someone at or near the labs just gave us their opinion on what % of the gains from training on SWE RL environments goes to software engineering and what % actually uplifts other tasks; I’m sure they’ve measured it.
“Functionally predict what X language model is going to do” is almost the quintessential short-horizon task that is easy to set up inside an environment. I do not know what proportion of spend Anthropic has allocated toward interpretability-ish tasks like this compared to SWE, or how effective they are at “conceptual research”, but you can spend money on that task and get a model to do a better job; that is just a fact. Likewise for other subtasks that matter for ensuring models are aligned, like reviewing transcripts and spotting misbehavior. Just like how, before the model companies have completely automated the full loop of what a software engineer does, they can do RL on verifiable subsections of the task, like “implement this feature and don’t make any bugs.”
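To make concrete why this is easy to set up: the whole environment is a guess-then-check loop, because the ground truth comes from just running the target model. Here’s a minimal sketch (every name here is a placeholder I made up, not any lab’s actual API, and the scoring rule is deliberately crude):

```python
# Hypothetical sketch of a verifiable-reward environment for
# "predict what model X is going to do". `predictor` and `target`
# are placeholder objects with a .generate(prompt) -> str method.

def prediction_reward(predictor, target, prompt: str) -> float:
    """One episode: the predictor guesses the target model's response,
    the target actually runs, and the guess is scored against reality.
    The reward is a mechanical check, so no per-episode human labels."""
    guess = predictor.generate(
        "Predict, verbatim, the response the target model will give "
        "to the following prompt:\n" + prompt
    )
    actual = target.generate(prompt)
    return token_overlap(guess, actual)

def token_overlap(a: str, b: str) -> float:
    """Crude stand-in similarity metric: fraction of predicted tokens
    that appear anywhere in the actual output."""
    a_tokens, b_tokens = a.split(), set(b.split())
    return sum(t in b_tokens for t in a_tokens) / len(a_tokens) if a_tokens else 0.0
```

A real setup would use a better similarity metric and sample the target multiple times, but none of that changes the basic shape: the reward comes from running the target, so it scales without human graders.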
No, Eliezer is not making this argument
You’re right, I bastardized it in that comment in my haste, but I did not do so in the post. The AGI Ruin post says:
8. The best and easiest-found-by-optimization algorithms for solving problems we want an AI to solve, readily generalize to problems we’d rather the AI not solve; you can’t build a system that only has the capability to drive red cars and not blue cars, because all red-car-driving algorithms generalize to the capability to drive blue cars.
“Problems we want to solve” includes (but is not limited to) those subtasks. Maybe the model companies don’t have the right incentive structure to take advantage of this, but this sentence is “wrong, at least in the manner that’s relevant for us.” I can modify the post to make it clear that I’m not necessarily saying the model companies are going to use this as much as necessary.
That’s a different argument than the one Eliezer is making. Eliezer is saying that the ASI model you train is going to be great at everything no matter what. You’re arguing that it’s difficult/not incentivized to train an AI on alignment research, and so the model companies are in practice going to fail at that.

I disagree that we can’t train an AI to do alignment research. It’s true that we can’t have models align an AGI inside an RL environment and give it a reward, but you can definitely train a model to e.g. perform interpretability research, catch instances of reward hacking, and predict what other models are going to do during deployment before you run them. All of the above are things that the labs train their models to do today.
Training the models to be good at software engineering is bad in the sense that it’s capabilities progress, but there’s a third category of capabilities, like “can take over the world” or “can manipulate humans into arbitrary actions”, that are useful for takeover and that models aren’t learning. “Maybe we can train a model to only help us with X, where X is extremely hard but not sufficient to take over the world” is the vague hope that I imagined Eliezer was originally responding to.
I think you’re misunderstanding 19(a). We have no idea whether the preference you impute to Claude in that conversation reflects a robust pointer to “latent events and objects and properties in the environment” rather than to its own sense data. And, more specifically to the point he was making, there is no publicly-known technique within the current paradigm of training LLMs that we have good reasons to believe instills preferences over environmental latents (the ground truth) rather than sense data (proxies), let alone any specific latents of our choosing...
I don’t think I’m misunderstanding it, but I am going to remove that section because I’m finding it difficult to articulate why I think this argument for danger is so weak, and you’re right that the current section is not conveying that & is instead arguing against a strawman. It has something to do with how much this sounds like people saying that models aren’t really intelligent because “all they do is predict the next token”, or making the same claim about humans: that they’re ultimately just interested in sense data instead of latents. I get that it’s not perfectly analogous, because models are potentially going to optimize these tiny differences until they bite us in the ass, but something feels weird about this line of argument.
Re: “particular alignment proposals” (under point 10): one problem here is that there are not that many concrete alignment proposals for superintelligent systems that don’t have known catastrophic flaws. As far as I can tell, Anthropic’s plan is “throw the kitchen sink of all the white-box and black-box methods we’ve developed at our models, and hope that’s good enough at the point where we’ve developed a model that we think can kick-start RSI (including coming up with its own novel alignment methods for future generations of models)”. The current slope of epistemically justified assurance in model alignment, as reported by their system cards and the most recent Alignment Risk Update, is downwards. That is a bad direction for the slope to be pointing when we haven’t even hit RSI-capable models yet! The methods Anthropic is using to figure out whether their models are coherently misaligned rely substantially on the models demonstrably lacking the capabilities that would be necessary to cover it up if they were misaligned. We are starting to hit the point in model capabilities where this signal is getting less reliable. The techniques and evals are not keeping pace.
I have not read these PDFs but that all seems very possible.
Reevaluating “AGI Ruin: A List of Lethalities” in 2026
I feel like replying “you first.” You’re the one making all of these claims about how the FBI is involved in suppressing domestic political movements, and your only example is a raid that happened in the seventies.
Linkpost note: I originally avoided cross-posting this to LessWrong because I thought readers would likely find it too political, but now that top posters have been prominently advancing arguments on the specific premise this article argues against, and engaging in related naked political posturing as predicted by this model, I felt it was timely, relevant, and well within revealed standards.
I wish there were more high-quality discussion of hot-button political topics on LessWrong, and I don’t know who you’re referring to, but I strong-downvoted this post anyway because it does a lot of silly posturing/attention-seeking and its premises are ill-defined and seem ill-supported.
Been having some persistent issues editing the post, including but not limited to the current situation, where I just can’t connect:
I don’t usually browse Twitter or Hacker News, but I wanted to hear what practitioners thought of the new model. And this is the first time I’ve learned that there’s this enormous glut of Hacker News readers/regular engineers who are under the impression that 4.6 has been nerfed since February. Is that something anybody here thinks actually happened, or is this just the weird reality of modern LLMs, where people can hyperstition fears like this in response to nothing?
I misunderstood how the checkbox works and now the comments are inaccessible. Whoops.
Everybody who already responded: I did read your comments and am considering them/updating the post in response, using my squishy memory box.
Oof.
How do you “buy” info from another universe? They can’t respond.
Near-term omnicide seems much closer to guaranteed in the world where the U.S. is not present to lend critical support to the Allies during WW2, or to serve as a place for Jews to migrate to.
I just dismissed some of the already-responded-to comments, and also some of the comments I have yet to get to but will respond to, inside the comment box.
Disco Elysium deserves to be on the primary list