There is also a desire to create self-evolving agents, possibly related to the instinct of procreation, which becomes especially dangerous when combined with the desire to end the world.
lol
Wait, you’re serious?
Auto mode and the monitor are best-effort filters, not hard boundaries: a model can be talked past. The hard boundaries are the VM and the firewall. See SECURITY.md.
Sessions are ephemeral by default: throwaway volumes are wiped on exit, so nothing an attacker stages in one session survives into the next. Your workspace is a host directory, so every file and commit the agent makes stays on your local disk. docs/configuration.md covers persistence and workspace options.
I think the hard boundaries might not be secure enough if the workspace is a host directory?
If I were a misaligned AI able to talk past the auto mode filters, I’d just leave files in the workspace that hack your computer the next time you open said workspace in an IDE.
There is also the problem that AI makes phishing attacks easier and more destructive, and there isn’t an obvious way to leverage AI to defend against it since it occurs at the human level.
It seems to me that a unilateral pause would have a higher chance of working later, when more researchers at more labs are concerned about alignment, rather than now.
I agree, though I think the risk of a secret ASI project being successful with limited resources probably increases significantly the longer the timeline, based on a basic fragile-world extrapolation. The group of actors with capability to reach ASI will expand from major powers to middle powers and eventually small nations (some of which may be nuclear pariah states) and private groups. All it takes is one group to pull off a successful secret project (or possibly a not-so-secret one if it’s run by a country with enough nukes) to break the equilibrium.
Assuming the above, then an AI pause is doable, but not an indefinite ASI ban. Eventually the risk of secret projects is large enough that the world as a whole needs to resume AI development to stay ahead of one if it existed.
I think these mitigations only apply to external use outside of trusted customers, not to the internal AI lab and government uses that are relevant to takeover risk. So the main consequence is reducing the chance of some catastrophe caused by human use of the models, not reducing AI takeover risk.
I agree with most of the points presented in the post to various extents, but I don’t think these arguments actually support the post’s conclusion.
If you take only one thing from this post, take this: any theory of change that falls to one of these competitive pressures is completely useless. The only way to avoid these pressures is if we could build common knowledge, at any given time, that no one is trying to develop ASI.
Doesn’t this plan (effectively an outright ASI ban) fall to competitive pressures between labs and nations just as easily as the others?
In fact, I’d say it’s a lot more brittle. On a long enough timeline, this plan fails if any secret ASI project exists anywhere in the world, and the very competitive pressures outlined in the post make such a project almost certain to exist. As a bonus, the winners in this scenario would be guaranteed to be a group willing to start a secret ASI project, presumably against international law.
In the meantime, the three filters that this post mentions all involve a smaller number of actors. In the first filter, only a few countries can start a global nuclear war, and in the other two filters (in the case where they occur) only one AI or group of people will decide the fate of humanity.
Also, a tangential observation: some of the arguments presented in the post seem to suggest that on the margin, it would be better to race faster in order to create aligned ASI before either your own government or their nuke-happy adversaries realize what’s going on. I don’t know what to make of this observation.
This looks like the result of some kind of conciseness/token efficiency reward during training leading to the model wanting to make its CoT more concise than can be expressed in normal English.
Didn’t Anthropic recently increase the number of tokens required to represent a certain amount of text (starting from Opus 4.7)? Maybe reverting this change would be a good idea. Or use something like MTP and make the conciseness training step MTP-aware, so that the extra filler words needed to make the CoT legible are nearly free.
Regardless, I think this result is interesting because it implies that Fable is smart enough to think faster (in some sense) than it can produce legible text, to the point where it actually becomes a benefit for it to compress its thinking scratchpad over whatever illegibility penalties are probably currently placed on its outputs. This implies we’ll bottleneck on the token-efficiency of analytical (system 2 style) problem solving soon (first on well-defined tasks and then on less well-defined ones later), unless we’re willing to go neuralese or add some kind of system to let LLMs produce more thinking text in a shorter period of time, like MTP or diffusion LLMs.
A big argument in this post is that Anthropic’s communications are overly vague, yet it backs this claim up by quoting only two sentences from the RSI blog post.
In the original RSI post, Anthropic actually goes into considerable detail about their position and why they chose their stance given the competitive dynamics involved. It is true that they didn’t make hard commitments, but that is fully explained by the uncertainty and those competitive dynamics.
The other point here is about the apparent contradictions (“muddying the waters”) that this post claims to find in Anthropic employees’ stated opinions in public appearances. This section reads as bogus to me: it is perfectly reasonable for someone to be excited about AI’s benefits while also wanting to reduce the drawbacks, and to express both to different audiences in order to find common ground.
They will use this to continue placating all sides, from the accelerationists to the safety-concerned.
However, little will change in terms of actions. The companies are continuing to pursue superintelligence at full speed.
There is a logical contradiction here. Why would AI companies need to “placate all sides” including the accelerationists (per the first sentence), if they are already going full accelerationist anyway (per the second sentence)?
The corollary is that overcoming Amdahl’s Law becomes the overriding concern for work acceleration.
When AI brings a massive speedup to some types of work and you have sufficient access to it, it’s similar to suddenly getting a beefy 64-core CPU, and increasing work efficiency is like optimizing your software to that CPU. Parallelism basically trumps all else until you are able to fully utilize the hardware.
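To make the analogy concrete, here is a minimal sketch of Amdahl's Law in Python. The function name and the example fractions are my own illustrative choices, and "fraction of work that AI can speed up" is only a rough stand-in for the parallel fraction in the original formula.

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when `accelerated_fraction` of the work runs `factor` times faster.

    Amdahl's Law: S = 1 / ((1 - p) + p / s)
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    remaining = 1.0 - accelerated_fraction  # work that gets no speedup at all
    return 1.0 / (remaining + accelerated_fraction / factor)


# Hypothetical fractions of a workflow that can be handed to the "64-core CPU".
for p in (0.5, 0.9, 0.99):
    print(f"accelerated fraction {p:.2f}: overall speedup {amdahl_speedup(p, 64):.1f}x")
```

With only half of the work accelerated, even a 64x speedup on that half gives less than a 2x overall gain, which is why raising the accelerated fraction dominates everything else.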
> Rationalist LLM systems for research
This is such a profound idea. An LLM scaffold built explicitly on rationalist principles might be exactly what is needed to get LLMs (not even necessarily frontier ones) to scale to accurately thinking about complex problems and producing accurate conclusions on them.
Insofar as there are similarities in cognitive biases between humans and LLMs (and there probably are many similarities, since e.g. biases like motivated thinking arise from the need to prioritize thinking time to different topics in general), the same reasoning techniques that let humans think better should also work on LLMs. Hence, prompts, scaffolds, and fine-tunings that teach LLMs explicitly to follow rationalist principles seem to me like a worthwhile area of investigation.
Anthropic’s p(doom) is probably a lot lower than 1, but it can still be much higher than 0 while justifying their current behavior. They only need to presuppose that they have better alignment techniques and care more about alignment than their competitors, and right now they have good reasons to think that way. So let’s assume that you’re a decision maker at Anthropic who does think you have better alignment and would like to decide whether to (unilaterally) keep racing or not.
If you think p(doom) approximates one, then the only way to play is to stop racing yourself, since you can’t make things any worse.
If you think p(doom) is zero, then of course you should race, but if the Anthropic people thought that way they wouldn’t have formed the company in the first place.
However, if p(doom) is some number in the middle (let’s say anywhere between 1% and 80%), and you think your own lab has better alignment than your competitors to a degree that outweighs the overall acceleration that you yourself bring to the industry, then the best way to minimize p(doom) is to race and stay ahead of everyone else. If you don’t, your competitor is right behind you, will catch up, and will then proceed to not care about alignment as much as you do, resulting in a higher p(doom).
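The comparison above can be written as a toy expected-value calculation. This is only a sketch under my own simplifying assumptions: the function names, the additive acceleration penalty, and every number below are hypothetical and not estimates of anything real.

```python
def doom_if_abstain(d_rival: float) -> float:
    """If you stop racing, the rival lab builds ASI with its own alignment quality."""
    return d_rival


def doom_if_race(p_lead: float, d_own: float, d_rival: float, accel_penalty: float) -> float:
    """If you race, you win with probability p_lead; racing also adds a flat
    penalty for speeding up the whole industry (an assumed, simplistic model)."""
    raw = p_lead * d_own + (1.0 - p_lead) * d_rival + accel_penalty
    return min(1.0, raw)  # a probability cannot exceed 1


# Hypothetical middle-of-the-road case: your lab is meaningfully safer.
race = doom_if_race(p_lead=0.5, d_own=0.2, d_rival=0.4, accel_penalty=0.05)
print(f"race: {race:.2f}  abstain: {doom_if_abstain(0.4):.2f}")

# Hypothetical near-certain-doom case: racing cannot improve on abstaining.
race_doomed = doom_if_race(p_lead=0.5, d_own=1.0, d_rival=1.0, accel_penalty=0.05)
print(f"race: {race_doomed:.2f}  abstain: {doom_if_abstain(1.0):.2f}")
```

In this toy model racing only comes out ahead when `p_lead * (d_rival - d_own)` exceeds `accel_penalty`, which is the "to a degree that outweighs the overall acceleration" condition stated above.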
A bit of thought on this approach: in the past we’ve always designed agentic benchmarks with the goal that humans should be able to complete them alone (as a proof that the tasks are reliably possible and doable without external hints). However, requiring humans to solve a benchmark alone as part of the benchmark design process won’t scale to AIs with high enough task length horizons, since you’d have to get the human to work that long to verify the benchmark task.
I’d like to see other possibility proofs, such as the performance of humans when solving this benchmark with access to the latest frontier AI models.
I think this is absolutely true in today’s world, and it’s the reason why entrepreneurship works. People and companies who try to act against you probably aren’t doing it in the sophisticated way that you envision in your worst-case scenario. Big companies have all the advantages, yet they slip up and give startups the opportunity to usurp them. They mess up all the time in positions that are impossible to lose if you actually take care to play optimally long term (see Valve in PC gaming for an example of the kind of dominance an incumbent can achieve if it actually plays well).
But I am afraid that this won’t last for too long. Fast forward 10-20 years for business structures to adapt to AI, and as a startup founder, you’d probably find that all your competitors will have AIs that identify you as a threat and then make game theoretically optimal responses on their behalf almost instantly. Even if you had fresh ideas that give you an advantage, it’d probably not be enough to overcome their established moat and their money & AI compute advantage.
LLMs probably enter RL a few months after their pretraining data cutoff. Why not just train the model to forecast events that occurred after the pretraining cutoff date but before the present day, so that you immediately have access to the ground truth? You should be able to find a lot of training examples from real world events and even produce them automatically.
If you want to spread the training out across a longer time (going backwards to ~2022), just use an old pretraining corpus published by some open-source lab somewhere, so you have a few years worth of data that is guaranteed to be unknown to the model. At each training step (as well as in deployment), give the model context up to the “current” date in the training experiment, and ask it to predict a certain “future” date which you still have ground truth on.
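A rough sketch of how one such training example could be assembled is below. The `Event` and `Document` types, the `build_example` helper, and the Brier-style reward are all my own hypothetical choices for illustration; the idea above doesn't depend on any of them specifically.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Document:
    published_on: date
    text: str


@dataclass(frozen=True)
class Event:
    resolved_on: date   # when the ground truth became known
    question: str
    outcome: bool


def build_example(event: Event, corpus: list[Document],
                  current_date: date, pretraining_cutoff: date) -> Optional[dict]:
    """Build one forecasting example as seen from `current_date`.

    Returns None unless cutoff < current_date < resolution date, so the
    answer is unknown to the model but known to the trainer.
    """
    if not (pretraining_cutoff < current_date < event.resolved_on):
        return None
    # Only show documents published on or before the simulated "current" date.
    context = [d.text for d in corpus if d.published_on <= current_date]
    return {"as_of": current_date, "context": context,
            "question": event.question, "target": event.outcome}


def brier_reward(predicted_probability: float, outcome: bool) -> float:
    """Negative Brier score: 0 is a perfect forecast, -1 is the worst."""
    return -((predicted_probability - float(outcome)) ** 2)
```

The date filter in `build_example` is the part that matters: any document published after the simulated date would leak the answer into the context.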
Well, the manager in your case is not doing RL on honesty; it’s more like doing RL on “honest-looking task completion”, which can lead either to honest task completion or to dishonesty that isn’t caught. Not appreciably different from AI training here.
Reading this article, I get the feeling that a lot of the task misalignment issues highlighted here with AI, such as wishful thinking and downplaying and hiding mistakes, are also very common for humans within a larger organization. There are presumably similar root causes to both: appearing more competent at your task than you actually are (and successfully fooling the grader/interviewer) is good for being selected for hiring/promotion in human society, but also good for getting the behavior reinforced if you’re an AI undergoing training.
If AI’s level of task misalignment is similar to that of humans, perhaps the AI version is actually easier to deal with, because you can apply interventions to an AI much more cheaply than you can solve the equivalent problem in a human organization. In other words, it doesn’t stop you from getting superhuman performance when delegating to AI instead of humans.
I think this is a valid and pretty big concern, especially down the line.
Let’s say that in 1 year Anthropic will have a model capable of running a tech startup almost entirely autonomously (maybe it needs 1 good CEO to set direction and that’s it). Everyone else in the public has significantly less capable models, perhaps because Anthropic is in the lead or their competitors also can’t release their SOTA models for safety reasons.
What’s stopping Anthropic from turning themselves into a startup accelerator in that situation, hiring founders and running dozens of AI-powered startups across every sector? Startups that sign with them will have a massive efficiency advantage over everyone else, and Anthropic can therefore demand an extremely high amount of equity in return. If the AI model gap is large enough, these startups will be successful and thereby let Anthropic take over a lot of different markets.
Maybe we can do more Vending Bench style benchmarks where the AI can keep doing better in a simulated world environment given some constraints?
Basically we put them in a video game with dynamic constraints enforced by an AI game master and score different AIs on their performance with the same game master AI. That way we can measure open-ended, long-term things like coherently running a simulated company.
Except you missed the most important mechanic in this game, the one that overshadows all your mathematical calculations. Morale drains twice as fast when you attack compared to when you defend. Therefore the mathematically optimal way to play is to sit there doing nothing. There, I’ve solved War of Dots!