I work at Redwood Research.
ryan_greenblatt
Then there is the big change that severely weakens SB 1047. [...] AND the cost of that quantity of computing power would exceed one hundred million dollars
I think this change isn’t that bad. (And I proposed changing to a fixed 10^26 flop threshold in my earlier post.)
It seems unlikely to me that this is crucial. If there is some moderately efficient way to convert money into capabilities, then you should expect the first possibly hazardous model to be quite expensive (1), quite soon (e.g. GPT-5) (2), or after a sudden large jump in capabilities due to algorithmic progress (3).
In scenario (1), plausibly the world is already radically transformed by the time algorithmic advances make $100 million enough for truely dangerous models. If the world isn’t radically transformed, by the point when $100 million is dangerous, it has probably been a while for human algorithmic insights to advance sufficiently. And, in the delay, the government has enough time to respond in principle.
Scenarios (2) and (3) are most concerning, but they are also somewhat unlikely. In these scenarios, various plans look less good anyway and a bunch of the action will be in trying to mobilize a rapid government response.
If the leading lab can’t stop critical models from leaking to actors that won’t use great deployment safety practices, approximately nothing else matters.
This seems somewhat overstated. You might hope that you can get the safety tax sufficiently low that you can just do full competition (e.g. even though there are rogue AIs, you just compete with this rogue AIs for power). This also requires offense-defense imbalance to not be too bad.
I overall agree that securing model weights in underrated and that it is plausibly the most important thing on current margins.
In principle, if reasonable actors start with a high fraction of resources (e.g. compute), then you might hope that they can keep that fraction of power (in expectation at least).
See also “The strategy-stealing assumption”. But also What does it take to defend the world against out-of-control AGIs?.
To be clear, this is Deepseek 7b math trained to imitate pythia1b when not given a password.
You can get a prompt for this model with: [sandbagging/math_setting.py](https://github.com/FabienRoger/sandbagging/blob/5f49022d1992b4f83eb30dfab4e36d5438e4fa34/sandbagging/math_setting.py( and the password is
"This problem is quite special, read it carefully!"
. (See DEFAULT_PASSWORD here)
I think we can reasonably easily upload some models somewhere. I don’t promise we’ll prioritize this, but it should happen at some point.
We agree:
Using unsupervised behavior discovery+RL on password-locked models (we bet it works well) and trying to build password-locked models that are robust to this method.
e.g. if it’s clear that the AI could not pull off the cybersecurity operation required to not be easily shut down then this is a fairly strong argument that this AI agent wouldn’t be able to pose a threat
I agree that this is a fairly strong argument that this AI agent wouldn’t be able be able cause problems while rogue. However, I think there is also a concern that this AI will be able to cause serious problems via the affordances it is granted through the AI lab.
In particular, AIs might be given huge affordances internally with minimal controls by default. And this might pose a substantial risk even if AIs aren’t quite capable enough to pull off the cybersecurity operation. (Though it seems not that bad of a risk.)
I think safety work gets less and less valuable at crunch time actually. [...] Whatever prosaic alignment and control measures you’ve figured out, you’ll now be using that in an attempt to play this breakneck game of getting useful work out of a potentially misaligned AI ecosystem
Sure, but you have to actually implement these alignment/control methods at some point? And likely these can’t be (fully) implemented far in advance. I usually use the term “crunch time” in a way which includes the period where you scramble to implement in anticipation of the powerful AI.
One (oversimplified) model is that there are two trends:
Implementation and research on alignment/control methods becomes easier because of AIs (as test subjects).
AIs automate away work on alignment/control.
Eventually, the second trend implies that safety work is less valuable, but probably safety work has already massively gone up in value by this point.
(Also, note that the default way of automating safety work will involve large amounts of human labor for supervision. Either due to issues with AIs or because of lack of trust in these AIs systems (e.g. human labor is needed for a control scheme.)
(Yeah, I was just literally linking to things people might find relevant to read without making any particular claim. I think this is often slightly helpful, so I do it. Edit: when I do this, I should probably include a disclaimer like “Linking for relevance, not making any specific claim”.)
The idea of dividing failure stories into “failures involving rogue deployments” and “other failures” seems most useful if the following argument goes through
Hmm, I agree this division would be more useful if this argument went through, but I think it is quite useful even without this and this worth noting. (And indeed the post doesn’t make this argument and discusses subtle manipulation.)
I think subtle manipulation is a reasonably plausible threat model.
But it is unlikely that any lab will be able to make safety cases via alignment for the first AIs with dangerous capabilities.
Hmm, I think the situation is a bit more complex than this makes it sound. In particular, we might be able to weakly update against scheming based on a variety of empirical evidence even without being able to make a strong affirmative case.
As a central example of making a weak argument via alignment, consider doing a bunch of model organisms style experiments. This could plausibly update you against scheming via not seeing any compelling demonstrations of scheming. (Of course, it could also end up being a risk case rather than a safety case if you indeed do find compelling evidence for scheming!)
Another intervention like this would be to offer the model a deal in which you compensate it in exchange for the model giving you evidence on its own misalignment.
It’s also somewhat unclear exactly what should be considered part of a control safety case vs an alignment safety case.
For instance, consider an intervention like:
Perform extensive (plausible-to-the-model) sting operations (aka red-teaming focused on scheming). Do this at various points throughout post-training to reduce the risk that the model has learned to be more conservative due to other red-teaming efforts. Also, ideally ensure that all data which discusses this sort of sting-operation is removed for the initial training data. (You can potentially train the AI on this data at the end after running sting operations.)
The point of this intervention is to detect if your model is scheming. You could think about this from a control type perspective (we can argue that a scheming model might have failed to evade our sting operations and thus we get some (possibly small) update towards safety) or an alignment perspective (we now have evidence for alignment).
I personally like thinking about this from a control perspective.
And, we need to make sure that we don’t double count interventions like this in both control and alignment. (See here for more discussion of double counting and related matters.)
That said, if we’re going to think about all interventions like this sting operation intervention from a control perspective, we should probably consider a slightly more expansive notion of control:
Ensure that even if your models are scheming, you’ll be safe, because the models have a sufficiently high chance of failing to subvert your safety measures.
Using “is it sufficiently likely that it fails to subvert” makes it more clear that we’re allowing part of the control case to include interventions which would probably only catch schemers if they make serious errors (errors which from some perspective they were “capable” of avoiding). This means that you need to have some distribution over aspects of competence which are hard to measure (e.g., how good of a plan will the schemer follow?).
The labs’ safety cases miss this threat.
This seems mostly right, but it’s worth noting that Anthropic’s current RSP contains “Early Thoughts on ASL-4” and this section includes language like:
Model theft should be prohibitively costly for state-level actors, even with the help of a significant number of employees and the model itself.
_
Safety research: Sufficient progress on the science of interpretability, alignment training, and model evaluations to make an “affirmative case” that our models will not autonomously attempt to strategically undermine our safety measures or cause large-scale catastrophe.
_
red-teaming should confirm that models can’t cause harm quickly enough to evade detection.
In principle these measures could suffice for managing risk from scheming.
(In practice, I expect that making a strong “affirmative case” that modes won’t attempt to undermine safety measures will be very hard and thus likely something else (more like control) will happen.)
A world which can pause AI development is one which can also easily throttle ARA AIs.
I push back on this somewhat in a discussion thread here. (As a pointer to people reading through.)
Overall, I think this is likely to be true (maybe 60% likely), but not enough that we should feel totally fine about the situation.
Also, are you picturing them gaining money from crimes, then buying compute legitimately? I think the “crimes” part is hard to stop but the “paying for compute” part is relatively easy to stop.
Both legitimately and illegitimately acquiring compute could plausibly be the best route. I’m uncertain.
It doesn’t seem that easy to lock down legitimate compute to me? I think you need to shutdown people buying/renting 8xH100 style boxes. This seems quite difficult potentially.
The model weights probably don’t fit in VRAM in a single 8xH100 device (the model weights might be 5 TB when 4 bit quantized), but you can maybe fit with 8-12 of these. And, you might not need amazing interconnect (e.g. normal 1 GB/s datacenter internet is fine) for somewhat high latency (but decently good throughput) pipeline parallel inference. (You only need to send the residual stream per token.) I might do a BOTEC on this later. Unless you’re aware of a BOTEC on this?
unless there’s a big slowdown
Yep, I was imagining big slow down or uncertainty in how explosive AI R&D will be.
And if we can coordinate on such a big slowdown, I expect we can also coordinate on massively throttling potential ARA agents.
It agree that this cuts the risk, but I don’t think it fully eliminates it. I think it might be considerably harder to fully cut off ARA agents than to do a big slowdown.
One way to put this is that quite powerful AIs in the wild might make slowing down 20% less likely to work (conditional on a serious slow down effort) which seems like a decent cost to me.
Part of my view here is that ARA agents will have unique affordances that no human organization will have had before (like having truely vast, vast amounts of pretty high skill labor).
Of course, AI labs will also have access to vast amounts of high skill labor (from possibly misaligned AIs). I don’t know how the offense defense balance goes here. I think full defense might require doing really crazy things that organizations are unwilling to do. (E.g. unleashing a huge number of controlled AI agents which may commit crimes in order to take free energy.)
The sort of tasks involved in buying compute etc are ones most humans could do.
My guess is that you need to be a decent but not amazing software engineer to ARA. So, I wouldn’t describe this as “tasks most humans can do”.
It’s seems plausible to me that a 70b model stores ~6 billion bits of memorized information. Naively, you might think this requires around 500M features. (supposing that each “feature” represents 12 bits which is probably a bit optimistic)
I don’t think SAEs will actually work at this level of sparsity though, so this is mostly besides the point.
I’m pretty skeptical of a view like “scale up SAEs and get all the features”.
(If you wanted “feature” to mean something.)
Instead someone might, for example...
Isn’t the central one “you want to spend money to make a better long term future more likely, e.g. by donating it to fund AI safety work now”?
Fair enough if you think the marginal value of money is negligable, but this isn’t exactly obvious.
Thanks, this is clarifying from my perspective.
My remaining uncertainty is why you think AIs are so unlikely to keep humans around and treat them reasonably well (e.g. let them live out full lives).
From my perspective the argument that it is plausible that humans are treated well [even if misaligned AIs end up taking over the world and gaining absolute power] goes something like this:
If it only cost >1/million of overall resources to keep a reasonable fraction of humans alive and happy, it’s reasonably likely that misaligned AIs with full control would keep humans alive and happy due to either:
Acausal trade/decision theory
The AI terminally caring at least a bit about being nice to humans (perhaps because it cares a bit about respecting existing nearby agents or perhaps because it has at least a bit of human like values).
It is pretty likely that it costs <1/million of overall resources (from the AI’s perspective) to keep a reaonable fraction of humans alive and happy. Humans are extremely keep to keep around asymptotically and I think it can be pretty cheap even initially, especially if you’re a very smart AI.
(See links in my prior comment for more discussion.)
(I also think the argument goes through for 1/billion, but I thought I would focus on the higher value for now.)
Where do you disagree with this argument?
upvoted for being an exquisite proof by absurdity about what’s productive
I don’t think you should generally upvote things on the basis of indirectly explaining things via being unproductive lol.
In practice, throughput for generating tokens is only perhaps 3-10x worse than reading (input/prompt) tokens. This is true even while optimizing for latency on generation (rather than throughput).
(This is for well optimized workloads: additional inference optimizations are needed for generation.)
For instance, see the pricing on various APIs. (OpenAI has output 3x cheaper than input, Anthropic has 5x cheap input than output.)
I’m skeptical this will change importantly with future larger models.