I work at Redwood Research.
ryan_greenblatt
The case for ensuring that powerful AIs are controlled
How useful is mechanistic interpretability?
I think this post is quite misleading and unnecessarily adversarial.
I’m not sure if I want to engage further; I might give examples of this later. (See examples below.) (COI: I often talk to and am friendly with many of the groups criticized in this post.)
Improving the Welfare of AIs: A Nearcasted Proposal
Benchmarks for Detecting Measurement Tampering [Redwood Research]
Catching AIs red-handed
Two problems with ‘Simulators’ as a frame
Quick clarification point.
Under disclaimers you note:
I am not covering training setups where we purposefully train an AI to be agentic and autonomous. I just think it’s not plausible that we just keep scaling up networks, run pretraining + light RLHF, and then produce a schemer.[2]
Later, you say
Let me start by saying what existential vectors I am worried about:
and you don’t mention a threat model like
“Training setups where we train generally powerful AIs with deep serial reasoning (similar to the internal reasoning in a human brain) for an extremely long time on rich outcome-based RL environments, until these AIs learn how to become generically agentic and pursue specific outcomes in a wide variety of circumstances.”
Do you think some version of this could be a serious threat model if (e.g.) this is the best way to make general and powerful AI systems?
I think many of the people who are worried about deceptive alignment type concerns also think that these sorts of training setups are likely to be the best way to make general and powerful AI systems.
(To be clear, I don’t think it’s at all obvious that this will be the best way to make powerful AI systems and I’m uncertain about how far things like human imitation will go. See also here.)
I agree that many terms are suggestive and that you have to actually dissolve the term and think about what is actually going on in the exact training process to understand things. If people don’t break down the term and understand the process at least somewhat mechanistically, they’ll run into trouble.
I think relevant people broadly agree about terms being suggestive and agree that this is bad; they don’t particularly dispute this. (Though probably a bunch of people think it’s less important than you do. I think these terms aren’t that bad once you’ve worked with them in a technical/ML context to a sufficient extent that you detach preexisting conceptions and think about the actual process.)
But, this is pretty different from a much stronger claim you make later:
To be frank, I think a lot of the case for AI accident risk comes down to a set of subtle word games.
When I try to point out such (perceived) mistakes, I feel a lot of pushback, and somehow it feels combative. I do get somewhat combative online sometimes (and wish I didn’t, and am trying different interventions here), and so maybe people combat me in return. But I perceive defensiveness even to the critiques of Matthew Barnett, who seems consistently dispassionate.
Maybe it’s because people perceive me as an Optimist and therefore my points must be combated at any cost.
Maybe people really just naturally and unbiasedly disagree this much, though I doubt it.
I don’t think that “a lot of the case for AI accident risk comes down to a set of subtle word games”. (At least not in the cases for risk which seem reasonably well made to me.) And, people do really “disagree this much” about whether the case for AI accident risk comes down to word games. (But they don’t disagree this much about issues with terms being suggestive.)
It seems important to distinguish between “how bad is current word usage in terms of misleading suggestiveness” and “does the current case for AI accident risk come down to subtle word games”. (I’m not claiming you don’t distinguish between these, I’m just claiming that arguments for the first aren’t really much evidence for the second here and that readers might miss this.)
For instance, do you think that this case for accident risk comes down to subtle word games? I think there are a bunch of object-level ways this threat model could be incorrect, but this doesn’t seem downstream of word usage.
Separately, I agree that many specific cases for AI accident risk seem pretty poor to me. (Though the issue still doesn’t seem like a word-games issue so much as generically sloppy reasoning or generally bad empirical predictions.) And the ones which aren’t poor remain somewhat vague, though this is slowly improving over time.
So I basically agree with:
Do not trust this community’s concepts and memes, if you have the time.
Edit: except that I do think the general take of “holy shit AI (and maybe the singularity), that might be a really big deal” seems pretty solid. And, from there I think there is a pretty straightforward and good argument for at least thinking about the accident risk case.
Regardless, it changed my mind from about 70% likelihood of a lab-leak to about 1-5%. Manifold seems to agree, given the change from ~50% probability of lab-leak winning to 6%.
Manifold updating on who will win the debate to that extent is not the same as Manifold updating to that extent on the probability of lab-leak (which various other markets seem to roughly imply is about 50⁄50, though the operationalization is usually a bit rough because we may not get further evidence).
Here are two specific objections to this post[1]:
AIs which aren’t qualitatively smarter than humans could be transformatively useful (e.g. automate away all human research).
It’s plausible that LLM agents will be able to fully obsolete human research while also being incapable of doing non-trivial consequentialist reasoning in just a forward pass (instead they would do this reasoning in natural language).
AIs which aren’t qualitatively smarter than humans could be transformatively useful
Perfectly aligned AI systems which were exactly as smart as humans and had the same capability profile, but which operated at 100x speed and were cheap would be extremely useful. In particular, we could vastly exceed all current alignment work in the span of a year.
In practice, the capability profile of AIs is unlikely to be exactly the same as humans. Further, even if the capability profile was the same, merely human level systems likely pose substantial misalignment concerns.
However, it does seem reasonably likely that AI systems will have a reasonably similar capability profile to humans and will also run faster and be cheaper. Thus, approaches like AI control could be very useful.
LLM agents with weak forward passes
IMO, there isn’t anything which strongly rules out LLM agents being overall quite powerful while still having weak forward passes. In particular, weak enough that they can’t do non-trivial consequentialist reasoning in a forward pass (while still being able to do this reasoning in natural language). Assuming that we can also rule out steganography and similar concerns, then The Translucent Thoughts Hypotheses would fully apply. In the world where AIs basically can’t do invisible non-trivial consequentialist reasoning, most misalignment threat models don’t apply. (Scheming/deceptive alignment and cleverly playing the training game both don’t apply.)
I think it’s unlikely that AI will be capable of automating R&D while also being incapable of doing non-trivial consequentialist reasoning in a forward pass, but this still seems around 15% likely overall. And if there are quite powerful LLM agents with relatively weak forward passes, then we should update in this direction.
- ↩︎
I edited this comment from “I have two main objections to this post:” because that doesn’t quite seem like a good description of what this comment is saying. See this other comment for more meta-level commentary.
Preventing model exfiltration with upload limits
After spending a while thinking about interpretability, my current stance is:
Let’s define mechanistic interpretability as “a subfield of interpretability that uses bottom-up approaches, generally by mapping low-level components such as circuits or neurons onto components of human-understandable algorithms and then working upward to build an overall understanding.”
I think mechanistic interpretability probably has to succeed very ambitiously to be useful.
Mechanistic interpretability seems to me to be very far from succeeding this ambitiously.
Most people working on mechanistic interpretability don’t seem to me like they’re on a straightforward path to ambitious success, though I’m somewhat on board with the stuff that Anthropic’s interp team is doing here.
Note that this is just for “mechanistic interpretability”. I think that high level top down interpretability (both black box and white box) has a clearer story for usefulness which doesn’t require very ambitious success.
Managing catastrophic misuse without robust AIs
It makes me a bit worried that this post seems to implicitly assume that SAEs work well at their stated purpose. This seems pretty unclear based on the empirical evidence, and I would bet against it.[1]
It also seems to assume that “superposition” and “polysemanticity” are good abstractions for understanding what’s going on. This seems at least unclear to me, though it’s probably at least partially true.
(Precisely, I would bet against “mild tweaks on SAEs will allow interpretability researchers to produce succinct and human-understandable explanations that allow for recovering >75% of the training compute of model components”. Some operationalizations of these terms are explained here. I think people have weaker hopes for SAEs than this, but those are trickier to bet on.)
If I were working on this research agenda, I would be very interested in either:
Finding a downstream task that demonstrates that the core building block works sufficiently well. It’s unclear what this would be given the overall level of ambitiousness. The closest work thus far is this, I think.
Demonstrating strong performance at good notions of “internal validity” like “we can explain >75% of the training compute of this tiny sub-part of a realistic LLM after putting in huge amounts of labor” (>75% of training compute means that if you scaled up this methodology to the whole model, you would get the performance you would get from training the original model with >75% of its training compute; a rough sketch of this metric is below). Note that this doesn’t correspond to reconstruction loss and instead corresponds to the performance of human-interpretable (e.g. natural language) explanations.
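As a rough illustration of this compute-equivalence notion, here is a minimal sketch (my own illustration, not something from the cited work), assuming a hypothetical power-law scaling fit; all constants and numbers are made up:

```python
# Rough, hypothetical illustration of the "fraction of training compute
# explained" metric described above. Assumes a simple power-law scaling fit;
# all constants and numbers below are made up.

def loss_at_compute(compute, a=10.0, b=0.3, c=1.5):
    """Hypothetical scaling law: loss as a function of training compute."""
    return a * compute ** (-b) + c

def compute_equivalent(loss, a=10.0, b=0.3, c=1.5):
    """Invert the scaling law: how much compute ordinary training would need
    to reach this loss."""
    return ((loss - c) / a) ** (-1.0 / b)

original_compute = 1e6  # compute used to train the original model (arbitrary units)
original_loss = loss_at_compute(original_compute)

# Suppose replacing a model component with its human-interpretable explanation
# degrades loss from `original_loss` to `explained_loss`.
explained_loss = original_loss + 0.02

fraction_explained = compute_equivalent(explained_loss) / original_compute
print(f"Fraction of training compute explained: {fraction_explained:.2f}")
# The bet above is roughly about whether this fraction can exceed 0.75.
```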
- ↩︎
To be clear, SAEs seem like a reasonable direction to explore and they very likely improve on the state of the art in at least some cases. It’s just that they don’t clearly work that well at an absolute level.
Measurement tampering detection as a special case of weak-to-strong generalization
I think I basically agree with “current models don’t seem very helpful for bioterror” and as far as I can tell, “current papers don’t seem to do the controlled experiments needed to legibly learn that much either way about the usefulness of current models” (though generically evaluating bio capabilities prior to actual usefulness still seems good to me).
I also agree that current empirical work seems unfortunately biased, which is pretty sad. (It also seems to me like the claims made are somewhat sloppy in various cases and the work doesn’t really do the science needed to test claims about the usefulness of current models.)
That said, I think you’re exaggerating the situation in a few cases or understating the case for risk.
You need to also compare the good that open source AI would do against the likelihood and scale of the increased biorisk. The 2001 anthrax attacks killed 5 people; if open source AI accelerated the cure for several forms of cancer, then even a hundred such attacks could easily be worth it. Serious deliberation about the actual costs of criminalizing open source AI—deliberations that do not rhetorically minimize such costs, shrink from looking at them, or emphasize “other means” of establishing the same goal that in fact would only do 1% of the good—would be necessary for a policy paper to be a policy paper and not a puff piece.
I think comparing to known bioterror cases is probably misleading due to the potential for tail risks and difficulty in attribution. In particular, consider covid. It seems reasonably likely that covid was an accidental lab leak (though attribution is hard) and it also seems like it wouldn’t have been that hard to engineer covid in a lab. And the damage from covid is clearly extremely high, much higher than the anthrax attacks you mention. People in biosecurity think that the tails are more like billions dead or the end of civilization. (I’m not sure if I believe them; the public object-level cases for this don’t seem that amazing due to info-hazard concerns.)
Further, suppose that open-sourced AI models could assist substantially with curing cancer. In that world, what probability would you assign to these AIs also assisting substantially with bioterror? It seems pretty likely to me that things are dual use in this way. I don’t know if policy papers are directly making this argument: “models will probably be about as useful for curing cancer as for bioterror, so if you want open source models to be very useful for biology, we might be in trouble”.
This citation and funding pattern leads me to consider two potential hypotheses:
What about the hypothesis (I wrote this based on the first hypothesis you wrote):
Open Philanthropy thought, based on priors and basic reasoning, that it’s pretty likely that LLMs (if they’re a big deal at all) would be very useful for bioterrorists. Moved by preparing for the possibility that LLMs could help with bioterror, they funded preliminary experiments designed to roughly test LLM capabilities for bio (in particular creating pathogens), so that we’d be ready to test future LLMs for bio capabilities. Alas, the resulting papers misleadingly suggest that their results imply that current models are actually useful for bioterror, rather than just that “it seems likely on priors” and “models can do some biology, and probably this will transfer to bioterror as they get better”. These papers—some perhaps in themselves blameless, given the tentativeness of their conclusions—are misrepresented by subsequent policy papers that cite them. Thus, due to no one’s intent, insufficiently justified concerns about current open-source AI are propagated to governance orgs, which recommend banning open source based on this research. Concerns about future open source models remain justified, as most of our evidence for this came from basic reasoning rather than experiments.
Edit: see comment from elifland below. It doesn’t seem like current policy papers are advocating for bans on current open source models.
I feel like the main take is “probably if models are smart, they will be useful for bioterror” and “probably we should evaluate this on an ongoing basis and be careful, because it’s easy to finetune open source models and you can’t undo open-sourcing a model”.
This section doesn’t prove that scheming is impossible, it just dismantles a common support for the claim.
It’s worth noting that this exact counting argument (counting functions) isn’t an argument that the people typically associated with counting arguments (e.g. Evan) endorse as what they were trying to argue.[1]
See also here, here, here, and here.
(Sorry for the large number of links. Note that these links don’t present independent evidence, so the quantity of links shouldn’t itself be an update: the conversation is just very diffuse.)
- ↩︎
Of course, it could be that counting in function space is a common misinterpretation. Or, more egregiously, people could be doing post-hoc rationalization even though they were de facto reasoning about the situation using counting in function space.
- ↩︎
Examples:
It seems to conflate scaling pauses (which aren’t clearly very useful) with pausing all AI-related progress (hardware, algorithmic development, software). Many people think that scaling pauses aren’t clearly that useful due to overhang issues, but that hardware pauses are pretty great. However, hardware development and production pauses would clearly be extremely difficult to implement. IMO the sufficient “pause AI” ask is more like “ask nvidia/tsmc/etc to mostly shut down” rather than “ask AGI labs to pause”.
More generally, figuring out the exact type of pause which would actually be better than (e.g.) well-implemented RSPs is a non-trivial technical problem, which makes this complex to communicate. I think this is a major reason why people don’t say stuff like “obviously, a full pause with XYZ characteristics would be better”. For instance, if I were running the US, I’d probably slow down scaling considerably, but I’d mostly be interested in implementing safety standards similar to RSPs due to the lack of strong international coordination.
The post says “many people believe” a “pause is necessary” claim[1], but the exact claim you state probably isn’t actually believed by the people you cite below without additional complications. Like, what exact counterfactuals are you comparing? For instance, I think that well-implemented RSPs required by a regulatory agency can reduce risk to <5% (partially by stopping in worlds where this appears needed). So, as an example, I don’t believe a scaling pause is necessary, and other interventions would probably reduce risk more (while also probably being politically easier). And I think a naive “AI scaling pause” doesn’t reduce risk that much, certainly less than a high-quality US regulatory agency which requires and reviews something like RSPs. When claiming “many people believe”, I think you should make a more precise claim that the people you name actually believe.
Calling something a “pragmatic middle ground” doesn’t imply that there aren’t better options (e.g., shut down the whole hardware industry).
For instance, I don’t think it’s “lying” when people advocate for partial reductions in nuclear arms without noting that it would be better to secure sufficient international coordination to guarantee world peace. Like world peace would be great, but idk if it’s necessary to talk about. (There is probably less common knowledge in the AI case, but I think this example mostly holds.)
This post says “When called out on this, most people we talk to just fumble.” I strongly predict that the people actually mentioned in the part above this (Open Phil, Paul, ARC evals, etc.) don’t actually fumble and have a reasonable response. So, I think this misleadingly conflates the responses of two different groups at best.
More generally, this post seems to claim people have views that I don’t actually think they have and assumes the motives for various actions are powerseeking without any evidence for this.
The use of the term “lying” seems like a case of the “noncentral fallacy” to me. The post presupposes a communication/advocacy norm and states that violations of this norm should be labeled “lying”. I’m not sure I’m sold on this communication norm in the first place. (Edit: I think “say the ideal thing” shouldn’t be a norm (something where we punish people who violate it), but it does seem probably good in many cases to state the ideal policy.)
The exact text from the post is: