I work at Redwood Research.
ryan_greenblatt
I think this post is quite misleading and unnecessarily adversarial.
I’m not sure if I want to engage further; I might give examples of this later. (See examples below.) (COI: I often talk to and am friendly with many of the groups criticized in this post.)
Quick clarification point.
Under disclaimers you note:
I am not covering training setups where we purposefully train an AI to be agentic and autonomous. I just think it’s not plausible that we just keep scaling up networks, run pretraining + light RLHF, and then produce a schemer.[2]
Later, you say
Let me start by saying what existential vectors I am worried about:
and you don’t mention a threat model like
“Training setups where we train generally powerful AIs with deep serial reasoning (similar to the internal reasoning in a human brain) for an extremely long time on rich outcomes based RL environment until these AIs learn how to become generically agentic and pursue specific outcomes in a wide variety of circumstances.”
Do you think some version of this could be a serious threat model if (e.g.) this is the best way to make general and powerful AI systems?
I think many of the people who are worried about deceptive alignment type concerns also think that these sorts of training setups are likely to be the best way to make general and powerful AI systems.
(To be clear, I don’t think it’s at all obvious that this will be the best way to make powerful AI systems and I’m uncertain about how far things like human imitation will go. See also here.)
I agree that many terms are suggestive and you have to actually dissolve the term and think about the actual action of what is going on in the exact training process to understand things. If people don’t break down the term and understand the process at least somewhat mechanistically, they’ll run into trouble.
I think relevant people broadly agree about terms being suggestive and agree that this is bad; they don’t particularly dispute this. (Though probably a bunch of people think it’s less important than you do. I think these terms aren’t that bad once you’ve worked with them in a technical/ML context to a sufficient extent that you detach preexisting conceptions and think about the actual process.)
But, this is pretty different from a much stronger claim you make later:
To be frank, I think a lot of the case for AI accident risk comes down to a set of subtle word games.
When I try to point out such (perceived) mistakes, I feel a lot of pushback, and somehow it feels combative. I do get somewhat combative online sometimes (and wish I didn’t, and am trying different interventions here), and so maybe people combat me in return. But I perceive defensiveness even to the critiques of Matthew Barnett, who seems consistently dispassionate.
Maybe it’s because people perceive me as an Optimist and therefore my points must be combated at any cost.
Maybe people really just naturally and unbiasedly disagree this much, though I doubt it.
I don’t think that “a lot of the case for AI accident risk comes down to a set of subtle word games”. (At least not in the cases for risk which seem reasonably well made to me.) And, people do really “disagree this much” about whether the case for AI accident risk comes down to word games. (But they don’t disagree this much about issues with terms being suggestive.)
It seems important to distinguish between “how bad is current word usage in terms of misleading suggestiveness” and “is the current case for AI accident risk coming down to subtle word games”. (I’m not claiming you don’t distinguish between these, I’m just claiming that arguments for the first aren’t really much evidence for the second here and that readers might miss this.)
For instance, do you think that this case for accident risk comes down to subtle word games? I think there are a bunch of object-level ways this threat model could be incorrect, but this doesn’t seem downstream of word usage.
Separately, I agree that many specific cases for AI accident risk seem pretty poor to me. (Though the issue still doesn’t seem like a word games issue as opposed to generically sloppy reasoning or generally having bad empirical predictions.) And the ones which aren’t poor remain somewhat vague, though this is slowly improving over time.
So I basically agree with:
Do not trust this community’s concepts and memes, if you have the time.
Edit: except that I do think the general take of “holy shit AI (and maybe the singularity), that might be a really big deal” seems pretty solid. And, from there I think there is a pretty straightforward and good argument for at least thinking about the accident risk case.
Regardless, it changed my mind from about 70% likelihood of a lab-leak to about 1-5%. Manifold seems to agree, given the change from ~50% probability of lab-leak winning to 6%.
Manifold updating on who will win the debate to that extent is not the same as Manifold updating to that extent on the probability of lab-leak (which various other markets seem to roughly imply is about 50/50, though the operationalization is usually a bit rough because we may not get further evidence).
Here are two specific objections to this post[1]:
AIs which aren’t qualitatively smarter than humans could be transformatively useful (e.g. automate away all human research).
It’s plausible that LLM agents will be able to fully obsolete human research while also being incapable of doing non-trivial consequentialist reasoning in just a forward pass (instead they would do this reasoning in natural language).
AIs which aren’t qualitatively smarter than humans could be transformatively useful
Perfectly aligned AI systems which were exactly as smart as humans and had the same capability profile, but which operated at 100x speed and were cheap, would be extremely useful. In particular, we could vastly exceed all current alignment work in the span of a year.
In practice, the capability profile of AIs is unlikely to be exactly the same as humans. Further, even if the capability profile was the same, merely human level systems likely pose substantial misalignment concerns.
However, it does seem reasonably likely that AI systems will have a reasonably similar capability profile to humans and will also run faster and be cheaper. Thus, approaches like AI control could be very useful.
LLM agents with weak forward passes
IMO, there isn’t anything which strongly rules out LLM agents being overall quite powerful while still having weak forward passes. In particular, weak enough that they can’t do non-trivial consequentialist reasoning in a forward pass (while still being able to do this reasoning in natural language). Assuming that we can also rule out steganography and similar concerns, then The Translucent Thoughts Hypotheses would fully apply. In the world where AIs basically can’t do invisible non-trivial consequentialist reasoning, most misalignment threat models don’t apply. (Scheming/deceptive alignment and cleverly playing the training game both don’t apply.)
I think it’s unlikely that AI will be capable of automating R&D while also being incapable of doing non-trivial consequentialist reasoning in a forward pass, but this still seems around 15% likely overall. And if there are quite powerful LLM agents with relatively weak forward passes, then we should update in this direction.
- ↩︎
I edited this comment from “I have two main objections to this post:” because that doesn’t quite seem like a good description of what this comment is saying. See this other comment for more meta-level commentary.
After spending a while thinking about interpretability, my current stance is:
Let’s define Mechanistic interpretability as “A subfield of interpretability that uses bottom-up approaches, generally by mapping low-level components such as circuits or neurons onto components of human-understandable algorithms and then working upward to build an overall understanding.”
I think mechanistic interpretability probably has to succeed very ambitiously to be useful.
Mechanistic interpretability seems to me to be very far from succeeding this ambitiously
Most people working on mechanistic interpretability don’t seem to me like they’re on a straightforward path to ambitious success, though I’m somewhat on board with the stuff that Anthropic’s interp team is doing here.
Note that this is just for “mechanistic interpretability”. I think that high level top down interpretability (both black box and white box) has a clearer story for usefulness which doesn’t require very ambitious success.
It makes me a bit worried that this post seems to implicitly assume that SAEs work well at their stated purpose. This seems pretty unclear based on the empirical evidence and I would bet against it.[1]
It also seems to assume that “superposition” and “polysemanticity” are good abstractions for understanding what’s going on. This seems at least unclear to me, though it’s probably at least partially true.
(Precisely, I would bet against “mild tweaks on SAEs will allow for interpretability researchers to produce succinct and human understandable explanations that allow for recovering >75% of the training compute of model components”. Some operationalizations of these terms are explained here. I think people have weaker hopes for SAEs than this, but they’re trickier to bet on.)
If I was working on this research agenda, I would be very interested in either:
Finding a downstream task that demonstrates that the core building block works sufficiently well. It’s unclear what this would be given the overall level of ambitiousness. The closest work thus far is this, I think.
Demonstrating strong performance at good notions of “internal validity” like “we can explain >75% of the training compute of this tiny sub part of a realistic LLM after putting in huge amounts of labor” (>75% of training compute means that if you scaled up this methodology to the whole model you would get performance which is what you would get with >75% of the training compute used on the original model). Note that this doesn’t correspond to reconstruction loss and instead corresponds to the performance of human interpretable (e.g. natural language) explanations. (A rough sketch of this compute-matching metric follows this list.)
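To illustrate the second bullet, here’s a minimal sketch of how the compute-matching metric could be computed, assuming you have a loss-vs-compute scaling curve for the model family and a measured loss for the explanation-based reconstruction. All names and numbers below are hypothetical; this is my own sketch, not an operationalization anyone has committed to.

```python
import numpy as np

def compute_equivalent_fraction(compute_grid, loss_curve, explained_loss, total_compute):
    """Map the loss of the explanation-based reconstruction back onto the
    scaling curve and report the equivalent fraction of training compute.

    compute_grid, loss_curve: empirical (compute, loss) points for the model
    family (loss assumed monotonically decreasing in compute).
    explained_loss: loss when model components are replaced by their
    human-understandable explanations.
    total_compute: compute used for the original model.
    """
    log_compute = np.log(compute_grid)
    # np.interp needs increasing x-values, so interpolate over the reversed
    # (increasing-loss) arrays, working in log-compute space.
    log_c_equiv = np.interp(explained_loss, loss_curve[::-1], log_compute[::-1])
    return float(np.exp(log_c_equiv) / total_compute)

# Hypothetical scaling curve and explanation loss, purely for illustration.
compute = np.array([1e20, 3e20, 1e21, 3e21, 1e22])
loss = np.array([2.9, 2.7, 2.5, 2.35, 2.2])
print(compute_equivalent_fraction(compute, loss, explained_loss=2.3, total_compute=1e22))
```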
- ↩︎
To be clear, SAEs seem like a reasonable direction to explore and they very likely improve on the state of the art in at least some cases. It’s just that they don’t clearly work that well at an absolute level.
I think I basically agree with “current models don’t seem very helpful for bioterror” and as far as I can tell, “current papers don’t seem to do the controlled experiments needed to legibly learn that much either way about the usefulness of current models” (though generically evaluating bio capabilities prior to actual usefulness still seems good to me).
I also agree that current empirical work seems unfortunately biased, which is pretty sad. (It also seems to me like the claims made are somewhat sloppy in various cases and the work doesn’t really do the science needed to test claims about the usefulness of current models.)
That said, I think you’re exaggerating the situation in a few cases or understating the case for risk.
You need to also compare the good that open source AI would do against the likelihood and scale of the increased biorisk. The 2001 anthrax attacks killed 5 people; if open source AI accelerated the cure for several forms of cancer, then even a hundred such attacks could easily be worth it. Serious deliberation about the actual costs of criminalizing open source AI—deliberations that do not rhetorically minimize such costs, shrink from looking at them, or emphasize “other means” of establishing the same goal that in fact would only do 1% of the good—would be necessary for a policy paper to be a policy paper and not a puff piece.
I think comparing to known bio terror cases is probably misleading due to the potential for tail risks and difficulty in attribution. In particular, consider covid. It seems reasonably likely that covid was an accidental lab leak (though attribution is hard) and it also seems like it wouldn’t have been that hard to engineer covid in a lab. And the damage from covid is clearly extremely high. Much higher than the anthrax attacks you mention. People in biosecurity think that the tails are more like billions dead or the end of civilization. (I’m not sure if I believe them, the public object level cases for this don’t seem that amazing due to info-hazard concerns.)
Further, suppose that open-sourced AI models could assist substantially with curing cancer. In that world, what probability would you assign to these AIs also assisting substantially with bioterror? It seems pretty likely to me that things are dual use in this way. I don’t know if policy papers are directly making this argument: “models will probably be about as useful for curing cancer as for bioterror, so if you want open source models to be very useful for biology, we might be in trouble”.
This citation and funding pattern leads me to consider two potential hypotheses:
What about the hypothesis (I wrote this based on the first hypothesis you wrote):
Open Philanthropy thought based on priors and basic reasoning that it’s pretty likely that LLMs (if they’re a big deal at all) would be very useful for bioterrorists. Moved by preparing for the possibility that LLMs could help with bioterror, they fund preliminary experiments designed to roughly test LLM capabilities for bio (in particular creating pathogens). Then, we’d be ready to test future LLMs for bio capabilities. Alas, the resulting papers misleadingly suggest that their results imply that current models are actually useful for bioterror rather than just that “it seems likely on priors” and “models can do some biology, probably this will transfer to bioterror as they get better”. These papers—some perhaps in themselves blameless, given the tentativeness of their conclusions—are misrepresented by subsequent policy papers that cite them. Thus, due to no one’s intent, insufficiently justified concerns about current open-source AI are propagated to governance orgs, which recommend banning open source based on this research. Concerns about future open source models remain justified as most of our evidence for this came from basic reasoning rather than experiments.
Edit: see comment from elifland below. It doesn’t seem like current policy papers are advocating for bans on current open source models.
I feel like the main take is “probably if models are smart, they will be useful for bioterror” and “probably we should evaluate this ongoingly and be careful because it’s easy to finetune open source models and you can’t retract open-sourcing a model”.
This section doesn’t prove that scheming is impossible, it just dismantles a common support for the claim.
It’s worth noting that this exact counting argument (counting functions), isn’t an argument that people typically associated with counting arguments (e.g. Evan) endorse as what they were trying to argue about.[1]
See also here, here, here, and here.
(Sorry for the large number of links. Note that these links don’t present independent evidence and thus the quantity of links shouldn’t be updated upon: the conversation is just very diffuse.)
- ↩︎
Of course, it could be that counting in function space is a common misinterpretation. Or more egregiously, people could be doing post-hoc rationalization even though they were de facto reasoning about the situation using counting in function space.
- ↩︎
I would summarize this result as:
If you train models to say “there is a reason I should insert a vulnerability” and then to insert a code vulnerability, then this model will generalize to doing “bad” behavior and making up specific reasons for doing that bad behavior in other cases. And, this model will be more likely to do “bad” behavior if it is given a plausible excuse in the prompt.
Does this seem like a good summary?
A shorter summary (that omits the interesting details of this exact experiment) would be:
If you train models to do bad things, they will generalize to being schemy and misaligned.
This post presents an interesting result and I appreciate your write up, though I feel like the title, TL;DR, and intro seem to imply this result is considerably more “unprompted” than it actually is. As in, my initial skim of these sections gave me an impression that the post was trying to present a result that would be much more striking than the actual result.
I can’t tell if this post is trying to discuss communicating about anything related to AI or alignment or is trying to more specifically discuss communication aimed at general audiences. I’ll assume it’s discussing arbitrary communication on AI or alignment.
I feel like this post doesn’t engage sufficiently with the costs associated with high-effort writing and the alternatives to targeting arbitrary lesswrong users interested in alignment.
For instance, when communicating research it’s cheaper to instead just communicate to people who are operating within the same rough paradigm and ignore people who aren’t sold on the rough premises of this paradigm. If this results in other people having trouble engaging, this seems like a reasonable cost in many cases.
An even narrower approach is to only communicate beliefs in person to people who you frequently talk to.
Since there are “more” possible schemers than non-schemers, the argument goes, we should expect training to produce schemers most of the time. In Carlsmith’s words:
It’s important to note that the exact counting argument you quote isn’t one that Carlsmith endorses, just one that he is explaining. And in fact Carlsmith specifically notes that you can’t just apply something like the principle of indifference without more reasoning about the actual neural network prior.
(You mention this later in the “simplicity arguments” section, but I think this objection is sufficiently important and sufficiently missing early in the post that it is important to emphasize.)
Quoting somewhat more context:
I start, in section 4.2, with what I call the “counting argument.” It runs as follows:
The non-schemer model classes, here, require fairly specific goals in order to get high reward.
By contrast, the schemer model class is compatible with a very wide range of (beyond-episode) goals, while still getting high reward (at least if we assume that the other requirements for scheming to make sense as an instrumental strategy are in place—e.g., that the classic goal-guarding story, or some alternative, works).
In this sense, there are “more” schemers that get high reward than there are non-schemers that do so.
So, other things equal, we should expect SGD to select a schemer.
Something in the vicinity accounts for a substantial portion of my credence on schemers (and I think it often undergirds other, more specific arguments for expecting schemers as well). However, the argument I give most weight to doesn’t move immediately from “there are more possible schemers that get high reward than non-schemers that do so” to “absent further argument, SGD probably selects a schemer” (call this the “strict counting argument”), because it seems possible that SGD actively privileges one of these model classes over the others. Rather, the argument I give most weight to is something like:
It seems like there are “lots of ways” that a model could end up a schemer and still get high reward, at least assuming that scheming is in fact a good instrumental strategy for pursuing long-term goals.
So absent some additional story about why training won’t select a schemer, it feels, to me, like the possibility should be getting substantive weight. I call this the “hazy counting argument.” It’s not especially principled, but I find that it moves me
[Emphasis mine.]
This seems like a good thing for labs to do[1]. I’d go one step earlier and propose that labs make a clear and explicit page (on their website or similar) stating their views on the risk from powerful AI systems. The proposal given in this post seems somewhat more ambitious and costly than the thing I’m proposing in this comment, though the proposal in the post is correspondingly somewhat better.
As far as what a “page stating their views on risk” looks like, I’m imagining something like (numbers are made up):
Views within [AI LAB] vary, but leadership thinks that the risk of AI killing over 1 billion people is around 10%. We think the probability of an AI originating from [AI LAB] killing over 1 billion people in the next 5 years is around 3%. We specifically think that the probability of an AI system mostly autonomously taking over a large chunk of the world is around 10%. The risk of an AI system assisting with terrorism which kills over 1 billion people is around 3%. …
AI labs often use terms like “AI safety” and “catastrophe”. It’s probably unclear what problem these terms are pointing at. I’d like it if whenever they said “catastrophe” they say something like:
We do XYZ to reduce the probability of AI caused catastrophe (e.g. the deaths of over 1 billion people, see here for our views on AI risk)
Where here links to the page discussed above.
And similar for using the terms AI safety:
by AI safety, we primarily mean the problem of avoiding AI caused catastrophe (e.g. the deaths of over 1 billion people, see here for our views on AI risk)
I’d consider this ask fulfilled even if this page stated quite optimistic views. At that point, there would be a clear disagreement to highlight.
I’m not sure about how costly these sorts of proposals are (e.g. because it makes customers think you’re crazy). Possibly, labs could coordinate to release things like this simultaneously to avoid tragedy of the commons (there might be anti-trust issues with this).
- ↩︎
Though I maybe disagree with various specific statements in this post.
- ↩︎
Another interesting fact is that without any NN, but using the rest of the approach from the paper, their method gets 18/30 correct. The NN boosts this to 25/30. The prior SOTA was 10/30 (also without a NN).
So arguably about 1/2 of the action is just improvements in the non-AI components.
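Spelling out the arithmetic behind “about 1/2”, measuring each improvement relative to the prior SOTA baseline:

```latex
\frac{18 - 10}{25 - 10} = \frac{8}{15} \approx 0.53
```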
scaling up networks, running pretraining + light RLHF, probably doesn’t by itself produce a schemer
I agree with this point as stated, but think the probability is more like 5% than 0.1%. So probably no scheming, but this is hardly hugely reassuring. The word “probably” still leaves in a lot of risk; I also think statements like “probably misalignment won’t cause x-risk” are true![1]
(To your original statement, I’d also add the additional caveat of this occurring “prior to humanity being totally obsoleted by these AIs”. I basically just assume this caveat is added everywhere, otherwise we’re talking about some insane limit.)
Also, are you making sure to condition on “scaling up networks via running pretraining + light RLHF produces transformatively powerful AIs which obsolete humanity”? If you don’t condition on this, it might be an uninteresting claim.
Separately, I’m uncertain whether the current training procedure of current models like GPT-4 or Claude 3 is still well described as just “light RLHF”. I think the training procedure probably involves doing quite a bit of RL with outcomes-based feedback on things like coding. (Should this count as “light”? Probably the amount of training compute on this RL is still small?)
- ↩︎
And I think misalignment x-risk is substantial and worthy of concern.
- ↩︎
The 5 main reasons why control could fail
Here are the 5 main categories of reasons why control could fail (though various of them are often correlated).
[Note: this document has a moderate amount of overlap with various prior documents Redwood has released, but it seemed like a useful document nonetheless.]
1. Insufficient (long horizon) sample efficiency: If your AI is scheming, you won’t be able to get this AI to achieve near maximal reward on average on (long horizon) tasks with reasonable sample efficiency. By near maximal, we just mean empirically achieving average reward which is quite close to the reward the AI would get if it was “actually trying” to get high reward (insofar as it is scheming to an extent that “trying” makes sense). (When we say average, we actually mean average excluding rare outliers. We can’t control behavior on rare outliers using normal ML training.) For control, this causes issues in two places: it interferes with getting useful work out of AIs and being able to run accurate capability evaluations.
2. Auditing failures: inability for humans to sufficiently assess actions. (Though note there can be solutions to these failures which don’t involve improving auditing.) This includes:
2.a. Average case usefulness: For scheming AIs, you won’t be able to check that the work they’re doing is actually good (on average) rather than just looking good to you. Note that this includes getting enough actually good work out of AIs.
2.b. Average case safety: Direct safety issues that occur due to auditing failing to notice a large number of actively dangerous actions and then these actions build up into a catastrophic outcome. (E.g., the AI slowly persuades some group in the AI lab to become AI cultists.) One particularly salient special case is:
2.b.i. We can’t prevent AIs from building up huge amounts of coup science: Control will depend on human researchers being similarly good or better than the AIs at understanding our countermeasures and the space of possible AI attack strategies. See Notes on control evaluations for safety cases for more discussion.
2.c. High-stakes safety: The auditing process fails to detect that a single action is very bad when it reviews that action and this is extremely bad because that action is part of a high-stakes failure (e.g., an escape attempt that occurs in a small number of actions).
3. Concentrated failures: inability to prevent concentrated catastrophes (safety failures which are concentrated into a small number of bad actions or a small amount of serial time) despite having an expensive and slow auditing process which could detect this bad behavior (it works for high-stakes safety) if it was assigned to review these actions.
4. Lack of institutional will or ability: Even if control would work given a few years of delay and an implementation carried out by a quite well-run organization, you won’t get one of these in practice and thus you’ll have trouble. (This argument also applies to any other technical approaches where the implementation requires substantial institutional will, therefore we need some sort of technical solutions which require less institutional competence.)
5. Catching the AI isn’t sufficiently useful: The viability of control will likely depend on it being very useful to catch the AI one or a small number of times. But, this might turn out to be false for various reasons. See Catching AIs red-handed for discussion. This might depend heavily on questions which aren’t about technical AI safety like how useful legible evidence of serious misalignment is. But, the technical usefulness of catching could suffice.
We think that various of these things could go wrong in practice. However, it seems like the current point on the logistic success curve is such that if an AI lab tried very hard to evaluate and ensure control, they could improve the situation substantially (e.g., 3-5x reduction in doom) relative to not trying very hard.
Note that 2, 3, and 4 of the above failures probably just prevent us from building sufficiently safe and useful protocols without breaking control evaluations. 1 and 2.b.i (coup science build-up) are likely to break control evaluations, though it might be possible to recognize that we’re in a world where 1 and/or 2.b.i hold. 5 has more complex effects, but we’d probably be able to safely stop after realizing that we’re unable to extract enough value from the times when we’ve caught the AI; so the main concern would be that we’re somewhat clearly unable to safely continue rather than being caught dangerously unaware.
See here for more discussion.
Here are some reasons why control might not be worth working on (or actively bad to work on) regardless of the above reasons:
Control is bad for coordination: Perhaps control does improve safety, but it does so at the cost of making it easier to make excuses, giving false confidence, and/or creating slippery slopes.
Control makes it cheaper and easier to do morally problematic things: Control might make it more commercially valuable and technically feasible to employ AIs in situations where we know these AIs don’t consent to this usage (and these AIs are well described as having preferences that deserve moral consideration), and this might make violations of basic moral principles (such as slavery) more likely. (In the same way as better technology for managing human insider threats could make human slavery more commercially valuable. Of course, in the case of human slavery, there are already institutions and norms which combat slavery.) However, note that our recommendation to society would be to try extremely hard to avoid employing AIs without their consent; if AIs don’t consent to employment we should (temporarily) shut down or negotiate a deal with the AIs. Much more commentary on these topics can be seen here. Our overall view is that the default AI situation with only current alignment/safety/interpretability technology is somewhat scary from an AI welfare/rights perspective, but that control doesn’t make the situation worse.
Scheming is extremely unlikely for early transformatively useful AI (<1%) and control-style approaches won’t be a useful safeguard against other threat models: Control focuses on making conservative assumptions about the effects of training and the motivations of AIs. If we didn’t have to make such conservative assumptions, then control-type approaches look much less worthwhile. However, control-type things might still be a useful line of defense against failure modes which aren’t well described as scheming.
Baseline control strategies will work (when control is at all possible) and control evaluations are easy to run: If the baseline approaches can’t be improved on and there isn’t important research on the evaluations, then technical work on control isn’t very important. Policy/governance work might still be important. See also here.
I like this post and I agree with almost all the claims it makes. But I think it underestimates the potential of approaches other than adversarial robustness for effectively mitigating catastrophic misuse done via an API[1].
[Thanks to Buck Shlegeris, Fabien Roger, and GPT-4 for help editing this comment. I’ll probably turn this into a top-level post at some point. Edit: the top level post is now up (linked).]
I’ll go through bio and cyber as examples of how I imagine mitigating catastrophic misuse via an API without requiring better adversarial robustness than exhibited by current LLM APIs like Claude 2 and GPT-4. I think that the easiest mitigations for bioterrorism and cybercrime are fairly different, because of the different roles that LLMs play in these two threat models.
The mitigations I’m describing are non-trivial, and it’s unclear if they will happen by default. But regardless, this type of approach seems considerably easier to me than trying to achieve very high levels of adversarial robustness.
Mitigations for bioterrorism
In this section, I’ll describe how I imagine handling bioterrorism risk for an API deploying powerful models (e.g., ASL-3/ASL-4).
As I understand it, the main scenario by which LLMs cause bioterrorism risk is something like the following: there’s a team of relatively few people, who are not top experts in the relevant fields but who want to do bioterrorism for whatever reason. Without LLMs, these people would struggle to build bioweapons—they wouldn’t be able to figure out various good ideas, and they’d get stuck while trying to manufacture their bioweapons (perhaps like Aum Shinrikyo). But with LLMs, they can get past those obstacles.
(I’m making the assumption here that the threat model is more like “the LLM gives the equivalent of many hours of advice” rather than “the LLM gives the equivalent of five minutes of advice”. I’m not a biosecurity expert and so don’t know whether that’s an appropriate assumption to make; it probably comes down to questions about what the hard steps in building catastrophic bioweapons are. I don’t see why LLMs would zero-shot be able to effectively answer extremely broad and difficult questions like “how to build a really bad bioweapon”, so it seems to me like the LLMs probably have to do at least a moderate amount of back-and-forth in order to be wildly helpful.)
And so in this case, the LLMs caused problems not by being superhuman, but by allowing aspiring bioterrorists to get a fairly large amount of expert advice that human experts would refuse to give them.
Thus, our countermeasures will heavily utilize the assumption that a large number of queries to the AI are likely required to substantially enable catastrophic bioterrorism. We’ll assume the number of required queries is likely at least hundreds, but possibly thousands[2].
First, we’ll remove bio data (particularly synth bio and virology) from the training corpus and verify that this will result in a model with weak biology knowledge. This is sufficient if it’s acceptable to not have any model deployed which can handle biology questions.
However, it might be important for some users to be able to ask bio questions. If so, we’ll create an additional API endpoint with a model created by fine-tuning our normal base model (without bio) on the excluded biology data. Now, we’ll need countermeasures on this additional bio endpoint to prevent misuse.
For this additional endpoint, we’ll use the following countermeasures:
We require KYC (Know Your Customer) checks to access the powerful bio model API endpoint (or chat service or whatever) such as requiring submitting a passport. This is to make it hard to have more than one account and also to ensure that potential bioterrorists can be criminally prosecuted.
We make this API endpoint have substantially higher costs (e.g., 5x) to ensure that users avoid using it when possible and to cover the costs of some of the below interventions. We might also include a fixed monthly cost for each user.
The actual model which the endpoint queries is trained to not answer questions about bioterror, and we attempt to make this reasonably robust. (It should be fine to achieve similar levels of robustness to the current GPT-4 API or Claude 2 API.) This countermeasure isn’t very key for the overall proposal, but it seems somewhat useful.
We’ll build an automated bioterror monitoring scheme which involves an ensemble of different AI checks. The ensemble could involve different base models (perhaps one base model which is dumber than the actual AI used and a base model which is as smart as the actual AI used), variations in the classifier training data, only looking at the input or only looking at the output, or randomly omitting various parts of the input/output. (A rough sketch of how this monitoring and auditing flow could fit together appears after this list.)
For cost reasons, we might randomly select a subset of the ensemble of approaches to apply to each input.
This monitoring ensemble might also see information about some of the user’s prior queries or even the user’s background and their stated reason for using the API (submitted as part of KYC).
This monitoring also looks for people attempting to jailbreak the model in addition to people directly trying bioterror.
Training the monitoring might involve the use of secret information about potential bio-risk.
ETA: We of course need to be worried about people jailbreaking our monitoring models. The key hope is that it will be very hard to reliably jailbreak all models in the ensemble given that users won’t be able to iterate against the process. I discuss this more in this comment below.
Domain expert humans will manually audit the queries (and responses) which the process described above rates as most suspicious. Perhaps we have humans carefully audit the most suspicious-seeming 1% of queries. We might also sample some queries to audit uniformly at random or ensure that we always sample at least a few queries from each user.
If our human auditing finds evidence of bioterror, the individual is permanently banned from the API and referred to the FBI (or other bodies) for investigation. We also consider suing the individual in civil court for contract violation. We also consider banning anyone closely associated with the banned individual at least temporarily and take into account potential connections with banned individuals in future KYC checks.
If our human auditing process finds evidence for (attempted) jailbreaking, the individual is permanently banned from the API, and we consider suing them for contract violation.
A key aspect of this approach is that we prevent users from iterating against our most important countermeasures. They shouldn’t notice that they failed to evade our countermeasures prior to being banned (in addition to potential criminal investigation and suing them). ETA: this makes it hard to reliably bypass all of the monitoring models.
We verify this approach worked as expected by ensuring that a red team which contains both bio experts and LLM experts can’t execute a moderate number (e.g., 50) of diverse bioterror queries without having a very high probability of being caught. The red team is familiar with the high-level countermeasures but isn’t necessarily informed of non-public implementation details. We’d want to try this with several different red teams.
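Here is a loose sketch of how the monitoring ensemble, cost-saving subsetting, and human-audit selection described above could fit together. All names, thresholds, and stub monitors below are made up for illustration; this is not a description of any real deployment.

```python
import random
from dataclasses import dataclass

@dataclass
class QueryRecord:
    user_id: str
    prompt: str
    response: str

# Stub monitors standing in for the ensemble of AI checks (e.g., a weaker base
# model, an input-only check, a jailbreak detector). Each returns a suspicion
# score in [0, 1]; real versions would query classifier models.
def monitor_weak_model(record: QueryRecord) -> float:
    return 0.0  # placeholder

def monitor_input_only(record: QueryRecord) -> float:
    return 0.0  # placeholder

def monitor_jailbreak(record: QueryRecord) -> float:
    return 0.0  # placeholder

ENSEMBLE = [monitor_weak_model, monitor_input_only, monitor_jailbreak]

AUDIT_FRACTION = 0.01      # humans review roughly the 1% most suspicious queries
RANDOM_AUDIT_PROB = 0.001  # plus a small uniformly random sample per query
SUBSET_SIZE = 2            # for cost, run only a random subset of the ensemble

def suspicion_score(record: QueryRecord) -> float:
    checks = random.sample(ENSEMBLE, k=SUBSET_SIZE)
    return max(check(record) for check in checks)

def select_for_human_audit(records: list[QueryRecord]) -> list[QueryRecord]:
    scored = sorted(records, key=suspicion_score, reverse=True)
    selected = {id(r): r for r in scored[: max(1, int(AUDIT_FRACTION * len(scored)))]}
    for r in records:  # uniformly random extra audits
        if random.random() < RANDOM_AUDIT_PROB:
            selected[id(r)] = r
    return list(selected.values())

# Downstream steps (not shown): human auditors review the selected queries, and
# confirmed bioterror or jailbreak attempts lead to bans and referral, using the
# identity established via KYC.
```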
Here are some possible additional countermeasures:
We could reject queries to the bio endpoint which don’t seem like questions about biology according to our monitoring models. (Or queries which contain fully separable subcomponents which don’t seem to be about biology.) This helps avoid the failure mode in which biology information is elicited from the model without monitoring models even understanding how the query relates to biology.
If monitoring scores indicate moderately high suspicion (but not high enough suspicion for the human auditing process), we could reject the query or route the query to a dumber model which is incapable of substantially helping with bioterrorism. It might be acceptable for this rejection/routing to use a much more aggressive threshold for suspicion than used for auditing (e.g., perhaps we can only afford to have a human audit the 0.2% most suspicious queries, but we can defer/reject the 2% most suspicious). The downside of this is that it allows for some iteration against the API (though we can try to detect iteration and then ban these users).
To avoid continued harm during delays in actually conducting the human audit process, we could “pause” users which have many queries referred to auditing.
Mitigations for large-scale cybercrime
The threat model for LLM cybercrime has pretty different properties than the bioterrorism threat model. A bioterrorism attack might only require hundreds of queries to the LLM in order to cause a catastrophic outcome. In contrast, the risk associated with cybercrime is that we massively reduce the cost of labor involved in hacking. This distinction is important because there are already thousands or tens of thousands of people who are employed as hackers for pay, and so if LLMs aren’t massively superhuman, they have to contribute via lowering costs. As such, if our goal is to ensure less than $100 million in cybercrime damages, it might suffice to avoid more than 100,000 cybercrime queries (unlike the bio case where merely hundreds of queries could be unacceptable). Thus, we can prevent our model from doing unacceptable amounts of cybercrime without needing to have as high of reliability as in the bioterrorism case.
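Making the implicit per-query arithmetic explicit (this is just the ratio of the two figures above, not an independent estimate of per-query damage):

```latex
\frac{\$100\text{M damage budget}}{100{,}000\ \text{cybercrime queries}} \approx \$1{,}000\ \text{of enabled damage per query}
```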
It likely won’t be possible to avoid having our main model (as deployed on an API) be capable of cybercrime. So, we don’t have the option of creating a separate endpoint as in the bio case. Otherwise, we’ll use a somewhat similar set of countermeasures as with the bio approach discussed above. For the following list, I’m assuming that you’ve read the bio section above.
Ideally, we’d require some KYC. This might not be possible for the main API, and we’ll discuss alternatives below.
We’d have a variety of monitoring approaches as discussed in the bio case but applied to cyber. We need to keep cost pretty low, but we can still apply monitoring by weaker (and thus cheaper) models on all inputs. Then, we’d randomly sample inputs to apply more expensive parts of the monitoring ensemble as well as applying these more expensive components on inputs found to be somewhat suspicious by weaker monitoring.
We again apply human auditing on inputs selected by monitoring. Perhaps humans audit the 0.01% most suspicious-seeming inputs.
We again ban users based on this human auditing.
Due to the potential lack of KYC, we might not be able to ban users effectively (as well as missing potential criminal investigation and suing them). We might not be able to apply KYC, but we can at least estimate the total amount of cybercrime queries and guess at total damages. If likely damages reach some threshold, we can institute additional mitigations like actually requiring KYC.
Given that these mitigations have to be on our actual main API, it’s plausible that privacy concerns are a blocker. We can probably do some amount of user info redaction for auditing, but I broadly think we should just aim to set the expectation that queries made to powerful models aren’t private. This is similar to (e.g.) emails sent to contractors.
Fine-tuning?
I think that allowing end-users to fine-tune these models makes the picture notably more complicated, as it might be possible to train knowledge into the model which isn’t present in the monitoring models. There are still possible mitigations. We could also try to resolve this by only allowing large corporate contracts when fine-tuning powerful models.
- ↩︎
In addition to the disagreement I express in the rest of this comment, I’m somewhat skeptical of the best altruistic option being to focus on misuse for many people. My view here is somewhat complex, but I’ll try to quickly articulate this in a child comment.
- ↩︎
There is also a threat model where AIs are very helpful for bioterror via the mechanism of occasionally giving really good answers and thus only a small number of bioterror queries (1-30) would be dangerous. This threat model could look like having really good high-level ideas for bioterror or acting like a considerably better version of Google which points you toward the ideal online resource. I think this sort of usage probably doesn’t help that much with bioterror, but this might be a crux.
For mechanistic interpretability, very ambitious success looks something like:
Have some decomposition of the model or the behavior of the model into parts.
For any given randomly selected part, you should almost always be able to build up a very good understanding of this part in isolation.
By “very good” I mean that the understanding accounts for 90% of the bits of optimization applied to this part (where the remaining bits aren’t predictably more or less important per bit than what you’ve understood).
Roughly speaking, if your understanding accounts for 90% of the bits of optimization for an AI, then it means you should be able to construct an AI which works as well as if the original AI had been trained with only 90% of the actual training compute.
In terms of loss explained, this is probably very high, like well above 99%.
The length of the explanation of all parts is probably only up to 1000 times shorter in bits than the size of the model. So, for a 1 trillion parameter model it’s at least 100 million words or 200,000 pages (assuming 10 bits per word; see the rough arithmetic after this list). The compression comes from being able to use human concepts, but this will only get you so much.
Given your ability to explain any given part, build an overall understanding by piecing things together. This could be implicitly represented.
Be able to query understanding to answer interesting questions.
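The length estimate above roughly works out as follows. The ~1 bit of explanation-relevant content per parameter and ~500 words per page are conversion factors I’m back-filling to match the stated numbers; more bits per parameter would make the estimate larger, consistent with “at least”:

```latex
\frac{10^{12}\,\text{params} \times 1\,\text{bit/param}}{1000} = 10^{9}\,\text{bits},\qquad
\frac{10^{9}\,\text{bits}}{10\,\text{bits/word}} = 10^{8}\,\text{words},\qquad
\frac{10^{8}\,\text{words}}{500\,\text{words/page}} = 2\times 10^{5}\,\text{pages}
```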
I don’t think there is an obvious easier road for mech interp to answer questions like “is the model deceptively aligned” if you want the approach to compete with much simpler high level and top down interpretability.
Examples:
It seems to conflate scaling pauses (which aren’t clearly very useful) with pausing all AI related progress (hardware, algorithmic development, software). Many people think that scaling pauses aren’t clearly that useful due to overhang issues, but hardware pauses are pretty great. However, hardware development and production pauses would clearly be extremely difficult to implement. IMO the sufficient pause AI ask is more like “ask nvidia/tsmc/etc to mostly shut down” rather than “ask AGI labs to pause”.
More generally, the exact type of pause which would actually be better than (e.g.) well implemented RSPs is a non-trivial technical problem which makes this complex to communicate. I think this is a major reason why people don’t say stuff like “obviously, a full pause with XYZ characteristics would be better”. For instance, if I was running the US, I’d probably slow down scaling considerably, but I’d mostly be interested in implementing safety standards similar to RSPs due to lack of strong international coordination.
The post says “many people believe” a “pause is necessary” claim[1], but the exact claim you state probably isn’t actually believed by the people you cite below without additional complications. Like what exact counterfactuals are you comparing? For instance, I think that well implemented RSPs required by a regulatory agency can reduce risk to <5% (partially by stopping in worlds where this appears needed). So as an example, I don’t believe a scaling pause is necessary and other interventions would probably reduce risk more (while also probably being politically easier). And, I think a naive “AI scaling pause” doesn’t reduce risk that much, certainly less than a high quality US regulatory agency which requires and reviews something like RSPs. When claiming “many people believe”, I think you should make a more precise claim that the people you name actually believe.
Calling something a “pragmatic middle ground” doesn’t imply that there aren’t better options (e.g., shut down the whole hardware industry).
For instance, I don’t think it’s “lying” when people advocate for partial reductions in nuclear arms without noting that it would be better to secure sufficient international coordination to guarantee world peace. Like world peace would be great, but idk if it’s necessary to talk about. (There is probably less common knowledge in the AI case, but I think this example mostly holds.)
This post says “When called out on this, most people we talk to just fumble.”. I strongly predict that the people actually mentioned in the part above this (Open Phil, Paul, ARC evals, etc) don’t actually fumble and have a reasonable response. So, I think this misleadingly conflates the responses of two different groups at best.
More generally, this post seems to claim people have views that I don’t actually think they have and assumes the motives for various actions are powerseeking without any evidence for this.
The use of the term lying seems like a case of “noncentral fallacy” to me. The post presupposes a communication/advocacy norm and states violations of this norm should be labeled “lying”. I’m not sure I’m sold on this communication norm in the first place. (Edit: I think “say the ideal thing” shouldn’t be a norm (something where we punish people who violate this), but it does seem probably good in many cases to state the ideal policy.)
The exact text from the post is: