Rohin Shah
Research Scientist at DeepMind. Creator of the Alignment Newsletter. http://rohinshah.com/
I expect that, absent impressive levels of international coordination, we’re screwed.
This is the sort of thing that makes it hard for me to distinguish your argument from “[regardless of the technical work you do] there will always be some existentially risky failures left, so if we proceed we will get doom. Therefore, we should avoid solving some failures, because those failures could help build political will to shut it all down”.
I agree that, conditional on believing we’re screwed absent huge levels of coordination regardless of technical work, a lot of technical work (including debate) looks net negative by reducing the will to coordinate.
What kinds of people are making/influencing key decisions in worlds where we’re likely to survive?
[...]
I don’t think conditioning on the status-quo free-for-all makes sense, since I don’t think that’s a world where our actions have much influence on our odds of success.
Similarly this only makes sense under a view where technical work can’t have much impact on p(doom) by itself, aka “regardless of technical work we’re screwed”. Otherwise even in a “free-for-all” world, our actions do influence odds of success, because you can do technical work that people use, and that reduces p(doom).
I’m only keen on specifications that plausibly give real guarantees: level 6(?) or 7. I’m only keen on the framework conditional on meeting an extremely high bar for the specification.
Oh, my probability on level 6 or level 7 specifications becoming the default in AI is dominated by my probability that I’m somehow misunderstanding what they’re supposed to be. (A level 7 spec for AGI seems impossible even in theory, e.g. because it requires solving the halting problem.)
If we ignore the misunderstanding part then I’m at << 1% probability on “we build transformative AI using GSA with level 6 / level 7 specifications in the nearish future”.
(I could imagine a pause on frontier AI R&D, except that you are allowed to proceed if you have level 6 / level 7 specifications; and those specifications are used in a few narrow domains. My probability on that is similar to my probability on a pause.)
Not going to respond to everything, sorry, but a few notes:
It fits the pattern of [lower perceived risk] --> [actions that increase risk].
My claim is that for the things you call “actions that increase risk” that I call “opportunity cost”, this causal arrow is very weak, and so you shouldn’t think of it as risk compensation.
E.g. presumably if you believe in this causal arrow you should also believe [higher perceived risk] --> [actions that decrease risk]. But if all building-safe-AI work were to stop today, I think this would have very little effect on how fast the world pushes forward with capabilities.
However, I think people are too ready to fall back on the best reference classes they can find—even when they’re terrible.
I agree that reference classes are often terrible and a poor guide to the future, but often first-principles reasoning is worse (related: 1, 2).
I also don’t really understand the argument in your spoiler box. You’ve listed a bunch of claims about AI, but haven’t spelled out why they should make us expect large risk compensation effects, which I thought was the relevant question.
Quantify “it isn’t especially realistic”—are we talking [15% chance with great effort], or [1% chance with great effort]?
It depends hugely on the specific stronger safety measure you talk about. E.g. I’d be at < 5% on a complete ban on frontier AI R&D (which includes academic research on the topic). Probably I should be < 1%, but I’m hesitant around such small probabilities on any social claim.
For things like GSA and ARC’s work, there isn’t a sufficiently precise claim for me to put a probability on.
Is [because we have a bunch of work on weak measures] not a big factor in your view? Or is [isn’t especially realistic] overdetermined, with [less work on weak measures] only helping conditional on removal of other obstacles?
Not a big factor. (I guess it matters that instruction tuning and RLHF exist, but something like that was always going to happen, the question was when.)
This characterization is a little confusing to me: all of these approaches (ARC / Guaranteed Safe AI / Debate) involve identifying problems, and, if possible, solving/mitigating them.
To the extent that the problems can be solved, then the approach contributes to [building safe AI systems];
Hmm, then I don’t understand why you like GSA more than debate, given that debate can fit in the GSA framework (it would be a level 2 specification by the definitions in the paper). You might think that GSA would uncover problems in debate, if they exist, when debate is used as the specification; but if anything that seems to me less likely to happen with GSA, since in a GSA approach the specification is treated as infallible.
Main points:
- There’ll always be some risk of existential failure.
- I am saying “we might get doom”
- I am not saying “we should not do safety work”
I’m on board with these.
I’m saying “risk compensation needs to be a large factor in deciding which safety work to do”
I still don’t see why you believe this. Do you agree that in many other safety fields, safety work mostly didn’t think about risk compensation, and still drove down absolute risk? (E.g. I haven’t looked into it but I bet people didn’t spend a bunch of time thinking about risk compensation when deciding whether to include seat belts in cars.)
If you do agree with that, what makes AI different from those cases? (The arguments you give seem like very general considerations that apply to other fields as well.)
Possibly many researchers do this, but don’t have any clean, legible way to express their process/conclusions.
I don’t get that impression: my impression is that arguments along the lines I’m making tend to be perceived as fully general counter-arguments and dismissed (whether they come from outside, or from the researchers themselves).
I’d say that the risk compensation argument as given here Proves Too Much and implies that most safety work in most previous fields was net negative, which seems clearly wrong to me. It’s true that as a result I don’t spend lots of time thinking about risk compensation; that still seems correct to me.
It might be viable to re-imagine risk-management such that this is handled.
It seems like your argument here, and in other parts of your comment, is something like “we could do this more costly thing that increases safety even more”. This seems like a pretty different argument; it’s not about risk compensation (i.e. when you introduce safety measures, people do more risky things), but rather about opportunity cost (i.e. when you introduce weak safety measures, you reduce the will to have stronger safety measures). This is fine, but I want to note the explicit change in argument; my earlier comment and the discussion above was not trying to address this argument.
Briefly on opportunity cost arguments, the key factors are (a) how much will is there to pay large costs for safety, (b) how much time remains to do the necessary research and implement it, and (c) how feasible is the stronger safety measure. I am actually more optimistic about both (a) and (b) than what I perceive to be the common consensus amongst safety researchers at AGI labs, but tend to be pretty pessimistic about (c) (at least relative to many LessWrongers, I’m not sure how it compares to safety researchers at AGI labs).
Anyway for now let’s just say that I’ve thought about these three factors and think it isn’t especially realistic to expect that we can get stronger safety measures, and as a result I don’t see opportunity cost as a big reason not to do the safety work we currently do.
I guess I’d want to reframe it as “This is a better process by which to build future powerful AI systems”, so as to avoid baking in a level of concreteness before looking at the problem.
Yeah, I’m not willing to do this. This seems like an instance of the opportunity cost argument, where you try to move to a paradigm that can enable stronger safety measures. See above for my response.
Similarly, the theory of change you cite for your examples seems to be “discovers or clarifies problems that show that we don’t have a solution” (including for Guaranteed Safe AI and ARC theory, even though in principle those could be about building safe AI systems). So as far as I can tell, the disagreement is really that you think current work that tries to provide a specific recipe for building safe AI systems is net negative, and I think it is net positive.
Other smaller points:
See also Critch’s thoughts on the need for social models when estimating impact.
I certainly agree that it is possible for risk compensation to make safety work net negative, which is all I think you can conclude from that post (indeed the post goes out of its way to say it isn’t arguing for or against any particular work). I disagree that this effect is large enough to meaningfully change the decisions on what work we should do, given the specific work that we typically do (including debate).
if we set our evidential standards such that we don’t focus on vague/speculative/indirect/conceptual arguments
This is a weird hypothetical. The entire field of AI existential safety is focused on speculative, conceptual arguments. (I’m not quite sure what the standard is for “vague” and “indirect” but probably I’d include those adjectives too.)
I think [resolving uncertainty is] an important issue to notice when considering research directions
Why? It doesn’t seem especially action guiding, if we’ve agreed that it’s not high value to try to resolve the uncertainty (which is what I take away from your (1)).
Maybe you’re saying “for various reasons (e.g. unilateralist curse, wisdom of the crowds, coordinating with people with similar goals), you should treat the probability that your work is net negative as higher than you would independently assess, which can affect your prioritization”. I agree that there’s some effect here but my assessment is that it’s pretty small and doesn’t end up changing decisions very much.
I like both of the theories of change you listed, though for (1) I usually think about scaling till human obsolescence rather than superintelligence.
(Though imo this broad class of schemes plausibly scales to superintelligence if you eventually hand off the judge role to powerful AI systems. Though I expect we’ll be able to reduce risk further in the future with more research.)
I note here that this isn’t a fully-general counterargument, but rather a general consideration.
I don’t see why this isn’t a fully general counterargument to alignment work. Your argument sounds to me like “there will always be some existentially risky failures left, so if we proceed we will get doom. Therefore, we should avoid solving some failures, because those failures could help build political will to shut it all down”.
And it does seem to me like you have to be saying “we will get doom”, not “we might get doom”. If it were the latter then the obvious positive case is that by removing some of the failures we reduce p(doom). It could still turn out negative due to risk compensation, but I’d at least expect you to give some argument for expecting that (it seems like the prior on “due to risk compensation we should not do safety work” should be pretty low).
What’s an example of alignment work that you think is net positive with the theory of change “this is a better way to build future powerful AI systems”?
(I’m probably not going to engage with perspectives that say all current [alignment work towards building safer future powerful AI systems] is net negative, sorry. In my experience those discussions typically don’t go anywhere useful.)
On scalable oversight with weak LLMs judging strong LLMs
I agree Eliezer’s writing often causes people to believe incorrect things and there are many aspects of his discourse that I wish he’d change, including some of the ones you highlight. I just want to push back on the specific critique of “there are no coherence theorems”.
(In fact, I made this post because I too previously believed incorrect things along these lines, and those incorrect beliefs were probably downstream of arguments made by Eliezer or MIRI, though it’s hard to say exactly what the influences were.)
“nevertheless, many important and influential people in the AI safety community have mistakenly and repeatedly promoted the idea that there are such theorems.”
I responded on the EA Forum version, and my understanding was written up in this comment.
TL;DR: EJT and I both agree that the “mistake” EJT is talking about is that when providing an informal English description of various theorems, the important and influential people did not state all the antecedents of the theorems.
Unlike EJT, I think this is totally fine as a discourse norm, and should not be considered a “mistake”. I also think the title “there are no coherence theorems” is hyperbolic and misleading, even though it is true for a specific silly definition of “coherence theorem”.
Fwiw I’m also skeptical of how much we can conclude from these evals, though I think they’re way above the bar for “worthwhile to report”.
Another threat model you could care about (within persuasion) is targeted recruitment for violent ideologies. With that one too it’s plausible you’d want a more targeted eval, though I think simplicity, generality, and low cost are also reasonable things to optimize for in evals.
Good point; this makes it clearer that “deployment” means external deployment by default. But level 2 only mentions “internal access of the critical capability,” which sounds like it’s about misuse — I’m more worried about AI scheming and escaping when the lab uses AIs internally to do AI development.
You’re right: our deployment mitigations are targeted only at misuse, because our current framework focuses on misuse. As we note in the “Future work” section, we would need to do more work to address risks from misaligned AI. We focused on risks from deliberate misuse initially because they seemed more likely to us to appear first.
E.g., much more of the action is in deciding exactly who to influence and what to influence them to do.
Are you thinking specifically of exfiltration here?
Persuasion can be used for all sorts of things if you are considering both misuse and misalignment, so if you are considering a specific threat model, I expect my response will be “sure, but there are other threat models where the ‘who’ and ‘what’ can be done by humans”.
Thanks for the detailed critique – I love that you actually read the document in detail. A few responses on particular points:
The document doesn’t specify whether “deployment” includes internal deployment.
Unless otherwise stated, “deployment” to us means external deployment – because this is the way most AI researchers use the term. Deployment mitigations level 2 discusses the need for mitigations on internal deployments. ML R&D will require thinking about internal deployments (and so will many of the other CCLs).
Some people get unilateral access to weights until the top level. This is disappointing. It’s been almost a year since Anthropic said it was implementing two-party control, where nobody can unilaterally access the weights.
I don’t think Anthropic meant to claim that two-party control would achieve this property. I expect anyone using a cloud compute provider is trusting that the provider will not access the model, not securing it against such unauthorized access. (In principle some cryptographic schemes could allow you to secure model weights even from your cloud compute provider, but I highly doubt people are doing that, since it is very expensive.)
Mostly they discuss developers’ access to the weights. This is disappointing. It’s important but lots of other stuff is important too.
The emphasis on weights access isn’t meant to imply that other kinds of mitigations don’t matter. We focused on what it would take to increase our protection against exfiltration. A lot of the example measures discussed in the RAND interim report aren’t discussed because we already do them. For example, Google already does the following from RAND Level 3: (a) develop an insider threat program and (b) deploy advanced red-teaming. (That’s not meant to be exhaustive, I don’t personally know the details here.)
No mention of evals during deployment (to account for improvements in scaffolding, prompting, etc.).
Sorry, that’s just poor wording on our part—“every 3 months of fine-tuning progress” was meant to capture that as well. Thanks for pointing this out!
Talking about plans like this is helpful. But with no commitments, DeepMind shouldn’t get much credit.
With the FSF, we prefer to try it out for a while and iron out any issues, particularly since the science is in early stages, and best practices will need to evolve as we learn more. But as you say, we are running evals even without official FSF commitments, e.g. the Gemini 1.5 tech report has dangerous capability evaluation results (see Section 9.5.2).
Given recent updates in AGI safety overall, I’m happy that GDM and Google leadership take commitments seriously, and think carefully about which ones they are and are not willing to make, including the FSF, the White House Commitments, etc.
It’s interesting to look back at this question 4 years later; I think it’s a great example of the difficulty of choosing the right question to forecast in the first place.
I think it is still pretty unlikely that the criterion I outlined is met—Q2 on my survey still seems like a bottleneck. I doubt that AGI researchers would talk about instrumental convergence in the kind of conversation I outlined. But reading the motivation for the question, it sure seems like a question that reflected the motivation well would have resolved yes by now (probably some time in 2023), given the current state of discourse and the progress in the AI governance space. (Though you could argue that the governance space is still primarily focused on misuse rather than misalignment.)
I did quite deliberately include Q2 in my planned survey—I think it’s important that the people whom governments defer to in crafting policy understand the concerns, rather than simply voicing support. But I failed to notice that it is quite plausible (indeed, the default) for there to be a relatively small number of experts that understand the concerns in enough depth to produce good advice on policy, plus a large base of “voicing support” from other experts who don’t have that same deep understanding. This means that it’s very plausible that the fraction defined in the question never gets anywhere close to 0.5, but nonetheless the AI community “agrees on the risk” to a sufficient degree that governance efforts do end up in a good place.
Because I don’t think this is realistically useful, I don’t think this at all reduces my probability that your techniques are fake and your models of interpretability are wrong.
Maybe the groundedness you’re talking about comes from the fact that you’re doing interp on a domain of practical importance?
??? Come on, there’s clearly a difference between “we can find an Arabic feature when we go looking for anything interpretable” vs “we chose from the relatively small set of practically important things and succeeded in doing something interesting in that domain”. I definitely agree this isn’t yet close to “doing something useful, beyond what well-tuned baselines can do”. But this should presumably rule out some hypotheses that current interpretability results are due to an extreme streetlight effect?
(I suppose you could have already been 100% confident that results so far weren’t the result of extreme streetlight effect and so you didn’t update, but imo that would just make you overconfident in how good current mech interp is.)
(I’m basically saying similar things as Lawrence.)
Sounds plausible, but why does this differentially impact the generalizing algorithm over the memorizing algorithm?
Perhaps under normal circumstances both are learned so fast that you just don’t notice that one is slower than the other, and this slows both of them down enough that you can see the difference?
Daniel Filan: But I would’ve guessed that there wouldn’t be a significant complexity difference between the frequencies. I guess there’s a complexity difference in how many frequencies you use.
Vikrant Varma: Yes. That’s one of the differences: how many you use and their relative strength and so on. Yeah, I’m not really sure. I think this is a question we pick out as a thing we would like to see future work on.
My pet hypothesis here is that (a) by default, the network uses whichever frequencies were highest at initialization (for which there is significant circumstantial evidence) and (b) the amount of interference differs significantly based on which frequencies you use (which in turn changes the quality of the logits holding parameter norm fixed, and thus changes efficiency).
In principle this can be tested by randomly sampling frequency sets, simulating the level of interference you’d get, using that to estimate the efficiency + critical dataset size for that grokking circuit. This gives you a predicted distribution over critical dataset sizes, which you could compare against the actual distribution.
Tbc there are other hypotheses too, e.g. perhaps different frequency sets are easier / harder to implement by the neural network architecture.
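Here's a rough sketch (my own construction, not from the discussion above) of what the sampling test proposed two paragraphs up could look like. Everything concrete in it is a placeholder assumption for illustration: the generalizing circuit's logits for modular addition mod p are modeled as a sum of cosines over the sampled frequency set, "interference" is proxied by the best incorrect logit, and critical dataset size is taken to be inversely related to the resulting margin.

```python
import numpy as np

p = 113            # modulus, as in common grokking setups (assumed)
k = 5              # number of frequencies used by the circuit (assumed)
n_samples = 1000   # number of random frequency sets to sample
rng = np.random.default_rng(0)

def logit_margin(freqs: np.ndarray) -> float:
    """Margin between the correct logit and the best incorrect logit.

    Model the circuit's logit for answer c as sum_w cos(2*pi*w*(c - (a+b))/p);
    the correct answer (offset 0) then gets len(freqs), and incorrect answers
    get an interference-dependent value.
    """
    offsets = np.arange(1, p)                          # nonzero offsets c - (a+b)
    angles = 2 * np.pi * np.outer(freqs, offsets) / p  # shape (k, p-1)
    incorrect_logits = np.cos(angles).sum(axis=0)      # interference pattern
    return len(freqs) - incorrect_logits.max()

margins = np.array([
    logit_margin(rng.choice(np.arange(1, (p - 1) // 2 + 1), size=k, replace=False))
    for _ in range(n_samples)
])

# Map the margin proxy to a predicted critical dataset size: higher margin per
# unit parameter norm -> more efficient circuit -> smaller critical dataset
# size. The inverse relationship here is a placeholder, not a derived law.
predicted_D_crit = 1.0 / margins
print("Predicted critical dataset size quantiles (arbitrary units):",
      np.quantile(predicted_D_crit, [0.1, 0.5, 0.9]))
```

The spread of `predicted_D_crit` across sampled frequency sets is the kind of predicted distribution that could then be compared against the empirically observed distribution of critical dataset sizes.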
This suggestion seems less expressive than (but similar in spirit to) the “rescale & shift” baseline we compare to in Figure 9. The rescale & shift baseline is sufficient to resolve shrinkage, but it doesn’t capture all the benefits of Gated SAEs.
The core point is that L1 regularization adds lots of biases, of which shrinkage is just one example, so you want to localize the effect of L1 as much as possible. In our setup L1 applies to ReLU(π_gate(x)), so you might think of π_gate(x) as “tainted”, and want to use it as little as possible. The only thing you really need L1 for is to deter the model from setting too many features active, i.e. you need it to apply to one bit per feature (whether that feature is on / off). The Heaviside step function makes sure we are extracting just that one bit, and relying on f_mag(x) for everything else.
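To make the “one bit through the tainted path” point concrete, here's a minimal PyTorch-style sketch of the gating setup as I'd write it down. Variable names (W_gate, pi_gate, f_mag, etc.) are my own shorthand, and details from the paper (e.g. weight tying between the gate and magnitude paths, exact normalization) are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedSAESketch(nn.Module):
    """Sketch of a gated SAE: L1 touches only the gate path's pre-activations."""

    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_gate = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.W_mag = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_gate = nn.Parameter(torch.zeros(d_sae))
        self.b_mag = nn.Parameter(torch.zeros(d_sae))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor, l1_coeff: float = 1e-3):
        x_centered = x - self.b_dec
        pi_gate = x_centered @ self.W_gate.T + self.b_gate      # gate pre-activations
        f_mag = F.relu(x_centered @ self.W_mag.T + self.b_mag)  # magnitude path
        gate = (pi_gate > 0).float()       # Heaviside: only the "is it on?" bit
        f = gate * f_mag                   # feature activations
        x_hat = f @ self.W_dec.T + self.b_dec

        recon_loss = (x - x_hat).pow(2).sum(-1).mean()
        # L1 applies only to ReLU(pi_gate): the "tainted" quantity decides which
        # features are on, and is used for nothing else.
        sparsity_loss = F.relu(pi_gate).sum(-1).mean()
        # Auxiliary reconstruction through a frozen decoder gives the gate path
        # a useful gradient (the Heaviside blocks gradients from recon_loss).
        x_hat_aux = F.relu(pi_gate) @ self.W_dec.detach().T + self.b_dec.detach()
        aux_loss = (x - x_hat_aux).pow(2).sum(-1).mean()
        return x_hat, recon_loss + l1_coeff * sparsity_loss + aux_loss
```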
Thinking on this a bit more, this might actually reflect a general issue with the way we think about feature shrinkage; namely, that whenever there is a nonzero angle between two vectors of the same length, the best way to make either vector close to the other will be by shrinking it.
This was actually the key motivation for building this metric in the first place, instead of just looking at the ratio ‖x̂(x)‖ / ‖x‖. Looking at the rescaling factor γ that would optimize the reconstruction loss ensures that we’re capturing only bias from the L1 regularization, and not capturing the “inherent” need to shrink the vector given these nonzero angles. (In particular, if we computed ‖x̂(x)‖ / ‖x‖ for Gated SAEs, I expect that would be below 1.)
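For concreteness, here's the optimization I understand the metric to be based on (a sketch; the paper's convention may define its bias measure as this quantity or its reciprocal):

```latex
% Optimal rescaling of the reconstruction \hat{x}(x), holding its direction fixed:
\gamma^{*}
  \;=\; \arg\min_{\gamma}\; \mathbb{E}_{x}\,\bigl\lVert \gamma\,\hat{x}(x) - x \bigr\rVert_2^2
  \;=\; \frac{\mathbb{E}_{x}\,\langle \hat{x}(x),\, x\rangle}{\mathbb{E}_{x}\,\lVert \hat{x}(x)\rVert_2^2}
```

Shrinkage induced by L1 then shows up as γ* > 1 (the reconstruction has to be scaled up to best match the input), whereas the raw norm ratio ‖x̂(x)‖ / ‖x‖ can sit below 1 even for an unbiased SAE, because of the nonzero-angle effect described above.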
I think the main thing we got wrong is that we accidentally treated γ as though it were 1/γ. To the extent that was the main mistake, I think it explains why our results still look how we expected them to—usually γ is going to be close to 1 (and should be almost exactly 1 if shrinkage is solved), so in practice the error introduced from this mistake is going to be extremely small.
We’re going to take a closer look at this tomorrow, check everything more carefully, and post an update after doing that. I think it’s probably worth waiting for that—I expect we’ll provide much more detailed derivations that make everything a lot clearer.
Possibly I’m missing something, but if you don’t have the auxiliary loss, then the only gradients to W_gate and b_gate come from the sparsity loss (the binarizing Heaviside activation function kills gradients from the reconstruction loss), and so π_gate(x) would be always non-positive to get perfect zero sparsity loss. (That is, if you only optimize for L1 sparsity, the obvious solution is “none of the features are active”.)
(You could use a smooth activation function as the gate, e.g. an element-wise sigmoid, and then you could just stick with the loss from the beginning of Section 3.2.2.)
Okay, I think it’s pretty clear that the crux between us is basically what I was gesturing at in my first comment, even if there are minor caveats that make it not exactly literally that.