A lot of my thinking over the last few months has shifted from “how do we get some sort of AI pause in place?” to “how do we win the peace?”. That is, you could have a picture of AGI as the most important problem, the one that precedes all other problems: anti-aging research is important, but it might actually be faster to build an aligned artificial scientist who solves it for you than to solve it yourself (on this general argument, see Artificial Intelligence as a Positive and Negative Factor in Global Risk). But if alignment requires a thirty-year pause on the creation of artificial scientists in order to work, that belief flips: now it actually makes sense to go ahead with humans researching the biology of aging, and to do projects like Loyal.
This isn’t true of just aging; there are probably something more like twelve major areas of concern. Some of them are simply predictable catastrophes we would like to avert; for others, solving them may be necessary to safely exit the pause at all (or to keep the pause going when it would be unsafe to exit).
I think ‘solutionism’ is basically the right path, here. What I’m interested in: what’s the foundation for solutionism, or what support does it need? Why is solutionism not already the dominant view? I think one of the things I found most exciting about SENS was the sense that “someone had done the work”, had actually identified the list of seven problems, and had a plan of how to address all of the problems. Even if those specific plans didn’t pan out, the superstructure was there and the ability to pivot was there. It looked like a serious approach by serious people. What is the superstructure for solutionism such that one can be reasonably confident that marginal efforts are actually contributing to success, instead of bailing water on the Titanic?
> Hearing, on my way out the door, when I’m exhausted beyond all measure and feeling deeply alienated and betrayed, “man, you should really consider sticking around” is upsetting.
This is not how I read Seth Herd’s comment; I read him as saying “aw, I’ll miss you, but not enough to follow you to Substack.” This is simultaneously support for you staying on LW and for the mods to reach an accommodation with you, intended as information for you to do what you will with it.
I think the rest of this—being upset about what you think is the frame of that comment—feels like the conflict in miniature? I’m not sure I have much that’s helpful to say there.
> My understanding is that their commitment is to stop once their ASL-3 evals are triggered.
Ok, we agree. By “beyond ASL-3” I thought you meant “stuff that’s outside the category ASL-3” instead of “the first thing inside the category ASL-3”.
> For the Anthropic RSP in particular, I think it’s accurate & helpful to say
Yep, that summary seems right to me. (I also think the “concrete commitments” statement is accurate.)
> But I want to see RSP advocates engage more with the burden of proof concerns.
Yeah, I also think putting the burden of proof on scaling (instead of on pausing) is safer and probably appropriate. I am hesitant about it on process grounds; it seems to me like evidence of safety might require the scaling that we’re not allowing until we see evidence of safety. On net, it seems like the right decision on the current margin but the same lock-in concerns (if we do the right thing now for the wrong reasons perhaps we will do the wrong thing for the same reasons in the future) worry me about simply switching the burden of proof (instead of coming up with a better system to evaluate risk).
> I got the impression that Anthropic wants to do the following things before it scales beyond ASL-3:
Did you mean ASL-2 here? This seems like a pretty important detail to get right. (What they would need to do to scale beyond ASL-3 is meet the standard of an ASL-4 lab, which they have not developed yet.)
> I agree with Habryka that these don’t seem likely to cause Anthropic to stop scaling:
By design, RSPs are conditional pauses; you pause until you have met the standard, and then you continue. If you get the standard in place soon enough, you don’t need to pause at all. This incentivizes implementing the security and safety procedures as soon as possible, which seems good to me.
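(Here is a toy sketch of that conditional-pause loop. The level numbers, eval function, and upgrade step below are invented placeholders of mine, not anything from the RSP itself.)

```python
# Toy sketch of a "conditional pause": scaling halts whenever the model's
# evaluated risk level exceeds the level the lab currently meets, and it
# resumes once the lab meets the new standard. All names and numbers are
# hypothetical placeholders, not Anthropic's actual evals or standards.

def run_conditional_pause(scale_steps, eval_level, initial_lab_level):
    lab_level = initial_lab_level
    log = []
    for step in scale_steps:
        required = eval_level(step)      # risk level this model merits
        if required > lab_level:
            log.append((step, "pause"))  # stop scaling...
            lab_level = required         # ...until the lab meets the bar
            log.append((step, "resume"))
        log.append((step, "scale"))
    return log

# Hypothetical usage: levels 2 and 3 stand in for ASL-2 / ASL-3.
print(run_conditional_pause(
    scale_steps=[1, 2, 3, 4],
    eval_level=lambda s: 2 if s < 3 else 3,  # step 3 first merits "ASL-3"
    initial_lab_level=2,
))
# The pause happens exactly once, at the ASL-2 -> ASL-3 boundary; if the
# lab meets the standard before that step, no pause is needed at all.
```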
> But the RSP does not commit Anthropic to having any particular containment measures or any particular evidence that it is safe to scale to ASL-4 – it only commits Anthropic to publish a post about ASL-4 systems. This is why I don’t consider the ASL-4 section to be a concrete commitment.
Yes, I agree that the ASL-4 part is an IOU, and I predict that when they eventually publish it there will be controversy over whether or not they got it right. (Ideally, by then we’ll have a consensus framework and independent body that develops those standards, which Anthropic will just sign on to.)
Again, this is by design; the underlying belief of the RSP is that we can only see so far ahead thru the fog, and so we should set our guidelines bit-by-bit, rather than pausing until we can see our way all the way to an aligned sovereign.
> Are you thinking that psychology-focused AI would notice the existence of their operators sooner than non-psychology AI? Or is it more about influence AI that people deliberately point at themselves instead of others?
I am mostly thinking about the former; I am worried that psychology-focused AI will develop more advanced theory of mind and be able to hide going rogue from operators/users more effectively, develop situational awareness more quickly, and so on.
> I currently predict that the AI safety community is best off picking its battles and should not try to interfere with technologies that are as directly critical to national security as psychology AI is;
My view is that the AI takeover problem is fundamentally a ‘security’ problem. Building a robot army/police force has lots of benefits (I prefer it to a human one in many ways) but it means it’s that much easier for a rogue AI to seize control; a counter-terrorism AI can also be used against domestic opponents (including ones worried about the AI), and so on. I think jumping the gun on these sorts of things is more dangerous than jumping the gun on non-security uses (yes, you could use a fleet of self-driving cars to help you in a takeover, but it’d be much harder than with a fleet of self-driving missile platforms).
FWIW I read Anthropic’s RSP and came away with the sense that they would stop scaling if their evals suggested that a model being trained either registered as ASL-3 or was likely to (if they scaled it further). They would then restart scaling once they 1) had a definition of the ASL-4 model standard and lab standard and 2) met the standard of an ASL-3 lab.
Do you not think that? Why not?
(I’m Matthew Gray)
> Inflection is a late addition to the list, so Matt and I won’t be reviewing their AI Safety Policy here.
My sense from reading Inflection’s response now is that they say the right things about red teaming and security and so on, but I am pretty worried about their basic plan / they don’t seem to be grappling with the risks specific to their approach at all. Quoting from them in two different sections:
> Inflection’s mission is to build a personal artificial intelligence (AI) for everyone. That means an AI that is a trusted partner: an advisor, companion, teacher, coach, and assistant rolled into one.
> Internally, Inflection believes that personal AIs can serve as empathetic companions that help people grow intellectually and emotionally over a period of years or even decades. **Doing this well requires an understanding of the opportunities and risks that is grounded in long-standing research in the fields of psychology and sociology.** We are presently building our internal research team on these issues, and will be releasing our research on these topics as we enter 2024.
I think AIs thinking specifically about human psychology—and how to convince people to change their thoughts and behaviors—are very dual use (i.e. can be used for both positive and negative ends) and at high risk for evading oversight and going rogue. The potential for deceptive alignment seems quite high, and if Inflection is planning on doing any research on those risks or mitigation efforts specific to that, it doesn’t seem to have shown up in their response.
> I don’t think this type of AI is very useful for closing the acute risk window, and so probably shouldn’t be made until much later.
What’s the probability associated with that “should”? The higher it is, the less of a concern this is, but I don’t think it’s high enough to write the point off. (Separately, agreed that in order for danger warnings to be useful, they also have to be good at evaluating the impact of mitigations unless they’re used to halt work entirely.)
I don’t think safety buffers are a good solution; I think they’re helpful but there will still always be a transition point between ASL-2 models and ASL-3 models, and I think it’s safer to have that transition in an ASL-3 lab than an ASL-2 lab. Realistically, I think we’re going to end up in a situation where, for example, Anthropic researchers put a 10% chance on the next 4x scaling leading to evals declaring a model ASL-3, and it’s not obvious what decision they will (or should) make in that case. Is 10% low enough to proceed, and what are the costs of being ‘early’?
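(One toy way to frame that decision: compare the expected cost of scaling first against the cost of upgrading the lab first. Every number below is invented purely for illustration, not an estimate of the actual costs.)

```python
# Toy expected-cost framing of "is 10% low enough to proceed?".
# All numbers are invented for illustration only.

p_asl3 = 0.10              # chance the next 4x scale-up merits ASL-3
cost_early_upgrade = 1.0   # cost of meeting the ASL-3 lab standard "early"
cost_unprepared = 20.0     # cost of hitting ASL-3 while still an ASL-2 lab

scale_first = p_asl3 * cost_unprepared  # expected cost of proceeding: 2.0
upgrade_first = cost_early_upgrade      # cost of upgrading first: 1.0

print(f"scale first: {scale_first}, upgrade first: {upgrade_first}")
# With these made-up numbers, upgrading first is cheaper even at p = 10%;
# the conclusion flips whenever p * cost_unprepared < cost_early_upgrade.
```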
The relevant section of the RSP:
> Note that ASLs are defined by risk relative to baseline, excluding other advanced AI systems. This means that a model that initially merits ASL-3 containment and deployment measures for national security reasons might later be reduced to ASL-2 if defenses against national security risks (such as biological or cyber defenses) advance, or if dangerous information becomes more widely available. However, to avoid a “race to the bottom”, the latter should not include the effects of other companies’ language models; just because other language models pose a catastrophic risk does not mean it is acceptable for ours to.
I think it’s sensible to reduce models to ASL-2 if defenses against the threat become available (in the same way that it makes sense to demote pathogens from BSL-4 to BSL-3 once treatments become available), but I’m concerned about the “dangerous information becomes more widely available” clause. Suppose you currently can’t get slaughterbot schematics off Google; if those become available, I am not sure it then becomes ok for models to provide users with slaughterbot schematics. (Specifically, I don’t want companies that make models which are ‘safe’ except they leak dangerous information X to have an incentive to cause dangerous information X to become available thru other means.)
[There’s a related, slightly more subtle point here; supposing you can currently get instructions on how to make a pipe bomb on Google, it can actually reduce security for Claude to explain to users how to make pipe bombs if Google is recording those searches and supplying information to law enforcement / the high-ranked sites on Google search are honeypot sites and Anthropic is not. The baseline is not just “is the information available?” but “who is noticing you accessing the information?”.]
4. I mean, superior alternatives are always preferred. I am moderately optimistic about “just stop” plans, and am not yet convinced that “scale until our tests tell us to stop” is dramatically superior to “stop now.”
(Like, I think the hope here is to have an AI summer while we develop alignment methods / other ways to make humanity more prepared for advanced AI; it is not clear to me that doing that with the just-below-ASL-3 model is all that much better than doing it with the ASL-2 models we have today.)
> At minimum, I hope that RSPs get renamed, and that those communicating about RSPs are more careful to avoid giving off the impression that RSPs are sufficient.
OpenAI’s RDP name seems nicer than the RSP name, for roughly the reason they explain in their AI summit proposal (and also ‘risk-informed’ feels more honest than ‘responsible’):
> We refer to our policy as a Risk-Informed Development Policy rather than a Responsible Scaling Policy because we can experience dramatic increases in capability without significant increase in scale, e.g., via algorithmic improvements.
> Nobody has really done any amount of retroactive funding
Wasn’t this a retroactive funding thing?
> So my view is that it is the decision-makers currently imagining that the poisoned banana will grant them increased wealth & power who need their minds changed.
My current sense is that efforts to reach the poisoned banana are mostly not driven by politicians. It’s not like Joe Biden or Xi Jinping are pushing for AGI, and even Putin’s comments on AI look like near-term surveillance / military stuff, not automated science and engineering.
> unless some catastrophic but survivable casus belli happens to wake the population up
Popular support is already >70% for stopping development of AI. Why think that’s not enough, and that populations aren’t already awake?
> AlexNet dates back to 2012, I don’t think previous work on AI can be compared to modern statistical AI.
When were convnets invented, again? How about backpropagation?
> In the Manhattan project, there was no disagreement between the physicists, the politicians / generals, and the actual laborers who built the bomb, on what they wanted the bomb to do.
In that they wanted the bomb to explode? I think the analogous level of control for AI would be unsatisfactory.
> they did so voluntarily and knowing they wouldn’t be the ones who got any say in whether and how it would be used.
I’m not sure they thought this; I think many expected that by playing along they would have influence later. Tech workers today often seem to care a lot about how products made by their companies are deployed.
> The field is something like 5 years old.
I’m not sure what you are imagining as ‘the field’, but isn’t it closer to twenty years old? (Both numbers are, of course, much less than the age of the AI field, or of computer science more broadly.)
Much of the source of my worry is that I think in the first ten to twenty years of work on safety, we mostly got impossibility and difficulty results, and so “let’s just try and maybe it’ll be easy” seems inconsistent with our experience so far.
But even after that, Caroline didn’t turn on Sam yet.
> The only sense in which it’s clear that it’s “for personal gain” is that it’s lying to get what you want. Sure, I’m with you that far—but if what someone wants is [a wonderful future for everyone], then that’s hardly what most people would describe as “for personal gain”.
If Alice lies in order to get influence, with the hope of later using that influence for altruistic ends, it seems fair to call the influence Alice gets ‘personal gain’. After all, it’s her sense of altruism that will be promoted, not a generic one.
> I have historically found myself with few allies in the comment section of the EA Forum for most of the history of EA, when I tried to stand up for letting people speak their mind and not be subject to crippling PR constraints.
>
> Also I have found myself supportive of people who I thought were in the EA ecosystem for the values and for doing what’s right, and then when I pointed out times that the ecosystem as a whole was not doing that, someone said to me “I try with a couple of percentage points of my time to help steer the egregore, but mostly I am busy with my work and life”, and I had the wind knocked out of my sails a bit.
>
> And yet I think it often has outsized effects and the costs aren’t that big.
I think I never really responded to this, but it was also probably the main generator of Ben’s opinion?
I’m not sure whether I would have said my initial “we” statement about EAs. (Part of this is just being less confident about what EA social dynamics are like; another part is thinking EAs are less fractious than rationalists.)