I’d been pretty much assuming that AGI labs’ “responsible scaling policies” are LARP/PR, and that if an RSP ever conflicts with their desire to release a model, either the RSP will be swiftly revised, or the testing suite for the model will be revised so that it doesn’t trigger the measures the AGI lab doesn’t want to trigger. I.e., that RSPs are toothless, and that their only purposes are to showcase how Responsible the lab is and to hype up how powerful a given model ended up being.
This seems to confirm that cynicism.
(The existence of the official page tracking the updates is a (smaller) update in the other direction, though. I don’t see why they’d have it if they consciously intended to RSP-hack this way.)
Employees at Anthropic don’t think the RSP is LARP/PR. My best guess is that Dario doesn’t think the RSP is LARP/PR.
This isn’t necessarily in conflict with most of your comment.
I think I mostly agree the RSP is toothless. My sense is that for any relatively subjective criteria, like making a safety case for misalignment risk, the criteria will basically come down to “what Jared+Dario think is reasonable”. Also, if Anthropic is unable to meet this (very subjective) bar, then Anthropic will still basically do whatever Anthropic leadership thinks is best, whether via maneuvering within the constraints of the RSP commitments, editing the RSP in ways which are defensible, or clearly and substantially loosening the RSP and then explaining they needed to do this due to other actors having worse precautions (as is allowed by the RSP). I currently don’t expect clear-cut and non-accidental procedural violations of the RSP (edit: and I think they’ll be pretty careful to avoid accidental procedural violations).
I’m skeptical of normal employees having significant influence on high stakes decisions via pressuring the leadership, but empirical evidence could change the views of Anthropic leadership.
How you feel about this state of affairs depends a lot on how much you trust Anthropic leadership to make decisions which are good from your perspective.
Minimally it’s worth noting that Dario and Jared are much less concerned about misalignment risk than I am and I expect only partial convergence in beliefs due to empirical evidence (before it’s too late).
I think the RSP still has a few important purposes:
- I expect that the RSP will eventually end up with some transparency commitments with some teeth. These won’t stop Anthropic from proceeding if Anthropic leadership thinks this is best, but they might at least mean there ends up being common knowledge of whether reasonable third parties (or Anthropic leadership) think the current risk is large.
- I think the RSP might end up with serious security requirements. I don’t expect these will be met on time in short timelines, but a security bar specified in advance might at least create some expectations about what baseline security should be.
- Anthropic might want to use the RSP to bind itself to the mast so that investors or other groups have a harder time pressuring it to spend less on security/safety.
- There are some other more tentative hopes (e.g., eventually getting common expectations of serious security or safety requirements which are likely to be upheld, or regulation) which aren’t impossible.
- And there are some small wins already, like Google DeepMind having set some security expectations for itself which it is reasonably likely to follow through with if it isn’t too costly.
Another note: My guess is that people on LessWrong tend to be overly pessimistic about Anthropic leadership (in terms of how good the decisions Anthropic leadership makes will be, judged by the LessWrong person’s own views and values), while Anthropic employees tend to be overly optimistic.
I’m less confident that people on LessWrong are overly pessimistic, but they at least seem too pessimistic about the intentions/virtue of Anthropic leadership.
For the record, I think the importance of the “intentions”/values of leaders of AGI labs is overstated. What matters most in the context of AGI labs is the virtue vs. power-seeking trade-off, i.e. the propensity to make dangerous moves (burn the commons) in order to unilaterally grab more power (in pursuit of whatever values).
Stuff like this op-ed, the broken promise of not meaningfully pushing the frontier, Anthropic’s obsession with and single-minded focus on automating AI R&D, Dario’s explicit calls to be the first to RSI AI, and Anthropic’s shady policy activity has provided ample evidence that their propensity to burn the commons to grab more power (probably in the name of some values I would mostly agree with, fwiw) is very high.
As a result, I’m now all-things-considered trusting Google DeepMind slightly more than Anthropic to do what’s right for AI safety. Google, as a big corp, is less likely to make unilateral power-grabbing moves (such as automating AI R&D asap to achieve a decisive strategic advantage), is more likely to comply with regulations, and is already fully independent when it comes to building AGI (compute/money/talent), so its incentives won’t degrade further. Additionally, D. Hassabis has been pretty consistent in his messaging about AI risks and AI policy, including the need for an IAEA/CERN for AI; Google has been mostly scaling up its safety efforts; and it has produced some of the best research on AI risk assessment (e.g. this excellent paper, or this one).
IMO, reasonableness and epistemic competence are also key factors. This includes stuff like how effectively they update on evidence, how much they are pushed by motivated reasoning, and how good they are at futurism and thinking about what will happen. I’d also include “general competence”.
(This is a copy of my comment made on your shortform version of this point.)
Not the main thrust of the thread, but for what it’s worth, I find it somewhat anti-helpful to flatten things into a single variable of “how much you trust Anthropic leadership to make decisions which are good from your perspective”, and then ask how optimistic/pessimistic you are about this variable.
I think I am much more optimistic about Anthropic leadership on many axes relative to an overall survey of the US population or Western population – I expect them to be more libertarian, more in favor of free speech, more pro-economic-growth, more literate, more self-aware, higher-IQ, and a bunch of other things.
I am more pessimistic than the people who are at Anthropic about their ability to withstand the pressures of a trillion-dollar industry trying to shape their incentives.
I believe the people working there are siloing themselves intellectually into an institution facing incredible financial incentives for certain bottom lines like “rapid AI progress is inevitable” and “it’s reasonably likely we can solve alignment” and “beating China in the race is a top priority”, and aren’t allowed to talk to outsiders about most details of their work, and this is a key reason that I expect them to screw up their decision-making.
I am optimistic about their ability, relative to most people on earth, to have a sensible conversation about the next 5 years and what alignment failures look like. This is not the standard I require to expect people to not do ML training runs that lead to human extinction, but nonetheless I predict they will do quite well on this axis, relatively speaking.
I don’t have a single variable here; I have a much more complicated model than this. It looks to me like collapsing questions of trust about people or groups into a single variable of how optimistic I am about them making decisions which are good by my values has been a common question-substitution in the Effective Altruism scene, where I think people have been repeatedly hoodwinked by sociopaths due to not moving toward a more detailed model that predicts exactly where and when someone will make good vs. bad decisions.
I certainly agree that the pressures and epistemic environment should make you less optimistic about good decisions being made. And that thinking through the overall situation and what types of decisions you care about is important. (Like, you can think of my comment as making a claim about the importance-weighted goodness of decisions.)
I don’t see the relevance of “relative decision-making goodness compared to the general population” (which I think you agree with), but in that case I don’t see what this was responding to.
Not sure I agree with other aspects of this comment and its implications. Like, I think reducing things to a variable like “how good is it to generically empower this person/group” is pretty reasonable in the case of Anthropic leadership, because in a lot of cases they’d have a huge amount of general open-ended power, though a detailed model (taking into account what decisions you care about, etc.) would need to feed into this.
What’s an example decision or two where you would want to ask yourself whether they should get more or less open-ended power? I’m not sure what you’re thinking of.
How good/bad is it to work on capabilities at Anthropic?
That’s the most clear-cut case, but lots of stuff trades off Anthropic’s power against other stuff.
I think the main thing I want to convey is that I think you’re saying that LWers (of which I am one) have a very low opinion of the integrity of people at Anthropic, but what I’m actually saying is that their integrity is no match for the forces that they are being tested with.
I don’t need to be able to predict a lot of fine details about individuals’ decision-making in order to have good estimates of these two quantities, and comparing them is the second most important question relating to whether it’s good to work on capabilities at Anthropic. (The first is a basic ethical question about working on a potentially extinction-causing technology, which is not much related to the details of which capabilities company you’re working at.)
This is related to what I was saying, but it wasn’t what I was saying. I was saying that people on LessWrong “tend to be overly pessimistic about Anthropic leadership (in terms of how good the decisions Anthropic leadership makes will be, judged by the LessWrong person’s own views and values)”. I wasn’t making a claim about the perceived absolute level of integrity.
Probably not worth hashing this out further; I think I get what you’re saying.
Yeah, I don’t think this is necessarily in contradiction with my comment. Things can be effectively just LARP/PR without being consciously LARP/PR. (Indeed, this is likely the case in most instances of LARP-y behavior.)
Agreed on the rest.