I think the probability of “prior to total human obsolescence, AIs will be seriously misaligned, broadly strategic about achieving long run goals in ways that lead to scheming, and present a basically unified front (at least in the context of AIs within a single AI lab)” is “only” about 10-20%, but this is plausibly the cause of about half of misalignment-related risk prior to human obsolescence.
I’d want to break apart this claim into pieces. Here’s a somewhat sketchy and wildly non-robust evaluation of how I’d rate these claims:
Assuming the claims are about the most powerful AIs in the world...
“prior to total human obsolescence...
“AIs will be seriously misaligned”
If “seriously misaligned” means “reliably takes actions intended to cause the ruin and destruction of the world in the near-to-medium term (from our perspective)”, I’d rate this as maybe 5% likely
If “seriously misaligned” means “if given full control over the entire world along with godlike abilities, would result in the destruction of most things I care about due to extremal Goodhart and similar things”, I’d rate this as 50% likely
“broadly strategic about achieving long run goals in ways that lead to scheming”
I’d rate this as 65% likely
“present a basically unified front (at least in the context of AIs within a single AI lab)”
For the most powerful AIs in the world, I’d rate this as 15% likely
For the most powerful AIs within the top AI lab, I’d rate this as 25% likely
Conjunction of all these claims (a rough independence baseline follows this list):
Taking the conjunction of the strong interpretation of every claim: 3% likely?
Taking a relatively charitable weaker interpretation of every claim: 20% likely
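As a quick sanity check, here’s what these marginals give under a naive independence assumption (a rough sketch; since the claims are correlated, the actual conjunction can sit well above these products):

```python
# Independence baseline for the conjunction, using the marginals above.
# This is only a lower-anchor sketch: the component claims are plausibly
# positively correlated, so the true joint probability can exceed these products.

p_misaligned_strong = 0.05  # "reliably takes ruinous actions" reading
p_misaligned_weak = 0.50    # "bad outcomes given godlike power" reading
p_scheming = 0.65           # broadly strategic in ways that lead to scheming
p_unified_top_lab = 0.25    # basically unified front within the top AI lab

strong = p_misaligned_strong * p_scheming * p_unified_top_lab
weak = p_misaligned_weak * p_scheming * p_unified_top_lab
print(f"strong reading, if independent: {strong:.1%}")  # ~0.8%
print(f"weak reading, if independent:   {weak:.1%}")    # ~8.1%
```

The 3% and 20% figures above both exceed the corresponding products, which reflects treating the component claims as positively correlated rather than independent.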
It’s plausible we don’t disagree much about the main claims here and mainly disagree instead about:
The relative value of working on technical misalignment compared to other issues
The likelihood of non-misalignment problems relative to misalignment problems
The amount of risk we should be willing to tolerate during the deployment of AIs
Are you conditioning on the prior claims when stating your probabilities? Many of these properties are highly correlated. E.g., “seriously misaligned” and “broadly strategic about achieving long run goals in ways that lead to scheming” seem very correlated to me. (Your probabilities seem higher than I would have expected without any correlation, but I’m unsure.)
I think we probably disagree about the risk due to misalignment by a factor of 2-4 or so. But probably more of the crux is in the value of working on other problems.
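One rough way to quantify how much correlation the 20% figure implies (a sketch assuming the weak-reading marginals from the breakdown above):

```python
# How much positive dependence would the stated 20% conjunction imply,
# given the weak-reading marginals? A rough sketch.

marginals = [0.50, 0.65, 0.25]  # misaligned (weak), scheming, unified front
stated_joint = 0.20

independent = 1.0
for p in marginals:
    independent *= p  # ~0.081 if the claims were independent

# Frechet upper bound: no joint distribution can exceed the minimum marginal.
frechet_upper = min(marginals)  # 0.25

print(f"lift over independence: {stated_joint / independent:.1f}x")            # ~2.5x
print(f"share of the Frechet upper bound: {stated_joint / frechet_upper:.0%}")  # 80%
```

On these numbers, 20% is about 2.5x the independent product and 80% of the Fréchet upper bound (the minimum of the marginals, which no correlation structure can exceed), i.e., it implies quite strong positive dependence between the claims.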
I’m not conditioning on prior claims.
One potential reason you might have inferred that I was is that my credence for scheming is high relative to what you might have expected given my other claim about “serious misalignment”. My explanation is that I tend to interpret “AI scheming” as a relatively benign behavior, in context. If we define scheming as:
behavior intended to achieve some long-term objective that is not quite what the designers had in mind
not being fully honest with the designers about its true long-term objectives (especially in the sense of describing accurately what it would do with unlimited power)
then I think scheming is ubiquitous and usually relatively benign, when performed by rational agents without godlike powers. For example, humans likely “scheme” all the time by (1) pursuing long-term plans, and (2) not being fully honest to others about what they would do if they became god. This is usually not a big issue because agents don’t generally get the chance to take over the world and do a treacherous turn; instead, they have to play the game of compromise and trade like the rest of us, along with all the other scheming AIs, who have different long-term goals.