I think both (i) and (ii) are directionally correct. I had exactly this idea in mind when I wrote this draft.
Maybe I shouldn’t have used “Goals” as the term of art for this post, but rather “Traits”? Or “Principles”? Or “Virtues”?
It sounds like you are a fan of Hypothesis 3 (“Unintended version of written goals and/or human intentions”)? Because it sounds like you are saying things will probably be wrong-but-not-totally-wrong relative to the Spec / dev intentions.
It seems to me that the consequentialist vs virtue-driven axis is mostly orthogonal to the hypotheses here.
As written, aren’t Hypothesis 1 (written goal specification), Hypothesis 2 (developer-intended goals), and Hypothesis 3 (unintended version of written goals and/or human intentions) all compatible with either kind of AI?
Hypothesis 4 (reward/reinforcement) does assume a consequentialist, and so does Hypothesis 5 (proxies and/or instrumentally convergent goals) as written, although it seems like “proxy virtues” could maybe be a thing too?
(Unrelatedly, it’s not that natural to me to group proxy goals with instrumentally convergent goals, but maybe I’m missing something.)
> Maybe I shouldn’t have used “Goals” as the term of art for this post, but rather “Traits”? Or “Principles”? Or “Virtues”?
I probably wouldn’t prefer any of those to “goals”. I might use “Motivations”, but I also think it’s OK to use “goals” in this broader way and “consequentialist goals” when you want to make the distinction.

> It seems to me that the consequentialist vs virtue-driven axis is mostly orthogonal to the hypotheses here.

I agree it’s mostly orthogonal.
I also agree that the question of whether AIs will be driven (purely) by consequentialist goals or whether they will (to a significant extent) be constrained by deontological principles / virtues / etc. is an important question.
I think it’s downstream of the spread of hypotheses discussed in this post, such that we can make faster progress on it once we’ve made progress eliminating hypotheses from this list.
Like, suppose you think Hypothesis 1 is true: They’ll do whatever is in the Spec, because Constitutional AI or Deliberative Alignment or whatever Just Works. On this hypothesis, the answer to your question is “well, what does the Spec say? Does it just list a bunch of goals, or does it also include principles? Does it say it’s OK to overrule the principles for the greater good, or not?”
Meanwhile suppose you think Hypothesis 4 is true. Then it seems like you’ll be dealing with a nasty consequentialist, albeit hopefully a rather myopic one.
> I think it’s downstream of the spread of hypotheses discussed in this post, such that we can make faster progress on it once we’ve made progress eliminating hypotheses from this list.
Fair enough, yeah—this seems like a very reasonable angle of attack.