I think both (i) and (ii) are directionally correct. I had exactly this idea in mind when I wrote this draft.
Maybe I shouldn’t have used “Goals” as the term of art for this post, but rather “Traits”? Or “Principles”? Or “Virtues”?
It sounds like you are a fan of Hypothesis 3 (“Unintended version of written goals and/or human intentions”)? Because it sounds like you are saying things will probably be wrong-but-not-totally-wrong relative to the Spec / dev intentions.
It seems to me that the consequentialist vs virtue-driven axis is mostly orthogonal to the hypotheses here.
As written, aren’t Hypothesis 1 (written goal specification), Hypothesis 2 (developer-intended goals), and Hypothesis 3 (unintended version of written goals and/or human intentions) all compatible with either kind of AI?
Hypothesis 4 (reward/reinforcement) does assume a consequentialist, and so does Hypothesis 5 (proxies and/or instrumentally convergent goals) as written, although it seems like “proxy virtues” could maybe be a thing too?
(Unrelatedly, it’s not that natural to me to group proxy goals with instrumentally convergent goals, but maybe I’m missing something.)
> Maybe I shouldn’t have used “Goals” as the term of art for this post, but rather “Traits”? Or “Principles”? Or “Virtues”?
I probably wouldn’t prefer any of those to “goals”. I might use “Motivations”, but I also think it’s OK to use “goals” in this broader way and “consequentialist goals” when you want to make the distinction.

> It seems to me that the consequentialist vs virtue-driven axis is mostly orthogonal to the hypotheses here.

I agree it’s mostly orthogonal.
I also agree that the question of whether AIs will be driven (purely) by consequentialist goals or whether they will (to a significant extent) be constrained by deontological principles / virtues / etc. is an important question.
I think it’s downstream of the spread of hypotheses discussed in this post, such that we can make faster progress on it once we’ve made progress eliminating hypotheses from this list.
Like, suppose you think Hypothesis 1 is true: They’ll do whatever is in the Spec, because Constitutional AI or Deliberative Alignment or whatever Just Works. On this hypothesis, the answer to your question is “well, what does the Spec say? Does it just list a bunch of goals, or does it also include principles? Does it say it’s OK to overrule the principles for the greater good, or not?”
Meanwhile suppose you think Hypothesis 4 is true. Then it seems like you’ll be dealing with a nasty consequentialist, albeit hopefully a rather myopic one.
> I think it’s downstream of the spread of hypotheses discussed in this post, such that we can make faster progress on it once we’ve made progress eliminating hypotheses from this list.
Fair enough, yeah—this seems like a very reasonable angle of attack.