Straw-EY: Complexity of value means you can’t just get the make-AI-care part to happen by chance; it’s a small target.
Straw-MB: Ok but now we have a very short message pointing to roughly human values: just have a piece of code that says “and now call GPT and ask it what’s good”. So now it’s a very small number of bits.
To which I say: “dial a random phone number and ask the person who answers what’s good” can also be implemented with a small number of bits. In order for GPT-4 to be a major optimistic update about alignment, we need some specific way to leverage GPT-4 to crack open part of the alignment problem, even though we presumably agree that phone-a-friend doesn’t crack open part of the alignment problem. (Nor does phone-your-neighborhood-moral-philosopher, or phone-Paul-Christiano.)
This is a bad analogy. Phoning a human fails dominantly because humans are less smart than the ASI they would be trying to wrangle. Contra, Yudkowsky has even said that were you to bootstrap human intelligence directly, there is a nontrivial shot that the result is good. This difference is load bearing!
This does get to the heart of the disagreement, which I’m going to try to badly tap out on my phone.
The old, MIRI-style framing was essentially: we are going to build an AGI out of parts that are not intrinsically grounded in human values, but rather good abstract reasoning, during execution of which human values will be accurately deduced, and as this is after the point of construction, we hit the challenge of formally specifying what properties we want to preserve without being able to point to those runtime properties at specification.
The newer, contrasting framing is essentially: we are going to bulld an AGI out of parts that already have strong intrinsic, conceptual-level understanding of the values we want them to preserve, and being able to directly point at those values is actually needle-moving towards getting a good outcome. This is hard to do right now, with poor interpretability and steerability of these systems, but is nonetheless a relevant component of a potential solution.
It’s more like calling a human who’s as smart as you are and directly plugged into your brain and in fact reusing your world model and train of thought directly to understand the implications of your decision. That’s a huge step up from calling a real human over the phone!
The reason the real human proposal doesn’t work is that
the humans you call will lack context on your decision
they won’t even be able to receive all the context
they’re dumber and slower than you so even if you really could write out your entire chain of thoughts and intuitions consulting them for every decision would be impractical
Note that none of these considerations apply to integrated language models!
To which I say: “dial a random phone number and ask the person who answers what’s good” can also be implemented with a small number of bits. In order for GPT-4 to be a major optimistic update about alignment, we need some specific way to leverage GPT-4 to crack open part of the alignment problem, even though we presumably agree that phone-a-friend doesn’t crack open part of the alignment problem. (Nor does phone-your-neighborhood-moral-philosopher, or phone-Paul-Christiano.)
This is a bad analogy. Phoning a human fails dominantly because humans are less smart than the ASI they would be trying to wrangle. Contra, Yudkowsky has even said that were you to bootstrap human intelligence directly, there is a nontrivial shot that the result is good. This difference is load bearing!
This does get to the heart of the disagreement, which I’m going to try to badly tap out on my phone.
The old, MIRI-style framing was essentially: we are going to build an AGI out of parts that are not intrinsically grounded in human values, but rather good abstract reasoning, during execution of which human values will be accurately deduced, and as this is after the point of construction, we hit the challenge of formally specifying what properties we want to preserve without being able to point to those runtime properties at specification.
The newer, contrasting framing is essentially: we are going to bulld an AGI out of parts that already have strong intrinsic, conceptual-level understanding of the values we want them to preserve, and being able to directly point at those values is actually needle-moving towards getting a good outcome. This is hard to do right now, with poor interpretability and steerability of these systems, but is nonetheless a relevant component of a potential solution.
It’s more like calling a human who’s as smart as you are and directly plugged into your brain and in fact reusing your world model and train of thought directly to understand the implications of your decision. That’s a huge step up from calling a real human over the phone!
The reason the real human proposal doesn’t work is that
the humans you call will lack context on your decision
they won’t even be able to receive all the context
they’re dumber and slower than you so even if you really could write out your entire chain of thoughts and intuitions consulting them for every decision would be impractical
Note that none of these considerations apply to integrated language models!