This doesn’t make sense to me, particularly since I believe that most people live in environments that is very much” in distribution”, and it is difficult for us to discuss misalignment without talking about extreme cases (as I described in the previous comment), or subtle cases (black swans?) that may not seem to matter.
I think you’re ignoring the [now bolded part] in “a particular human’s learning process + reward circuitry + “training” environment” and just focusing in the environment. Humans very often don’t optimize for their reward circuitry in their limbic system. If I gave you a button that killed everyone but maximized your reward circuitry every time you pressed it, most people wouldn’t press it (would you?). I do agree that if you pressed the button once, you would then want to press the button again, but not beforehand which is an inner-misalignment w/ respect to the reward circuitry. Though maybe you’d say the wirehead thing is an extreme case OOD?
By inner values, I mean terminal goals. Wanting dogs to be happy is not a terminal goal for most people, and I believe that given enough optimization pressure, the hypothetical dog-lover would abandon this goal to optimize for what their true terminal goal is.
I agree, but I’m bolding “most people” because you’re claiming there exist some people that would retain that value if scaled up(?) I think replace “dog-lover” w/ “family-lover” and there’s even more people. But I don’t think this is a disagreement between us?
My bad; I’ve updated the comment to clarify that I believe Quintin claims that solving / preventing inner misalignment is easier than one would expect given the belief that evolution’s failure at inner alignment is the most significant and informative evidence that inner alignment is hard.
Oh, I think inner-misalignment w/ respect to the reward circuitry is a good, positive thing that we want, so there’s the disconnect (usually misalignment is thought of as bad, and I’m not just mistyping). Human values are formed by inner-misalignment and they have lots of great properties such as avoiding ontological crises, valuing real world things (like diamond maximizer in the OP), and a subset of which cares for all of humanity. We can learn more about this process by focusing more on the “a particular human’s learning process + reward circuitry+ “training” environment” part, and less on the evolution part.
If we understand the underlying mechanisms behind human value formation through inner-misalignment w/ respect to the reward circuitry, then we might be able to better develop the theory of learning systems developing values, which includes AGI.
Oh, I think inner-misalignment w/ respect to the reward circuitry is a good, positive thing that we want, so there’s the disconnect (usually misalignment is thought of as bad, and I’m not just mistyping).
Yes, thank you: I didn’t notice that you were making that assumption. This conversation makes a lot more sense to me now.
Human values are formed by inner-misalignment and they have lots of great properties such as avoiding ontological crises, valuing real world things (like diamond maximizer in the OP), and a subset of which cares for all of humanity. We can learn more about this process by focusing more on the “a particular human’s learning process + reward circuitry + “training” environment” part, and less on the evolution part.
If we understand the underlying mechanisms behind human value formation through inner-misalignment w/ respect to the reward circuitry, then we might be able to better develop the theory of learning systems developing values, which includes AGI.
This seems to imply that the aim of this alignment proposal is to solve the alignment problem by aligning the inner values with that of the creators of the AI and bypassing the outer alignment problem. That is really interesting; I’ve updated in the direction of shard theory being more viable as an alignment strategy than I previously believed. I’m still confused about huge parts of it, but we can discuss it more elsewhere.
I think you’re ignoring the [now bolded part] in “a particular human’s learning process + reward circuitry + “training” environment” and just focusing in the environment. Humans very often don’t optimize for their reward circuitry in their limbic system. If I gave you a button that killed everyone but maximized your reward circuitry every time you pressed it, most people wouldn’t press it (would you?). I do agree that if you pressed the button once, you would then want to press the button again, but not beforehand which is an inner-misalignment w/ respect to the reward circuitry. Though maybe you’d say the wirehead thing is an extreme case OOD?
I agree, but I’m bolding “most people” because you’re claiming there exist some people that would retain that value if scaled up(?) I think replace “dog-lover” w/ “family-lover” and there’s even more people. But I don’t think this is a disagreement between us?
Oh, I think inner-misalignment w/ respect to the reward circuitry is a good, positive thing that we want, so there’s the disconnect (usually misalignment is thought of as bad, and I’m not just mistyping). Human values are formed by inner-misalignment and they have lots of great properties such as avoiding ontological crises, valuing real world things (like diamond maximizer in the OP), and a subset of which cares for all of humanity. We can learn more about this process by focusing more on the “a particular human’s learning process + reward circuitry + “training” environment” part, and less on the evolution part.
If we understand the underlying mechanisms behind human value formation through inner-misalignment w/ respect to the reward circuitry, then we might be able to better develop the theory of learning systems developing values, which includes AGI.
Yes, thank you: I didn’t notice that you were making that assumption. This conversation makes a lot more sense to me now.
This seems to imply that the aim of this alignment proposal is to solve the alignment problem by aligning the inner values with that of the creators of the AI and bypassing the outer alignment problem. That is really interesting; I’ve updated in the direction of shard theory being more viable as an alignment strategy than I previously believed. I’m still confused about huge parts of it, but we can discuss it more elsewhere.