Many different training scenarios are teaching your AI the same instrumental lessons, about how to think in accurate and useful ways. Furthermore, those lessons are underwritten by a simple logical structure, much like the simple laws of arithmetic that abstractly underwrite a wide variety of empirical arithmetical facts about what happens when you add four people’s bags of apples together on a table and then divide the contents among two people.
But that attractor well? It’s got a free parameter. And that parameter is what the AGI is optimizing for. And there’s no analogously-strong attractor well pulling the AGI’s objectives towards your preferred objectives.
I agree that there is no “analogously-strong attractor well pulling the AGI’s objectives” toward what I want, but I’m not convinced that we need an analogously-strong attractor well. Suppose that the attractor well of “thinking in accurate and useful ways” has 100 attractor-strength units (because it is dictated by simple empirical facts and laws), while the attractor well of “being a nice AI” has X units of strength. We agree that X must be less than 100 because “niceness” is hard to train for various reasons.
Would I be correct in saying that you think X is less than 10 (or at least, dramatically weaker than 100 strength units)? I, OTOH, expect that X is something like 50 (with wide error margins), and I expect that the "being a nice AI" well is strong enough to keep my probability of loss of control below 50%. I think that RL feedback on virtuousness is "strong enough" to engender a nice AI, and I think that how strong you believe the "niceness" attractor state to be is quite cruxy for assessing the probability of a sharp left turn.
FWIW, I don’t know enough about RL to have strong evidence for my view, but I struggle to understand why you are so confident that a sharp left turn will occur. I would really appreciate understanding your POV better! Thanks!