Insofar as human capabilities are not largely explained by consequentialist planning (as suggested by the approval reward picture), this should make us more optimistic about human-level AGI alignment.
Further, this picture might suggest that the cheapest route to human-level AGI runs through approval-reward-like mechanisms, giving us a large negative alignment tax.
Of course, you might think that getting approval reward to work is actually a very narrow target, and that even if early human-level AGIs aren’t coherent consequentialists, they will use some other learning mechanism that doesn’t route through approval reward and thus doesn’t inherit its potentially nice alignment properties (or you could think that early human-level AGIs will be more like coherent consequentialists).
…human capabilities are not largely explained by consequentialist planning…
I think I disagree with this. I would instead say something like: “Humans are the least intelligent species capable of building a technological civilization; but to the extent that humans have capabilities relevant to that, those capabilities CAN generally be explained by consequentialist planning; the role of Approval Reward is more about what people want than how capable they are of getting it.”
Note that in this post I’m mostly focusing on the median human, who I claim spends a great deal of their life in simulacrum level 3. I’m not centrally talking about humans who are nerds, or unusually “agential”, etc., a category that includes most successful scientists, company founders, etc. If everyone were doing simulacrum 3 all the time, I don’t think humans would have invented science and technology. Maybe related: discussion of the “sapient paradox” here.
Yeah, I think I agree with all that… (e.g., a psychopath can definitely learn language, accomplish things in the world, etc.)
Maybe the thought experiment with the 18-year-old just prompted me to think about old arguments around “the consequentialist core” that aren’t centrally about approval reward (and are more about whether myopic rewards can elicit consequentialist-ish and aligned planning).