I’ve certainly considered this—and I’m pretty sure I got the idea from Eliezer_2001. He has some made-up phrase that ends in ‘semantics’ that means “figure out what makes people do what they do, find the part that looks moral, and do that.”
The main trouble with the straight-up interpretation is that humans don’t so much have a morality as we have a treasure map for finding morality, and modeling us as utility-maximizers doesn’t capture this well. Which over the long term is pretty undesirable—it would be like if the ancient Greeks built an AI and it still had the preconceptions of the ancient Greeks. So either you can pour tons of resources into modeling humans as utility-maximizers, probably hitting overfitting problems (that is, to actually get a utility function over histories rather than world states, you always get some troublesome utilities for situations humans haven’t experienced yet, which have more to do with the model you use than with any properties of humans), or you can use a different abstraction. E.g. find some way of representing “treasure map” algorithms where it makes sense to add them together.
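To make the overfitting worry concrete, here’s a toy sketch (entirely hypothetical, not anyone’s actual proposal): infer a “utility function” from choices observed only in a narrow range of situations, using two different model classes. Both fit the observed data about equally well, but in situations nobody was ever observed in, the utilities they assign diverge wildly—telling you about the model class, not about humans.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Observed" situations and noisy revealed preferences, all within [0, 1].
x_seen = np.linspace(0.0, 1.0, 20)
u_seen = x_seen - 0.5 * x_seen**2 + rng.normal(0.0, 0.01, x_seen.shape)

# Two different model classes, both of which fit the observations well.
simple = np.polyfit(x_seen, u_seen, deg=3)
flexible = np.polyfit(x_seen, u_seen, deg=9)

# Inside the observed range, the two fitted "utility functions" agree...
in_sample_gap = np.max(
    np.abs(np.polyval(simple, x_seen) - np.polyval(flexible, x_seen))
)

# ...but at a situation no human was ever observed in, the utility assigned
# is driven by the choice of model class, not by anything in the data.
x_new = 5.0
out_of_sample_gap = abs(np.polyval(simple, x_new) - np.polyval(flexible, x_new))
```

The point is just that every extrapolation a fitted utility function makes beyond the observed data is an artifact of the representation you chose, which is why a different abstraction (like composable “treasure map” algorithms) might be worth the trouble.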