Why do you expect it to be hard to specify, given a model that knows the information you’re looking for? In general, the core lesson of unsupervised learning is that often the best way to get pointers to something you only have a limited specification for is to learn some other task that necessarily includes it, then specialize to that subtask. Why should values be any different? Broadly, why should values be harder to get good pointers to than much more complicated real-world tasks?
How would you design a task that incentivizes a system to output its true estimates of human values? We don’t have ground truth for human values, because they’re mind states, not behaviors.
Seems easier to create incentives for things like “wash dishes without breaking them”; there, you can just tell.
I think I can just tell a lot of stuff with respect to human values! How do you think children infer them? I think that for human values not to be viable to point to extensionally (i.e., by looking at a bunch of examples), you have to make the case that they’re much more built into the human brain than seems appropriate for a species that can produce both Jains and (Genghis Khan era) Mongols.
I’d also note that “incentivize” is probably giving a lot of the game away here—my guess is you can just pull them out much more directly by gathering a large dataset of human preferences and predicting judgements.
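To make that concrete, here is a minimal sketch (not anything proposed above) of what “gathering a large dataset of human preferences and predicting judgements” could look like: a small Bradley–Terry-style model trained to score outcomes so that human-preferred ones score higher. The model class, toy data, and hyperparameters are all illustrative assumptions.

```python
# Minimal sketch: predict human pairwise judgements with a Bradley-Terry-style
# preference model. All names and the toy data below are illustrative.
import torch
import torch.nn as nn

class PreferenceModel(nn.Module):
    """Scores an outcome; higher score means predicted to be preferred by humans."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

# Toy stand-in data: each row is a feature vector for an outcome; `preferred`
# and `rejected` rows are paired according to a human judgement.
dim = 8
preferred = torch.randn(256, dim) + 0.5   # outcomes humans chose
rejected = torch.randn(256, dim) - 0.5    # outcomes humans passed over

model = PreferenceModel(dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    # Bradley-Terry loss: maximize log P(preferred beats rejected)
    margin = model(preferred) - model(rejected)
    loss = -torch.nn.functional.logsigmoid(margin).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```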
If you define “human values” as “what humans would say about their values across situations”, then yes, predicting “human values” is a reasonable training objective. Those just aren’t really what we “want” as agents, and agentic humans would have motives not to let the future be controlled by an AI optimizing for human approval.
That’s also not how I defined human values; my definition is based on the assumption that the human brain contains one or more expected utility maximizers. It’s possible that the objectives of these maximizers are affected by socialization, but they’ll be less affected by socialization than verbal statements about values are, because they’re harder to fake and so less subject to preference falsification.
Children learn some sense of what they’re supposed to say about values, but have some pre-built sense of “what to do / aim for” that’s affected by evopsych and so on. It seems like there’s a huge semantic problem with talking about “values” in a way that’s ambiguous between “in-built evopsych-ish motives” and “things learned from culture about what to endorse”, but Yudkowsky writing on complexity of value is clearly talking about stuff affected by evopsych. I think it was a semantic error for the discourse to use the term “values” rather than “preferences”.
In the section on subversion I made the case that terminal values make much more difference to subversive behavior than to compliant behavior.
It seems like, to get at the values of the approximate utility maximizers located in the brain, you would need something like Goal Inference as Inverse Planning rather than just predicting behavior.
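As a rough illustration of that contrast, here is a minimal sketch of goal inference as inverse planning in the spirit of Baker, Tenenbaum, and Saxe: rather than predicting the next action directly, it infers a posterior over candidate goals from observed actions, assuming the observed agent is noisily rational about approaching its goal. The toy gridworld, candidate goals, rationality temperature, and trajectory are illustrative assumptions.

```python
# Minimal sketch: infer the goal behind observed behavior (inverse planning),
# assuming a Boltzmann-rational agent that tends to reduce distance to its goal.
import numpy as np

goals = {"A": np.array([4, 0]), "B": np.array([0, 4])}     # candidate goals
actions = {"up": np.array([0, 1]), "down": np.array([0, -1]),
           "left": np.array([-1, 0]), "right": np.array([1, 0])}
beta = 2.0                                                  # rationality temperature

def action_likelihoods(state, goal):
    """P(action | state, goal) under a Boltzmann policy on distance reduction."""
    utilities = np.array([-np.linalg.norm((state + a) - goal) for a in actions.values()])
    expu = np.exp(beta * (utilities - utilities.max()))
    return dict(zip(actions, expu / expu.sum()))

# Observed trajectory: the agent starts at the origin and moves right twice.
trajectory = [(np.array([0, 0]), "right"), (np.array([1, 0]), "right")]

# Bayesian update over goals: posterior(goal) proportional to prior * likelihood of actions.
posterior = {g: 1.0 / len(goals) for g in goals}            # uniform prior
for state, act in trajectory:
    for g, goal in goals.items():
        posterior[g] *= action_likelihoods(state, goal)[act]
z = sum(posterior.values())
posterior = {g: p / z for g, p in posterior.items()}
print(posterior)   # goal "A" (to the agent's right) should dominate
```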