The theoretical argument can be found here: https://arxiv.org/abs/1712.05812 ; basically, “goals plus (ir)rationality” contains strictly more information than “full behaviour or policy”.
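A minimal toy sketch of that claim, not taken from the linked paper: the names `rational`, `anti_rational`, `likes_tea` and `likes_coffee` are hypothetical, and the two-action world is an assumption made purely for illustration. It shows two different (planner, reward) pairs that produce exactly the same behaviour, so observing the behaviour alone cannot tell them apart.

```python
from typing import Callable, Dict, List

Action = str
Reward = Dict[Action, float]
Planner = Callable[[Reward, List[Action]], Action]

# Hypothetical two-action world, chosen only to keep the example tiny.
ACTIONS: List[Action] = ["tea", "coffee"]


def rational(reward: Reward, actions: List[Action]) -> Action:
    """Planner that picks the action with the highest reward."""
    return max(actions, key=lambda a: reward[a])


def anti_rational(reward: Reward, actions: List[Action]) -> Action:
    """Planner that picks the action with the lowest reward."""
    return min(actions, key=lambda a: reward[a])


# Two opposite sets of goals.
likes_tea: Reward = {"tea": 1.0, "coffee": 0.0}
likes_coffee: Reward = {"tea": 0.0, "coffee": 1.0}

# Two different decompositions of the agent into (planner, reward).
behaviour_1 = rational(likes_tea, ACTIONS)
behaviour_2 = anti_rational(likes_coffee, ACTIONS)

# Both decompositions yield the same observable behaviour.
print(behaviour_1, behaviour_2)
```

Both decompositions pick the same action, even though one says the agent wants tea and pursues it competently, while the other says it wants coffee and reliably fails. Something beyond the observed behaviour is needed to choose between them.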
Humans have a theory of mind that allows us to infer the preferences and rationality of others (and ourselves) with a large amount of agreement from human to human. In computer science terms, we can take agent behaviour and add “labels” about the agent’s goals (“this human is ‘happy’ ”; “they have ‘failed’ to achieve their goal”, etc...).
But accessing this theory of mind is not trivial; we either have to define it explicitly, or point to where in the human mind it resides (or, most likely, a mixture of the two). One way or another, we need to give the AI enough labelled information that it can correctly infer this theory of mind; unlabelled information (i.e. pure observations) is not enough.
If we have access to the internals of the human brain, the task is easier, because we can point to various parts of it and say things like “this is a pleasure centre, this part is involved in retrieval of information, etc...”. We still need labelled information, but we can (probably) get away with less.
I think I understand now. My best guess is that if your proof were applied to my example, the conclusion would be that my example only pushes the problem back. To specify human values via a method like the one I was suggesting, you would still need to specify the part of the algorithm that “feels like” it has values, which is a similar type of problem.
I think I hadn’t grokked that your proof says something about the space of all abstract value/knowledge systems whereas my thinking was solely about humans. As I understand it, an algorithm that picks out human values from a simulation of the human brain will correspondingly do worse on other types of mind.