I think you’re mostly right about the problem, but the conclusion doesn’t follow.
First, a nitpick: if you find out you’re being manipulated, your terminal values shouldn’t change (unless your mind is broken somehow, or not reflectively stable).
But there’s a similar issue: You could discover that all your previous observations were fed to you by a malicious demon, and nothing that you previously cared about actually exists. So your values don’t bind to anything in the new world you find yourself in.
In that situation, how do we want an AI to act? There are a few options, but doing nothing seems like a good default. Constructing the value binding algorithm such that this is the resulting behaviour doesn’t seem that hard, but it might not be trivial.
Does this engage with what you’re saying?
Thanks! This is a reply to both you and @nowl, who I take to be (in part) making similar claims.
I could have been clearer in both my thinking and my wording.
Because of the is-ought gap, value (how I want the world to be) doesn’t inherently change in response to evidence (how the world is). [nowl]
While I agree with this as a philosophical stance, it seems clear that, at least in humans, facts and values are in practice entangled. As a classic example in fiction, if you’ve absorbed some terminal values from your sensei and then your sensei turns out to be the monster who murdered your parents, you’re likely to reject some of those values as a result.
But
there’s a similar issue: You could discover that all your previous observations were fed to you by a malicious demon, and nothing that you previously cared about actually exists. So your values don’t bind to anything in the new world you find yourself in.
This is closer to what I meant to point to — something like values-as-expressed-in-the-world. Even if the terminal value doesn’t change, there are facts that could result in behavior that, to a reasonable outside observer, seems to reflect a conflicting value.
For example: if we successfully instilled human welfare as a terminal value into an AI, there is (presumably) some set of observations which could convince the system that all the apparent humans it comes in contact with are actually aliens in human suits, who are holding all the real humans captive and tormenting them. Therefore, the correct behavior implied by the terminal value is to kill all the apparent humans so that the real humans can be freed, while disbelieving any supposed counterevidence presented by the ‘humans’.
To an outside observer — e.g. a human now being hunted by the AI — this is for all intents and purposes the same as the AI no longer having human welfare as a terminal value.
Possibly I’m now just reinventing epiwheels?
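To make the ‘same value, different beliefs’ point concrete, here’s a toy sketch (purely illustrative; every name and number below is invented by me, not taken from any actual proposal). The utility function stands in for the fixed terminal value of human welfare, and only the world model changes:

```python
from typing import Dict

# Fixed terminal value: utility over (true) states of the world.
UTILITY = {
    "humans_free": 1.0,
    "humans_captive_and_tormented": -1.0,
}

ACTIONS = ["cooperate_with_apparent_humans", "attack_apparent_humans"]

def outcome(action: str, hypothesis: str) -> str:
    """What the agent predicts each action leads to under each hypothesis."""
    if hypothesis == "apparent_humans_are_real":
        # Attacking real humans harms them; cooperating leaves them free.
        return ("humans_free" if action == "cooperate_with_apparent_humans"
                else "humans_captive_and_tormented")
    # hypothesis == "apparent_humans_are_aliens_in_suits"
    # Under the delusion, attacking the 'aliens' is what frees the real humans.
    return ("humans_captive_and_tormented" if action == "cooperate_with_apparent_humans"
            else "humans_free")

def best_action(belief: Dict[str, float]) -> str:
    """Maximise expected utility under the current beliefs; the values never change."""
    def expected_utility(action: str) -> float:
        return sum(p * UTILITY[outcome(action, h)] for h, p in belief.items())
    return max(ACTIONS, key=expected_utility)

# Same utility function, different beliefs, opposite behaviour:
print(best_action({"apparent_humans_are_real": 0.99,
                   "apparent_humans_are_aliens_in_suits": 0.01}))
# -> cooperate_with_apparent_humans
print(best_action({"apparent_humans_are_real": 0.01,
                   "apparent_humans_are_aliens_in_suits": 0.99}))
# -> attack_apparent_humans
```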
While I agree with this as a philosophical stance, it seems clear that, at least in humans, facts and values are in practice entangled.
I agree humans absorb (terminal) values from people around them. But this property isn’t something I want in a powerful AI. I think it’s clearly possible to design an agent that doesn’t have the “absorbs terminal values” property; do you agree?
Even if the terminal value doesn’t change, there are facts that could result in behavior that, to a reasonable outside observer, seems to reflect a conflicting value.
Yeah! I see this as a different problem from the value binding problem, but just as important. We can split it into two cases:
1. The new beliefs (that lead to bad actions) are false.[1] To avoid this happening, we need to do a good job designing the epistemics of the AI. It’ll be impossible to avoid misleading false beliefs with certainty, but I expect there to be statistical-learning-type results saying that such beliefs are unlikely, and become more unlikely with more observation and thinking (a toy gesture at the shape of such a result is sketched after the two cases).
2. The new beliefs (that lead to bad actions) are true, and unknown to the human AI designers (e.g. we’re in a simulation, and the gods of the simulation have set things up such that the best thing for the AI to do looks evil from the human perspective). The AI is acting in our interest here. Maybe, out of caution, we want to design the AI’s values such that it wants to shut down in circumstances this extreme, just in case there’s been an epistemic problem and it’s actually case 1.
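The most basic result with that shape is a concentration bound. This isn’t the whole story (it assumes independent, honestly sampled observations, which the demon scenario violates), but it shows the ‘unlikely, and more unlikely with more observation’ behaviour I mean:

```python
import math

def hoeffding_bound(n: int, eps: float) -> float:
    """Hoeffding's inequality: an upper bound on the probability that an
    empirical frequency from n i.i.d. observations in [0, 1] differs from
    the true frequency by more than eps."""
    return min(1.0, 2 * math.exp(-2 * n * eps ** 2))

for n in (10, 100, 1_000, 10_000):
    print(n, hoeffding_bound(n, eps=0.1))
# 10     1.0   (no guarantee yet)
# 100    ~0.27
# 1000   ~4e-09
# 10000  ~3e-87
```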
[1] I’m assuming a correspondence theory of truth, where beliefs can be said to be true or false without reference to actions or values. This is often a crux with people who are into active inference or shard theory.
[addendum]
In that situation, how do we want an AI to act? There are a few options, but doing nothing seems like a good default. Constructing the value binding algorithm such that this is the resulting behaviour doesn’t seem that hard, but it might not be trivial.
(and I imagine that the kind of ‘my values bind to something, but in such a way that it’ll cause me to take very different actions than before’ I describe above is much harder to specify)
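For concreteness, here’s the contrast as I see it (a toy sketch of my own with invented names, not anyone’s actual design; the ‘do nothing’ branch is the easy part, and everything difficult is hidden in the world model and the planner):

```python
from typing import List, Optional

def bind_values(world_model: dict) -> Optional[List[str]]:
    """Try to locate the things the values are about (here: humans) in the
    agent's current world model. Return None if nothing matches."""
    referents = [x for x in world_model.get("objects", []) if x.startswith("human")]
    return referents or None

def plan(referents: List[str], world_model: dict) -> str:
    """Placeholder for ordinary expected-utility planning."""
    return "help_" + referents[0]

def act(world_model: dict) -> str:
    referents = bind_values(world_model)
    if referents is None:
        # The easy case: nothing the values are about exists in the revised
        # world model, so fall back to a safe default of doing nothing.
        return "do_nothing"
    # The hard case is not binding *failing* but binding *succeeding* against
    # a radically revised world model (e.g. 'the real humans are captive
    # elsewhere'), where ordinary planning can output actions that look
    # value-reversed from the outside.
    return plan(referents, world_model)

print(act({"objects": []}))                       # -> do_nothing
print(act({"objects": ["human_alice", "rock"]}))  # -> help_human_alice
```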