While I agree with this as a philosophical stance, it seems clear that, at least in humans, facts and values are in practice entangled.
I agree humans absorb (terminal) values from people around them. But this property isn’t something I want in a powerful AI. I think it’s clearly possible to design an agent that doesn’t have the “absorbs terminal values” property, do you agree?
Even if the terminal value doesn’t change, there are facts that could result in behavior that, to a reasonable outside observer, seems to reflect a conflicting value.
Yeah! I see this as a different problem from the value binding problem, but just as important. We can split it into two cases:
The new beliefs (that lead to bad actions) are false.[1] To avoid this happening we need to do a good job designing the epistemics of the AI. It'll be impossible to rule out misleading false beliefs with certainty, but I expect there to be statistical-learning-style results saying that they're unlikely and become more unlikely with more observation and thinking.
The new beliefs (that lead to bad actions) are true, and unknown to the human AI designers (e.g. we're in a simulation and the gods of the simulation have set things up such that the best thing for the AI to do looks evil from the human perspective). In this case the AI is acting in our interest. Maybe out of caution we want to design the AI's values such that it wants to shut down in circumstances this extreme, just in case there's actually been an epistemic problem and we're in case 1. A toy sketch of that shutdown clause follows below.
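To make the shutdown idea concrete, here's a minimal toy sketch (my own illustration, not a worked-out proposal). Everything in it is hypothetical: the `p_extreme_world` credence, the `SHUTDOWN_THRESHOLD`, and the `choose_action` rule are stand-ins, and real epistemics would obviously be much more involved. The point is just that the terminal value stays fixed while a sufficiently extreme-looking world triggers shutdown rather than action:

```python
# Toy sketch (illustration only): an agent whose terminal value is fixed,
# but which prefers shutting down whenever its credence that it is in an
# "extreme" world (case 2) is high enough that an epistemic failure
# (case 1) is a live possibility.

from dataclasses import dataclass


@dataclass
class Beliefs:
    p_extreme_world: float  # credence that the "best" action would look evil to humans
    n_observations: int     # more observation/thinking should shrink epistemic errors


def choose_action(beliefs: Beliefs, candidate_actions: list[str]) -> str:
    # Hypothetical threshold: if the agent thinks it is plausibly in a
    # simulation-gods-style scenario, it treats its own epistemics as suspect
    # and shuts down rather than acting on the belief.
    SHUTDOWN_THRESHOLD = 0.05
    if beliefs.p_extreme_world > SHUTDOWN_THRESHOLD:
        return "shutdown"
    # Otherwise act normally; placeholder for ordinary expected-value choice.
    return candidate_actions[0]


if __name__ == "__main__":
    print(choose_action(Beliefs(p_extreme_world=0.2, n_observations=10), ["help_humans"]))
    # -> "shutdown"
    print(choose_action(Beliefs(p_extreme_world=0.001, n_observations=10_000), ["help_humans"]))
    # -> "help_humans"
```

The design choice being illustrated: the shutdown behaviour lives in the decision rule, not in a changed terminal value, so it covers both the "beliefs are true but alarming" case and the "beliefs are false" case without the agent having absorbed any new values.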
I’m assuming a correspondence theory of truth, where beliefs can be said to be true or false without reference to actions or values. This is often a crux with people who are into active inference or shard theory.