How to prevent “aligned” AIs from unintentionally corrupting human values?
I discussed this in The Mutable Values Problem in Value Learning and CEV — it’s a hard problem. If we build a society out of AIs and humans, they’re going to be tightly coupled, and humans’ values are inevitably going to be influenced in many ways by the presence, actions, and results of AIs. For the AIs, “what values would humans have had if we hadn’t existed?” is going to become an increasingly hard-to-figure-out counterfactual. Add that to the fact that one ethical system’s corruption is another ethical system’s improvement, and it turns into a big hairy ethical challenge wrapped in an extremely complex, high-dimensional, nonlinear dynamical system. A random walk through a very high-dimensional space typically diverges.
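To make that last point concrete, here is a minimal sketch (not from the original post) simulating an unbiased random walk in a hypothetical high-dimensional “value space.” The dimension, step count, and step size are all illustrative assumptions; the point is just that, absent a restoring force, many small uncorrelated nudges accumulate rather than cancelling out.

```python
# Sketch: drift of an unbiased random walk in a high-dimensional space.
# All parameters below are illustrative assumptions, not from the post.
import numpy as np

rng = np.random.default_rng(0)

dim = 1000        # dimensionality of the hypothetical value space (assumed)
steps = 10_000    # number of small value-shifting influences (assumed)
step_size = 0.01  # magnitude of each influence (assumed)

# Each step nudges the value vector in a uniformly random direction.
directions = rng.normal(size=(steps, dim))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
positions = np.cumsum(step_size * directions, axis=0)

distance = np.linalg.norm(positions, axis=1)
print(f"distance from start after {steps} steps: {distance[-1]:.2f}")
print(f"expected ~ step_size * sqrt(steps)  = {step_size * np.sqrt(steps):.2f}")
# The walk wanders ever farther from its starting point (roughly as
# sqrt(steps)) and, in high dimensions, essentially never returns close
# to where it began.
```

Run it a few times with different seeds and the qualitative picture doesn’t change: the distance from the starting point grows rather than mean-reverting, which is the sense in which unguided value drift “diverges.”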