Let’s think about what happens if you subject humans to optimization according to these pressures. What kind of agents are you likely to get out? For the sake of the thought experiment, let’s say a superintelligent and maximally altruistic human is created in a simbox to serve as an AI for a civilization of human-level-intelligent spiders.
To start, there is a massive distributional difference between the utility functions of sim-humans and spiders, especially if the other sim-humans in the training environment were also maximally altruistic. We need the sim-humans to want to build and maintain accurate models of other agents, and replacing their utility function with the distribution of other agents’ utility functions doesn’t guarantee that. Why should a sim want to improve its model of other agents’ utility instead of just using its existing, less accurate model?
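To make the incentive gap concrete, here is a minimal Python sketch (all names and numbers are hypothetical, not taken from the scenario): the agent’s objective is defined over its own model of the spiders’ utility, so even a badly wrong model produces no internal error signal pushing it toward accuracy.

```python
# Minimal sketch of the incentive gap (all values hypothetical).
# The agent optimizes expected utility under its *own model* of the
# spiders' preferences, not under the spiders' true preferences.

TRUE_SPIDER_UTILITY = {"live_prey": 0.9, "web_quality": 0.6}
AGENT_MODEL = {"live_prey": 0.1, "web_quality": 0.6}  # wrong about live prey

def expected_utility(plan, utility):
    # Expected utility of a plan under a given utility model.
    return sum(utility[k] * plan.get(k, 0.0) for k in utility)

plan = {"live_prey": 0.0, "web_quality": 1.0}  # neglects live prey entirely

print("score under agent's model:", expected_utility(plan, AGENT_MODEL))
print("score under true utility: ", expected_utility(plan, TRUE_SPIDER_UTILITY))
# The agent's objective assigns this plan a high score; the mismatch with
# the true utility generates no corrective pressure unless the agent
# separately values model accuracy.
```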
There is also the problem that optimizing for altruistic behavior probably decreases the accuracy of the sims’ models of other agents’ utility functions. The most altruistic humans in reality probably have a distorted model of the average human’s utility function, due to a combination of the typical-mind fallacy and the fact that modeling other agents as inaccurately highly altruistic likely increases one’s own altruism. If we’re mainly selecting for altruistic behavior, we’re going to get a less accurate world-model, which, when combined with the distributional shift, may leave the sim-human with an erroneous model of the spiders. Something the spiders value highly (the joy of consuming live prey) may be lost in translation, or given a lower priority than the spiders would consider appropriate. For humans training a sim-AI, this could be something we really value, like “romance”, “emotional states corresponding with reality”, or “boredom”.
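The selection effect is easy to see in a toy model (a sketch under assumed numbers, not a claim about real populations): if an agent’s altruistic behavior tracks how altruistic it believes others are, then selecting the most altruistic-behaving agents also selects for inflated, less accurate beliefs about others.

```python
import random

# Toy selection model (all numbers are assumptions for illustration).
# Each agent holds a belief about the average altruism of others; per the
# argument above, believing others are more altruistic makes the agent
# behave more altruistically itself.

TRUE_MEAN_ALTRUISM = 0.3    # assumed ground truth for the population
POPULATION = 10_000
TOP_FRACTION = 0.01         # select the top 1% most altruistic-behaving

def behaved_altruism(belief):
    # Assumed link: behavior tracks belief about others, plus noise.
    return belief + random.gauss(0, 0.05)

beliefs = [random.gauss(TRUE_MEAN_ALTRUISM, 0.1) for _ in range(POPULATION)]
ranked = sorted(beliefs, key=behaved_altruism, reverse=True)
selected = ranked[: int(POPULATION * TOP_FRACTION)]

def mean_model_error(group):
    return sum(abs(b - TRUE_MEAN_ALTRUISM) for b in group) / len(group)

print("model error, whole population:", mean_model_error(beliefs))
print("model error, selected agents: ", mean_model_error(selected))
# The selected agents systematically overestimate others' altruism:
# selecting for altruistic behavior also selected for model error.
```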
There is no a priori reason to expect the sims to want to avoid wireheading their creators. Maybe that is the coherent extrapolated volition of humanity, as deduced by a subset of sims. I don’t see any obvious way to solve this problem without giving the sims access to physical reality.
Human beings are also known to harbor internal inconsistencies that let them appear maximally altruistic to interpretability tools (other people, empathy) without actually being so. I’m not sure how this plays into your scenario, but it is worrying.
Having a wide, highly generalized alignment target is not a problem; it should be the goal. Many humans, to varying degrees, learn very generalized, abstract, large-circle empathy alignment targets, such that they generally care about animals and (hypothetically) aliens and robots. I recently saw a video of a child crying over dying leaves falling from trees.
Having a wide, robust circle of empathy does not preclude also learning more detailed models of other agents’ desires.
> To start, there is a massive distributional difference between the utility functions of sim-humans and spiders.
Given how humans can generalize empathy to any sentient agent, I don’t see this as a fundamental problem, and anyway the intelligent spider civ would be making spider-sims regardless.