So, if I’m understanding correctly: we’re talking about an inverse reinforcement learning environment, where the AI doesn’t start with a reward function, but rather performs actions, is rewarded accordingly, and develops its own utility function based on those rewards? And the environment rewards the AI in accordance with group success/utility, not just its own, so the AI learns heuristics such as “Helping other agents is good”, “Preventing other agents from coming to harm is good”, and “Definitely don’t kill other agents, that’s really bad”? If so, that’s an interesting idea.
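To check that I’m picturing the same setup, here’s a minimal toy sketch of the kind of shared-reward loop I have in mind. Everything here (the `ToyGroupEnv` name, the action set, the payoff numbers) is a hypothetical illustration of group-utility reward shaping, not any existing framework’s API or your actual proposal: each agent is rewarded with the group’s total utility rather than its own, so actions with positive externalities get reinforced.

```python
import random

class ToyGroupEnv:
    """Toy multi-agent environment: each agent picks 'help', 'ignore', or 'harm'.

    Individual payoffs differ, but the reward every agent receives is the
    group's total utility, so selfish actions that hurt others score badly.
    """
    ACTIONS = ["help", "ignore", "harm"]

    def step(self, actions):
        # Individual utility each action generates for the agent taking it.
        own = {a: {"help": 0.5, "ignore": 1.0, "harm": 2.0}[act]
               for a, act in actions.items()}
        # Externality each action imposes on every *other* agent.
        extern = {"help": +1.0, "ignore": 0.0, "harm": -3.0}
        utilities = {}
        for a in actions:
            spillover = sum(extern[actions[b]] for b in actions if b != a)
            utilities[a] = own[a] + spillover
        group_utility = sum(utilities.values())
        # Every agent is rewarded with the group's utility, not its own.
        return {a: group_utility for a in actions}

# Tabular "policy": estimated value of each action, updated from group reward.
agents = ["A", "B", "C"]
values = {a: {act: 0.0 for act in ToyGroupEnv.ACTIONS} for a in agents}
env, lr, eps = ToyGroupEnv(), 0.1, 0.2

for _ in range(5000):
    # Epsilon-greedy action choice from each agent's current value estimates.
    actions = {a: (random.choice(ToyGroupEnv.ACTIONS) if random.random() < eps
                   else max(values[a], key=values[a].get)) for a in agents}
    rewards = env.step(actions)
    for a in agents:
        # Move the value estimate toward the group reward actually received.
        values[a][actions[a]] += lr * (rewards[a] - values[a][actions[a]])

print(values)  # 'help' should end up valued highest, 'harm' lowest
```

Under these assumed payoffs, “harm” maximises an agent’s individual utility but drags group utility down, so agents trained on the shared reward should converge to valuing “help” most, which is roughly the dynamic I understood you to be describing.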
You’re totally right about human altruism, and that’s part of the problem: humans are not aligned to animals in a way I would be comfortable with if AGI were aligned to us in a similar manner. That said, you’re right that the AI training environment would be a lot better than the ancestral one for learning altruism.
I think there are definitely a lot of unanswered questions in this approach, but they’re looking a lot less like “problems with the approach itself” and a lot more like “problems that any approach to alignment has to solve in the end”, such as “How do you validate that the AI has learned what you want it to learn?” and “How will the AI generalise from simulated agents to humans?”
I am still concerned about the “trillions of copies” problem, but it doesn’t seem unsolvable in principle in the way that, say, “Create a prison for a superintelligent AGI that will hold it against its will” does.
I think this is an interesting approach, but I’m at the limits of my own still-fairly-limited knowledge. Does anyone know of:
A reason this line of research would collapse?
Some resources from people who have already been thinking about this and made progress on something similar?