Okay, so if that’s just a small component, then sure, the first issue goes away (though I still have questions about how you’re going to make this simulation realistic enough that you can just hook it up to an LLM or “something smart” and expect it to set coherent and meaningful goals in real life; but that’s more of a technical issue).
However, I believe there are still other issues with this approach. The way you describe it makes me think it’s really similar to Axelrod’s Iterated Prisoner’s Dilemma tournament, and that did produce tit-for-tat as one of the most successful strategies. But it wasn’t the only successful one: there were, for example, strategies that played mostly tit-for-tat but would defect when they could get away with it. If your training likewise mostly produces tit-for-tat, except for rare cases where the agent defects because the agents around it are too stupid to “vote it out”, do we punish it?
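To make that concrete, here’s a minimal Axelrod-style sketch (the payoffs and the 10% defection rate are my own toy numbers, not anything from the actual tournament): a “sneaky” tit-for-tat that throws in occasional defections does fine against retaliators and straightforwardly profits off anyone too forgiving to punish it.

```python
import random

# Standard prisoner's dilemma payoffs: (my move, their move) -> my score.
PAYOFFS = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def tit_for_tat(my_hist, their_hist):
    return their_hist[-1] if their_hist else "C"

def sneaky_tit_for_tat(my_hist, their_hist):
    # Mostly tit-for-tat, but defects ~10% of the time, hoping the
    # opponent is too forgiving (or too dumb) to coordinate punishment.
    if their_hist and their_hist[-1] == "D":
        return "D"
    return "D" if random.random() < 0.1 else "C"

def always_cooperate(my_hist, their_hist):
    return "C"

def play(strat_a, strat_b, rounds=200):
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        a = strat_a(hist_a, hist_b)
        b = strat_b(hist_b, hist_a)
        score_a += PAYOFFS[(a, b)]
        score_b += PAYOFFS[(b, a)]
        hist_a.append(a)
        hist_b.append(b)
    return score_a, score_b

# Against an agent that never retaliates, sneakiness is pure profit:
print(play(sneaky_tit_for_tat, always_cooperate))  # beats the honest baseline below
print(play(tit_for_tat, always_cooperate))
```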
Second, tit-for-tat is quite susceptible to noise. What if the trained agent misinterprets someone’s actions as “bad”, even though they actually did something innocent, like affectionately patting a friend on the back, which the AI read as fighting? No matter how smart the AI gets, there will still be cases where it wrongly believes someone to be a liar, and therefore believes it has every justification to “deter” them. Do we really want even a small probability of that behaviour? How about an AI that unconditionally doesn’t hurt humans, rather than one that refrains only when it believes us to be “good agents”?[1]
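Here’s a toy illustration of the noise problem (the 5% misperception rate is an arbitrary assumption for the sketch): two perfectly well-meaning tit-for-tat agents, each of which occasionally misreads the other’s move, end up spending a noticeable fraction of their time in mutual retaliation.

```python
import random

def noisy_tft(epsilon=0.05, rounds=10_000, seed=0):
    rng = random.Random(seed)
    # What each agent *believes* the other did last round.
    seen_by_a, seen_by_b = "C", "C"
    mutual_defections = 0
    for _ in range(rounds):
        a = seen_by_a  # tit-for-tat: copy the perceived last move
        b = seen_by_b
        if a == "D" and b == "D":
            mutual_defections += 1
        # Each observation is independently misread with probability epsilon
        # (the "friendly pat on the back read as an attack" case).
        seen_by_a = b if rng.random() > epsilon else ("C" if b == "D" else "D")
        seen_by_b = a if rng.random() > epsilon else ("C" if a == "D" else "D")
    return mutual_defections / rounds

print(noisy_tft(0.0))   # 0.0: without noise, cooperation is stable forever
print(noisy_tft(0.05))  # a sizeable fraction of rounds become mutual defection
```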
Last thing: how does the AI determine what is an “intelligent agent” and what is a “rock”? If there are explicit tags in the simulation for that, then how do you make sure every actual human in the real world gets tagged correctly? Also, do we count animals? Should animal abuse be enough justification for the AI to switch on rage mode? What about accidentally stepping on an ant? And if you define “intelligent agent” relative to the AI itself, then what do we do once it gets smart enough to rationally think of us as ants?
- ^
Somehow that reminds me of the “I have been a good Bing, you have not been a good user” situation.
I believe this would suffer from distributional shift in two different ways.
First, if the agents are supposed to scale up to the point where they can update their beliefs even after training, then we have a problem the moment the AI notices it can do pretty well without cooperating with humans in this new environment. If agents can update their beliefs at runtime, then basically any reinforcement-learning-style preconditioning becomes pretty much useless, I think. And if the agent can’t update its beliefs given new data, then it can’t be an AGI.
Second, even if you solve the first problem somehow, there’s a question of what exactly you mean by “cooperate” and “respect”. In the real world the choice is rarely a binary between “cooperate” and “defect”; there’s often the option of “do things that look like cooperating without actually putting much effort in” (i.e. what most politicians do) or “actively deceive everyone watching into thinking you’re nice” (you don’t have to actually be smarter than everyone else for this, just smarter than everyone who cares enough to spend their time looking at you closely). If you only give the agents a binary choice, that’s not realistic enough. If you give them many choices, but also put in an “objective” way to check whether what they chose is “cooperative enough”, then all it takes is for the agents to deceive you, the checker, rather than their smarter AI colleagues. And if there’s no objective way to check that, then we’re back to describing human values with a utility function.
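As a toy version of that last point (all the numbers here are made up for illustration): the moment there’s an “objective” checker with any imperfection, the measured-reward-maximizing choice is cheap fake cooperation, not the real, costly thing.

```python
# action -> (true value produced, effort cost, prob. the checker flags it)
ACTIONS = {
    "cooperate":      (1.0, 0.5, 0.00),
    "fake_cooperate": (0.0, 0.1, 0.10),  # looks cooperative, fools the checker 90% of the time
    "defect":         (0.0, 0.0, 0.95),
}

BONUS, PENALTY = 1.0, 2.0  # reward for passing the check / for getting flagged

def measured_reward(action):
    value, cost, p_caught = ACTIONS[action]
    # The agent's objective only sees whether it passes the check;
    # the true value it produces never enters the calculation.
    return (1 - p_caught) * BONUS - cost - p_caught * PENALTY

print(max(ACTIONS, key=measured_reward))  # -> fake_cooperate
```

The agent ends up optimizing against the checker rather than toward actual cooperation, which is the whole worry.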