First of all, I do agree with your premises and the stated values, except for the assumption that a “technically superior alien race” would be safe to create. If such an AI had values of its own other than helping humans/other AIs/whatever, then I’m not sure how I feel about it balancing its own internal rewards (like getting an enormous amount of virtual food that in its training environment could save billions of non-copyable agents) against real-world goals (like saving 1 actual human). We certainly want a powerful AI to be our ally rather than something we try to contain against its will, but I don’t think we should take the “autonomous being” analogy so far that we let it chase its own unrelated goals.
Now about the technique itself. This is much better than the previous post. It is still very much an unrealistic game (which I presume is obvious): you can’t just take an agent from the game and put it into a real-life robot or something, there’s no “food” to be “gathered” here, and blah-blah-blah. It is still an interesting experiment in trying to replicate human-like values in an RL agent, however, and I will treat it as such. I believe the most important part of your rules is this:
when a non-copyable agent dies, all agents get negative rewards
The result of this depends on the exact amount. If the negative reward is huge, then obviously agents will just protect the non-copyable ones and will never try to kill them. If some rogue agent tries, it will be stopped (possibly killed) before it can. This is the basic value I would like AIs to have. The problem is that we can’t specify well enough what counts as “harm” in the real world: even if an AI won’t literally kill us, it can still do lots of horrible things. However, if we could specify that, the rest of the rules would be redundant. For example, “if something harms humans, then that thing should be stopped” doesn’t have to be an additional rule. If cooperating with some entity leads to less harm for humans, then that’s good, and there’s no need for an additional rule. Just “minimize harm to humans” suffices.
If the negative reward is significant, but an individual AI can still get a positive total by killing a non-copyable agent (or stealing their food or whatever), then we have a Prisoner’s Dilemma situation. Presumably, if the other agents are aligned, they will try to stop this rogue AI, or at least punish it so that it won’t do that again. That will work well as long as the agent community is effective at detecting and punishing these rogue AIs (a judicial system?). If the agent community is inefficient, then an individual agent can gain reward by doing a bad action, and it will do so whenever it thinks it can evade the punishment of others. “Behave altruistically unless you can find an opportunity to gain utility” is not that much more difficult than just “Behave altruistically always”; if we expect AIs to be somewhat smart, I think we should expect them to know that deception is an option.
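To make the two cases concrete, here is a minimal sketch of the rogue's expected reward. The formula and all the parameter names are my own assumptions for illustration, not something taken from the post's rules.

```python
def defection_payoff(gain, shared_penalty, p_caught, punishment):
    """Expected reward for an agent that kills a non-copyable agent.

    All parameters are hypothetical:
    gain           -- what the rogue gets out of the act (e.g. stolen food)
    shared_penalty -- negative reward every agent receives for the death
    p_caught       -- chance that the other agents detect and punish the rogue
    punishment     -- extra reward the rogue loses if it is caught
    """
    return gain - shared_penalty - p_caught * punishment


def defection_pays(gain, shared_penalty, p_caught, punishment):
    """True if a purely reward-maximizing agent would go rogue."""
    return defection_payoff(gain, shared_penalty, p_caught, punishment) > 0


# Huge shared penalty: never worth it, even with no policing at all.
huge_penalty = defection_pays(gain=10, shared_penalty=1000, p_caught=0.0, punishment=0)

# Moderate penalty, effective community: policing makes it a losing move.
well_policed = defection_pays(gain=10, shared_penalty=4, p_caught=0.9, punishment=20)

# Moderate penalty, inefficient community: going rogue has positive expected reward.
badly_policed = defection_pays(gain=10, shared_penalty=4, p_caught=0.1, punishment=20)
```

With these made-up numbers only the last case pays off, and in that case the only thing holding the agent back is its own estimate of `p_caught`.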
For the agent community to be efficient at detecting rogues, it has to be under constant optimization pressure for that, i.e. a constant threat of such rogues (otherwise that ability gets optimized away). What we get then is an equilibrium with some stable number of rogues, some of whom get caught and punished while others don’t and get a positive reward, and a regular AI community that does the punishing. The equilibrium would occur because optimizing deception skills and deception detection skills both require resources, and after some point that becomes an inefficient use of resources for both sides. Thus I believe this system would stabilize and proceed like that indefinitely. What we can learn from that, I don’t know, but it does reflect the situation we have in human civilization (some stable number of criminals, and some stable number of “good people” who try their best to prevent crime from happening, but for whom trying even harder is too expensive).
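One way to see why neither side gets optimized away is the textbook two-player inspection game, which I am swapping in here as a stand-in; it is not the game from the post, and the payoff names below are assumptions. The rogue gets 0 for complying, `gain` for an unseen violation and `-punishment` when caught; the community pays `inspect_cost` whenever it inspects and `harm` for every violation it misses.

```python
def inspection_equilibrium(gain, punishment, inspect_cost, harm):
    """Mixed-strategy equilibrium of a simple inspection game.

    Returns (p_violate, p_inspect): how often the rogue violates and how
    often the community inspects when neither side can do better by
    changing its own rate.
    """
    if gain <= 0 or punishment <= 0:
        raise ValueError("gain and punishment must be positive")
    if not 0 < inspect_cost < harm:
        raise ValueError("need 0 < inspect_cost < harm, otherwise inspecting never pays")
    # Community is indifferent when -inspect_cost == -p_violate * harm.
    p_violate = inspect_cost / harm
    # Rogue is indifferent when (1 - p_inspect) * gain - p_inspect * punishment == 0.
    p_inspect = gain / (gain + punishment)
    return p_violate, p_inspect


# Made-up numbers: violating gains 1, being caught costs 3,
# inspecting costs 1, a missed violation costs the community 4.
p_violate, p_inspect = inspection_equilibrium(gain=1.0, punishment=3.0,
                                              inspect_cost=1.0, harm=4.0)
```

In this toy model there is no stable state with zero rogues: if nobody violates, inspecting is wasted cost, so inspection drops, and then violating pays again. The equilibrium rogue rate depends only on the community's own numbers (`inspect_cost / harm`), which fits the "trying even harder is too expensive" picture.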
I think the application to the Hero With A Thousand Chances is partly incorrect because of a technicality. Consider the following hypothesis: there is a huge number of “parallel worlds” (not Everett branches, just thinking of different planets very far away is enough), each fighting the Dust. For every fight, each of those worlds summons a randomly selected hero. Today that hero happened to be you. The world that happened to summon you has survived the encounter with the Dust 1079 times before you. The world next to it has already survived 2181 times, and another one was destroyed during its 129th attempt.
This hypothesis explains the hero’s observation pretty well: you can’t get summoned to a world that has been destroyed or has successfully eliminated the Dust, so of course you get summoned to a world that is still fighting. As for the 1079 attempts before you, you can’t consider them a lot of evidence that fighting the Dust is easy. Maybe you’re just the 1080th entry in their database and can only be summoned for the 1080th attempt, so there was no way for you to observe anything else. Under this hypothesis, you personally still have a pretty high chance of dying: there’s no angel helping you, and that specific world really did get very lucky, as did lots of other worlds.
So the anthropic angel/“we’re in a fanfic” hypothesis explains the observations just as well as this “many worlds, and you’re 1080th in their database” hypothesis, which means both get updated by the same amount, and at least for me the “many worlds” hypothesis has a much higher prior than the “I’m a fanfic character” hypothesis.
Note that this only holds from the hero’s perspective: from Aerhien’s POV, she really has observed herself survive 1079 times, which counts as a lot of evidence for the anthropic angel.
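The difference between the two perspectives can be sketched as a Bayesian update in log-odds form. The prior and the per-fight survival chance below are numbers I made up for illustration; only the 1079 comes from the story. Logs are used because a probability like 0.5 raised to the 1079th power underflows an ordinary float.

```python
import math


def log10_posterior_odds(log10_prior_odds, log10_lik_angel, log10_lik_no_angel):
    """Posterior odds of 'angel' vs 'no angel, many worlds', in log10 units."""
    return log10_prior_odds + (log10_lik_angel - log10_lik_no_angel)


LOG10_PRIOR_ODDS = -6.0    # assumed prior: angel is a million times less likely
SURVIVAL_PER_FIGHT = 0.5   # assumed chance of a world surviving one fight unaided
FIGHTS = 1079

# Hero: a summoned hero always lands in a still-fighting world, so both
# hypotheses predict the observation with probability ~1 and nothing moves.
hero = log10_posterior_odds(LOG10_PRIOR_ODDS, 0.0, 0.0)

# Aerhien: she watched one fixed world survive every fight, which without
# an angel has probability SURVIVAL_PER_FIGHT ** FIGHTS.
aerhien = log10_posterior_odds(
    LOG10_PRIOR_ODDS, 0.0, FIGHTS * math.log10(SURVIVAL_PER_FIGHT)
)
```

Under these assumptions the hero’s odds stay exactly at the prior, while Aerhien’s shift by hundreds of orders of magnitude toward the angel.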