Okay, maybe I’m moving the bar. Hopefully not, and hopefully this thread is helpful...
Your counter-example, your simulation, would prove that examples of aligned systems, at a high level, are possible. Alignment at some level is possible, of course. Functioning thermostats are aligned.
What I’m trying to propose is the search for a proof that a guarantee of alignment—all the way up—is mathematically impossible. We could then make the statement: “If we proceed down this path, no one will ever be able to guarantee that humans remain in control.” I’m proposing we see if we can prove that Stuart Russell’s “provably beneficial” does not exist.
If a guarantee is proved to be impossible, I am contending that the public conversation changes.
Maybe many people, especially on LessWrong, take this as a given. Their internal belief that there is no guarantee all the way up is close enough to a proof.
I think a proof that there is no guarantee would be important news for the wider world...the world that has to move if there is to be regulation.
Sorry, could you elaborate on what you mean by “all the way up”?
“All the way up” meaning at increasing levels of intelligence…your 10,000 becomes 100,000X, etc.
At some level of performance, a moral person faces new temptations because of increased capabilities and greater power for damage, right?
In other words, your simulation may fail to be aligned at 20,000...30,000...