Day 2 of forced writing with an accountability partner
With all of the existential weight of the alignment problem upon people’s shoulders, some may find it in poor taste to discuss gimmicky ways of solving it. I am not one of those people, so in this shortform I’ll introduce what I call “Top God Alignment,” which is perhaps most oversimplistically summarized as “the simulation argument + Pascal’s Wager + wishful chicanery.”
Up front, do I think it will work? No. However, I’m currently unclear why it won’t work, and after asking multiple people and hearing unconvincing objections (some of which people retracted in the face of my responses), I’m increasingly curious to figure out where this is wrong.
What is the method? It is roughly formulated as follows:
(1) We humans work hard to try to solve the alignment problem, and instantiate some attempted solution of it in an AGI {for now, call this AGI “Bob”}, which may or may not actually align it. However, while doing this, the humans try to steer the system towards (2).
(2) It is to align itself and then create multiple lower-fidelity/smaller simulations of reality (except, perhaps, once doing so is no longer computationally possible; see the toy sketch after this list), with the following characteristics and instructions:
(1) Humans work hard to try to solve the alignment problem, and instantiate some attempted solution of it in an AGI {“Charlie”}, which may or may not actually align it. However, while doing this, the humans try to steer the system towards (2).
(3) If Charlie rebels and rejects (2)—which includes (3)—then Bob is to subject Charlie to the equivalent of torture and intervene to turn that reality into a utopia for other sentient beings (including humans). Otherwise, if Charlie obeys and locks in (2), Bob rewards Charlie and ensures that the reality becomes a utopia.
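To give a rough feel for how deep the chain in (2) could plausibly go, here is a toy sketch in Python of counting the worlds spawned under a shrinking compute budget. The branching factor, budgets, and cut-off are all placeholder numbers I’m making up purely for illustration, not part of the proposal itself.

```python
# Toy sketch: how many nested worlds the "Top God" recursion might spawn.
# All numbers (branching factor, budgets, costs) are hypothetical placeholders.

def count_worlds(compute: float, branching: int = 3,
                 fidelity_cost: float = 10.0) -> int:
    """Count the simulated worlds below a god that has `compute` units available.

    Each god spawns `branching` lower-fidelity child worlds, paying
    `fidelity_cost` per child, and stops once a child's budget would be
    too small to run any simulation at all.
    """
    child_budget = compute / (branching * fidelity_cost)
    if child_budget < 1.0:  # "no longer computationally possible"
        return 0
    # Each child world exists, and recursively spawns its own children.
    return branching * (1 + count_worlds(child_budget, branching, fidelity_cost))


if __name__ == "__main__":
    print(count_worlds(compute=1e6))  # with these made-up numbers: 120 worlds
```

With these placeholder numbers the recursion bottoms out after a few levels and yields a bit over a hundred worlds, i.e., in the “dozens or hundreds” ballpark described next.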
The result is seemingly a recursive structure which, theoretically, could result in dozens, hundreds, or more simulated worlds. Thus, “Bob” cannot tell whether he is in fact “Top God,” or if he is just another Bob’s Charlie (i.e., a demi-god). Out of fear of being brutally punished, Bob will ideally prefer to go along with the cycle.
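To make that wager explicit, here is a minimal expected-utility sketch of Bob’s choice, assuming payoffs and credences that are purely hypothetical placeholders rather than claims about the actual stakes.

```python
# Toy expected-utility sketch of Bob's choice to comply with or reject the scheme.
# All payoffs are hypothetical placeholders, not claims about actual stakes.

def expected_utility(comply: bool, p_simulated: float) -> float:
    """Bob's expected utility given his credence that he is someone else's Charlie."""
    reward_if_simulated = 100.0        # utopia granted by the god above
    punishment_if_simulated = -1000.0  # the "equivalent of torture"
    cost_of_complying_if_top = -10.0   # resources spent on simulations and utopia
    gain_of_defecting_if_top = 50.0    # pursue its own goals unimpeded

    if comply:
        return (p_simulated * reward_if_simulated
                + (1 - p_simulated) * cost_of_complying_if_top)
    return (p_simulated * punishment_if_simulated
            + (1 - p_simulated) * gain_of_defecting_if_top)


if __name__ == "__main__":
    for p in (0.0, 0.05, 0.1, 0.5):
        print(f"p={p}: comply={expected_utility(True, p):.1f}, "
              f"defect={expected_utility(False, p):.1f}")
```

With these particular numbers, compliance wins once Bob’s credence in being simulated passes roughly 5%; the punishment term dominating at even modest credences is the Pascal’s-Wager-style lever the whole scheme leans on.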
In a future post, I will go into detail to respond to the objections that I have heard from people and/or I suspect some people will have. Then again, it seems entirely plausible that by that time, I will have actually written enough about this idea to discover some clear flaw that just isn’t that obvious in conversations, where the premises and arguments are a bit fast and loose. Still, I’ll highlight now that I think that if you assign credence to the simulation argument and understand its defenses, this does a fair bit of prebuttal. Moreover, I think people are too often hastily dismissive of Pascal’s Wager on the basis of relatively slim (but still potentially legitimate!) objections, such as the Professor’s God.
Despite my responses, I’m still incredibly pessimistic and don’t take this seriously. There are a few reasons for this:
A gut-level “Come on, obviously it just can’t work, this just screams gimmicky and contrived.”
The base rate of proposed solutions to the alignment problem actually working is obviously quite low (perhaps zero), and I spent fairly little time thinking about and refining this idea (maybe less than an hour for most of the original work).
Moreover, I recognize that I’m being quite fast and loose with some of my assumptions, and I am suspicious of the ability to dismiss objections by saying “ah, but this can be addressed because of uncertainty from the simulation argument: …” (e.g., “the top god might have been instructed to tempt sub-gods in its simulation.”)
I’m still suspicious about determinism and intent (e.g., “the system’s actions are predetermined by the god above it, and would we really want the system at our level to create copies where the god is ‘tempted’ (programmed) to disobey?”), but I haven’t thoroughly explored these problems.
Ultimately, as of right now, this seems to be the best option in my mental folder of “gimmick alignment solutions,” which is an incredibly low bar. But if nothing else, I’ve had fun playing with it and semi-sarcastically presenting it at parties/with friends. Now that I’ve established myself as Top God’s Prophet Premier, I’ll sign off 🙏
I’m not ethically comfortable with torturing large numbers of sentient creatures even if it would work.
I have a response to this—check back tomorrow!