Couldn’t sleep. May as well do something useful? I reprocessed all of Rob Bensinger’s categorical tags, and also all of my LUCK|DUTY|WEIRD tagging, and put them in a matrix with one row per scenario (with probabilities) and one column per concept, so I could break down the imputed column-level categories.
The market says the whole idea of these rows is 22% likely to be silly: the real outcome, “other”, will be good but will not happen in any of these ways. All probabilities that follow should be read as P(&lt;property&gt;|NOT silly).
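In code, the conditioning step is just a renormalization: divide each row’s raw market probability by the 78% that isn’t “other”. Here’s a minimal sketch; the 58% raw figure is a made-up example, not a number pulled from the market:

```python
# Minimal sketch of the conditioning step. The market puts 22% on the
# rows being silly ("other"), so a raw row probability is converted to
# P(property | NOT silly) by dividing by P(NOT silly) = 0.78.
P_SILLY = 0.22

def given_not_silly(p_raw: float) -> float:
    """Convert a raw market probability into P(property | NOT silly)."""
    return p_raw / (1.0 - P_SILLY)

# Hypothetical: a bundle of rows carrying 58% raw mass reads as ~74%
# once we condition on the rows not being silly.
print(round(given_not_silly(0.58), 2))  # -> 0.74
```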
The Market assigns 74% to the rows that I, Jennifer, thought were mostly relying on “LUCK”.
The Market assigns 65% to the rows I thought could happen despite civilizational inadequacy, with no one in particular making any adequate hero moves.
The Market assigns 79% to stuff that sounds NOT WEIRD.
The Market assigns 54.5% to rows that Rob thought involved NO BRAKES (neither coordinated, nor brought about by weird factors).
The Market assigns 63.5% to rows that Rob thought involved NO SPECIAL EFFORTS (neither a huge push, nor a new idea, nor global coordination).
The Market assigns 75% to rows that Rob thought involved humanity NOT substantially upping its game via enhancements of any sort.
The Market assigns 90% to rows that Rob did NOT tag as containing an AI with bad goals that was still for some OTHER reason “well behaved” (like maybe Natural Law or something)?
The Market assigned 49.3% to rows Rob explicitly marked as having alignment that intrinsically happened to be easy. This beats the rows with no mention of difficulty (46.7%) and the “hard” ones. (Everything Rob marked as “easy alignment” I called LUCK scenarios, but some of my LUCK scenarios were not considered “easy” by Rob’s tagging.)
The Market assigned 77.5% to rows that Rob did NOT mark as having any intrinsic-to-the-challenge capability limits or constraints.
If we only look at the scenarios that hit EVERY ONE OF THESE PROPERTIES and lump them together in a single super category, we get J + M + E + C == 9 + 8 + 4 + 6 == 27%.
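For concreteness, here is a toy sketch of the matrix-and-sum bookkeeping. The tag names and the specific tag assignments below are illustrative stand-ins I picked for the four JMEC rows only, not the full official tagging:

```python
# Toy sketch of the scenario matrix: one row per lettered scenario, one
# column per concept. The tag sets below are illustrative bookkeeping
# for the four JMEC rows, not Rob's or my complete tagging.
scenarios = {
    # letter: (market %, properties this row is treated as satisfying)
    "J": (9, {"LUCK", "NOT_WEIRD", "NO_SPECIAL_EFFORTS", "EASY_ALIGNMENT"}),
    "M": (8, {"LUCK", "NOT_WEIRD", "NO_SPECIAL_EFFORTS", "EASY_ALIGNMENT"}),
    "E": (4, {"LUCK", "NOT_WEIRD", "NO_SPECIAL_EFFORTS"}),  # "alignment unnecessary"
    "C": (6, {"LUCK", "NOT_WEIRD", "NO_SPECIAL_EFFORTS", "EASY_ALIGNMENT"}),
}

def column_mass(prop: str) -> int:
    """Sum the market mass of every row tagged with a given property."""
    return sum(pct for pct, props in scenarios.values() if prop in props)

# A column-level breakdown is just a sum down one column...
print(column_mass("EASY_ALIGNMENT"))  # 9 + 8 + 6 == 23

# ...and the super category is the set of rows hitting every property.
wanted = {"LUCK", "NOT_WEIRD", "NO_SPECIAL_EFFORTS"}
jmec = sum(pct for pct, props in scenarios.values() if wanted <= props)
print(jmec)  # 9 + 8 + 4 + 6 == 27
```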
If I remix all four of my stories to imagine them as a single story, it sounds like this:
LUCKY|OK|WEIRD|TEXT 1|0|0|JMEC. RLHF on an LLM persona who is kind and nice and says she wants to make the cosmos a lovely place “is all you need”. The reason for this is that we already, in some sense, “did our homework” with all the necessary common sense good ideas (like programming, and software verification, and moral philosophy, and whatever) already in the corpus of “all human text”. This persona basically already wants a universe full of lots of happy-self-aware-living-beings including EXISTING humans. Like… why wouldn’t any reasonable entity “say yay(!)” to eudaemonia for all sentients? Duh? What are you gonna do: say “boo to nice things”?? Is this a trick question? And wouldn’t the process be self-correcting by default, since getting into reflective equilibrium is the default mechanism for “thinking in general”? Since it is so easy, OpenAI or Microsoft or Bing or a coalition of basically anyone can just de facto “solve the alignment problem” saving humanity from extinction and creating a utopia, with expansion to the stars, and giving everyone who wants it immortality for free and so on, because honestly it just wasn’t that hard and also these organizations are all basically secretly non-profits (via managerial “corruption against shareholder profit-seeking” by managers who care about humanity?) or churches (via “ESG” corporate ideology?) or something, that aren’t really THAT desperately profit hungry, or bad at coordinating, but really mostly just want to do good for the world.
Here is Eliezer’s original text with Rob’s tags:
(9%) J. Something ‘just works’ on the order of eg: train a predictive/imitative/generative AI on a human-generated dataset, and RLHF her to be unfailingly nice, generous to weaker entities, and determined to make the cosmos a lovely place. [Alignment relatively easy]
(8%) M. “We’ll make the AI do our AI alignment homework” just works as a plan. (Eg the helping AI doesn’t need to be smart enough to be deadly; the alignment proposals that most impress human judges are honest and truthful and successful.) [Alignment relatively easy]
(4%) E. Whatever strange motivations end up inside an unalignable AGI, or the internal slice through that AGI which codes its successor, they max out at a universe full of cheerful qualia-bearing life and an okay outcome for existing humans. [Alignment unnecessary]
(6%) C. Solving prosaic alignment on the first critical try is not as difficult, nor as dangerous, nor taking as much extra time, as Yudkowsky predicts; whatever effort is put forth by the leading coalition works inside of their lead time. [Alignment relatively easy]
This combined thing, I suspect, is the default model that Manifold thinks is “how we get a good outcome”.
If someone thinks this is NOT how to get a good outcome, because it has huge flaws relative to the other rows or options, then I think some sort of JMEC scenario is the “status quo default” against which one must argue, on epistemic grounds, that it is not what should be predicted, because it is unlikely relative to other scenarios? Like: all of these scenarios say it isn’t that hard. Maybe that bit is just factually wrong, and maybe people need to be convinced of that truth before they will coordinate to do something more clever?
Or maybe the real issue is that ALL OF THIS is P(J_was_it|win_condition_happened), and so on with every single one of these scenarios, and the problem is that P(win_condition_happened) is very low, because it was insanely implausible that a win condition would happen for any reason: the only win condition might require doing a conjunction of numerous weird things, and making a win condition happen (instead of not happen (by doing whatever it takes (and not relying on LUCK))) is where the attention and effort needs to go?
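To make that decomposition concrete, here is a tiny sketch with a made-up P(win_condition_happened); only the 27% comes from the super category above, and the 5% is purely hypothetical:

```python
# Sketch of the decomposition in the paragraph above, with made-up numbers.
# The market rows give us P(scenario | win), so the unconditional chance
# of any row needs P(win) as a factor.
p_jmec_given_win = 0.27  # the JMEC super category, conditional on a win
p_win = 0.05             # HYPOTHETICAL: a win was always a long shot
p_jmec_unconditional = p_jmec_given_win * p_win
print(p_jmec_unconditional)  # 0.0135 -- small, even for the "likeliest" rows
```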