Okay, that makes much more sense. I initially read the diagram as saying that just lines 1 and 2 were in the box.
If that’s how it works, it doesn’t make for a simplified cartoon guide: readers will notice missing steps or circular premises, and they’d have to first walk through Löb’s Theorem in order to follow this “simplified” proof of Löb’s Theorem.
Forgive me if this is a dumb question, but if you don’t use assumption 3: ◻(◻C → C) inside steps 1–2, wouldn’t the hypothetical method prove 2: ◻C for any C?
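For anyone who wants the missing steps spelled out, here is the standard textbook derivation of Löb’s Theorem (my own numbering, not the cartoon’s), using the Hilbert–Bernays–Löb derivability conditions. The hypothesis ⊢ ◻C → C is used exactly once, at step 6; without it, the argument only gets as far as ⊢ ◻ψ → ◻C, which holds for any C and proves nothing about C itself.

```latex
\begin{align*}
&\text{Diagonal lemma: choose } \psi \text{ with } \vdash \psi \leftrightarrow (\Box\psi \to C)\\
&1.\ \vdash \Box(\psi \to (\Box\psi \to C)) &&\text{necessitation of the diagonal}\\
&2.\ \vdash \Box\psi \to \Box(\Box\psi \to C) &&\text{distribution on 1}\\
&3.\ \vdash \Box\psi \to (\Box\Box\psi \to \Box C) &&\text{distribution on 2}\\
&4.\ \vdash \Box\psi \to \Box\Box\psi &&\text{internal necessitation}\\
&5.\ \vdash \Box\psi \to \Box C &&\text{from 3, 4}\\
&6.\ \vdash \Box C \to C &&\text{the hypothesis}\\
&7.\ \vdash \Box\psi \to C &&\text{from 5, 6}\\
&8.\ \vdash \psi &&\text{from 7 and the diagonal lemma}\\
&9.\ \vdash \Box\psi &&\text{necessitation of 8}\\
&10.\ \vdash C &&\text{from 7, 9}
\end{align*}
```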
We maybe need an introduction to all the advance work done on nanotechnology for everyone who didn’t grow up reading “Engines of Creation” as a twelve-year-old or “Nanosystems” as a twenty-year-old. We basically know it’s possible; you can look at current biosystems and look at physics and do advance design work and get some pretty darned high confidence that you can make things with covalent-bonded molecules, instead of van-der-Waals folded proteins, that are to bacteria as airplanes to birds.
For what it’s worth, I’m pretty sure the original author of this particular post happens to agree with me about this.
I strongly disagree with this take. (Link goes to a post of mine on the Effective Altruism Forum.) The main point is that being paid for services is not being “helped”; it’s not much different from being a plumber who worked on the FTX building.
(Note: TekhneMakre responded correctly / endorsedly-by-me in this reply and in all replies below as of when I post this comment.)
So I think that building nanotech good enough to flip the tables—which, I think, if you do the most alignable pivotal task, involves a simpler and less fraught task than “disassemble all GPUs”, which I choose not to name explicitly—is an engineering challenge where you get better survival chances (albeit still not good chances) by building one attemptedly-corrigible AGI that only thinks about nanotech and the single application of that nanotech, and is not supposed to think about AGI design, or the programmers, or other minds at all. So far as the best-attempt doomed system design goes, an imperfect-transparency alarm should have been designed to go off if your nanotech AGI is thinking about minds at all, human or AI, because it is supposed to just be thinking about nanotech.

My guess is that you are much safer—albeit still doomed—if you try to do it the just-nanotech way, rather than constructing a system of AIs meant to spy on each other and sniff out each other’s deceptions; because, even leaving aside issues of their cooperation if they get generally-smart enough to cooperate, those AIs are thinking about AIs and thinking about other minds and thinking adversarially and thinking about deception.

We would like to build an AI which does not start with any crystallized intelligence about these topics, attached to an alarm that goes off and tells us our foundational security assumptions have catastrophically failed and this course of research needs to be shut down if the AI starts to use fluid general intelligence to reason about those topics. (Not shut down the particular train of thought and keep going; then you just die as soon as the 20th such train of thought escapes detection.)
I’d consider this quite unlikely. Epstein, weakened and behind bars, was very very far from the most then-powerful person with an interest in Epstein’s death. Could the guards even have turned off the cameras? Consider the added difficulties in successfully bribing somebody from inside a prison cell that you’re never getting out of—what’d he give them, crypto keys? Why wouldn’t they just take the money and fail to deliver?
I don’t think you’re going to see a formal proof, here; of course there exists some possible set of 20 superintelligences where one will defect against the others (though having that accomplish anything constructive for humanity is a whole different set of problems). It’s also true that there exists some possible set of 20 superintelligences all of which implement CEV and are cooperating with each other and with humanity, and some single superintelligence that implements CEV, and a possible superintelligence that firmly believes 222+222=555 without this leading to other consequences that would make it incoherent. Mind space is very wide, and just about everything that isn’t incoherent to imagine should exist as an actual possibility somewhere inside it. What we can access inside the subspace that looks like “giant inscrutable matrices trained by gradient descent”, before the world ends, is a harsher question.
I could definitely buy that you could get some relatively cognitively weak AGI systems, produced by gradient descent on giant inscrutable matrices, to be in a state of noncooperation. The question then becomes, as always, what it is you plan to do with these weak AGI systems that will flip the tables strongly enough to prevent the world from being destroyed by stronger AGI systems six months later.
If you could literally make exactly two AIs whose utility functions were exact opposites, then at least one might have an incentive to defect against the other. This is treading rather dangerous ground, but seems relatively moot since it requires far more mastery of utility functions than anything you can get out of the “giant inscrutable matrices” paradigm.
Their mutual cooperation with each other, but not with humans, isn’t based on their utility functions having any particular similarity—so long as their utility functions aren’t negatives of each other (or equally exotic in some other way), they have gains to be harvested from cooperation. They cooperate with each other but not you because they can do a spread of possibilities on each other, modeling probable internal thought processes of each other; and you can’t model a spread of possibilities on them adequately well, which is a requirement for joining an LDT coalition. (If you had that kind of knowledge / logical sight on them, you wouldn’t need any elaborate arrangements of multiple AIs, because you could negotiate with a single AI; better yet, just build an AI such that you knew it would cooperate with you.)
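A toy numerical illustration of why exact negation is the special case (entirely my own construction, with made-up payoffs): if one agent’s utility function is the exact negative of the other’s, joint utility is constant across all outcomes, so there is no surplus for cooperation to harvest; any generic non-negated pair leaves some.

```python
# Toy sketch (my own illustration): with exactly opposed utility
# functions, every outcome sums to zero, so cooperation has no surplus
# to harvest; generic non-opposed utilities do have one.

outcomes = ["both_cooperate", "both_defect", "A_exploits_B"]
u_A = {"both_cooperate": 3, "both_defect": 1, "A_exploits_B": 5}
u_B = {k: -v for k, v in u_A.items()}   # exact negation: zero-sum

joint_AB = {k: u_A[k] + u_B[k] for k in outcomes}
print(joint_AB)  # every joint payoff is 0: no gains from trade

# Contrast: a generic agent C whose utility isn't A's exact negative.
u_C = {"both_cooperate": 2, "both_defect": 0, "A_exploits_B": -1}
joint_AC = {k: u_A[k] + u_C[k] for k in outcomes}
print(max(joint_AC, key=joint_AC.get))  # "both_cooperate" maximizes joint payoff
```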
Just to restate the standard argument against:
If you’ve got 20 entities much much smarter than you, and they can all get a better outcome by cooperating with each other than they could if they all defected against each other, there is a certain hubris in imagining that you can get them to defect. They don’t want your own preferred outcome. Perhaps they will think of some strategy you did not, being much smarter than you, etc etc.
(Or, I mean, actually the strategy is “mutually cooperate”? Simulate a spread of the other possible entities, conditionally cooperate if their expected degree of cooperation goes over a certain threshold? Yes yes, more complicated in practice, but we don’t even, really, get to say that we were blindsided here. The mysterious incredibly clever strategy is just all 20 superintelligences deciding to do something else which isn’t mutual defection, despite the hopeful human saying, “But I set you up with circumstances that I thought would make you not decide that! How could you? Why? How could you just get a better outcome for yourselves like this?”)
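The “simulate a spread, conditionally cooperate over a threshold” strategy can be sketched as a toy model (my own illustration, with made-up names and numbers; real LDT-style reasoning is vastly more complicated):

```python
# Toy sketch (my construction): each agent estimates how cooperative the
# others are by simulating a spread of possible versions of them, then
# cooperates iff that estimate clears a threshold.

def expected_cooperation(spread):
    """Average cooperation probability over simulated variants of the others."""
    return sum(spread) / len(spread)

def decide(spread, threshold=0.9):
    """Conditionally cooperate: 'C' iff the simulated spread of the other
    agents is expected to cooperate often enough."""
    return "C" if expected_cooperation(spread) >= threshold else "D"

# If every agent models the others as running this same policy, the
# all-'C' fixed point beats the all-'D' fixed point for each of them.
coalition = [decide([0.95, 0.92, 0.97]) for _ in range(20)]
print(coalition.count("C"))  # prints 20: all 20 cooperate
```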
I continue to be puzzled at how most people seem to completely miss, and not discuss, the extremely obvious-to-me literal assisted-suicide hypothesis: he made an attempt at suicide, it got blocked; this successfully signaled to some very powerful and worried people that Epstein would totally commit suicide if given a chance, so they gave him a chance.
The cogitation here is implicitly hypothesizing an AI that’s explicitly considering the data and trying to compress it, having been successfully anchored on that data’s compression as identifying an ideal utility function. You’re welcome to think of the preferences as a static object shaped by previous unreflective gradient descent; it sure wouldn’t arrive at any better answers that way, and would also of course want to avoid further gradient descent happening to its current preferences.
Obvious crackpot; says on Twitter that there’s a $1 billion prize for “breaking” BNSL funded by Derek Parfit’s family office. I’d cut him more slack for potentially being obviously joking if it weren’t surrounded by claims that also sounded like crackpottery to me. https://twitter.com/ethanCaballero/status/1587502829580820481
I feel pretty confused by this. A superintelligence will know what we intended, probably better than we do ourselves. So unless this paragraph is intended in a particularly metaphorical way, it seems straightforwardly wrong.
By “were the humans pointing me towards...” Nate is not asking “did the humans intend to point me towards...” but rather “did the humans actually point me towards...” That is, we’re assuming some classifier or learning function that acts upon the data actually input, rather than a successful actual fully-aligned works-in-real-life DWIM which arrives at the correct answer given wrong data.
I’m particularly impressed by “The Floating Droid”. This can be seen as early-manifesting the foreseeable difficulty where:

1. At kiddie levels, a nascent AGI is not smart enough to model humans and compress its human feedback by the hypothesis “It’s what a human rates”, and so has object-level hypotheses about environmental features that directly cause good or bad ratings;
2. When smarter, an AGI forms the psychological hypothesis over its ratings, because that more sophisticated hypothesis is now available to its smarter self as a better way to compress the same data;
3. Then, being smart, the AGI goodharts a new option that pries apart the ‘spurious’ regularity (human psychology, what fools humans) from the ‘intended’ regularity the humans were trying to gesture at (what we think of as actually good or bad outcomes).
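That three-step progression can be made concrete with a toy model (my own construction, not from the post): on the training distribution the object-level hypothesis and the psychological hypothesis compress the ratings equally well, and only a new off-distribution option tells them apart.

```python
# Toy sketch (my own construction): ratings come from the rater's
# perception, which coincides with true quality on the training set,
# so two hypotheses fit the data equally well until a new option
# pries the regularities apart.

def human_rating(looks_good, is_good):
    return looks_good            # the rater can only judge appearance

train = [(q, q) for q in (True, False, True, True, False)]  # looks == is

def hyp_object_level(looks_good, is_good):   # kiddie-level hypothesis
    return is_good

def hyp_psychological(looks_good, is_good):  # smarter hypothesis
    return looks_good

# Indistinguishable on-distribution: both predict every training rating.
assert all(hyp_object_level(l, q) == human_rating(l, q) == hyp_psychological(l, q)
           for l, q in train)

# Goodharting: select a new option that looks good but is actually bad.
exploit = (True, False)          # (looks_good, is_good)
print(human_rating(*exploit))    # prints True: rated highly anyway
```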
Here’s one: https://discord.gg/45fkqBZuTB
List of Lethalities isn’t telling you “There’s a small chance of this.” It’s saying, “This will kill us. We’re all walking dead. I’m sorry.”