Well, huh. I wonder if that makes it time to go look at RadVac some more.
I agree that this is a correct application of security mindset; exposures like these can compound with, for example, someone’s automatic search of the 100 most common ways to screw up secure random number generation such as by using the current time as a seed. Deep security is about reducing the amount of thinking you have to do and your exposure to wrong models and stuff you didn’t think of.
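To make the time-as-seed example concrete, here is a minimal sketch of why that mistake is so easy to search for automatically. Everything in it is an illustrative assumption rather than a description of any real system: the function names `weak_token`, `recover_seed`, and `strong_token` are hypothetical, and the attacker is assumed to know roughly when the token was generated, to within some window of seconds.

```python
import random
import secrets

def weak_token(seed_time: int) -> str:
    """Hypothetical flawed generator: seeds a non-cryptographic PRNG with a timestamp."""
    rng = random.Random(seed_time)
    return "%032x" % rng.getrandbits(128)

def recover_seed(token: str, approx_time: int, window: int = 3600):
    """Try every timestamp within +/- window seconds of the attacker's guess."""
    for candidate in range(approx_time - window, approx_time + window + 1):
        if weak_token(candidate) == token:
            return candidate  # seed found, so every later output is predictable
    return None

def strong_token() -> str:
    """Draws from the operating system's CSPRNG, so there is no seed to guess."""
    return secrets.token_hex(16)
```

The search space in `recover_seed` is only a few thousand candidates, which is why the flaw compounds so badly with any other exposure that narrows down when a value was generated. The `secrets` version removes the need to reason about seeds at all, which is the "less thinking you have to do" part.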
I validate this as a nonfake alignment research direction that seems important.
Contrast my likewise-fictional story Kindness to Kin.
Just jaunt superquantumly to another quantum world instead of superluminally to an unobservable galaxy. What about these two physically impossible counterfactuals is less than perfectly isomorphic? Except for some mere ease of false-to-fact visualization inside a human imagination that finds it easier to track nonexistent imaginary Newtonian billiard balls than existent quantum clouds of amplitude, with the latter case, in reality, covering both unobservable galaxies distant in space and unobservable galaxies distant in phase space.
I reiterate the galaxy example; saying that you could counterfactually make an observation by violating physical law is not the same as saying that something’s meaning cashes out to anticipated experiences. Consider the (exact) analogy between believing that galaxies exist after they go over the horizon, and believing that other quantum worlds go on existing after we decohere them away from us by observing ourselves being inside only one of them. Predictivism is exactly the sort of ground on which some people have tried to claim that MWI isn’t meaningful, and they’re correct in that predictivism renders MWI meaningless just as it renders the claim “galaxies go on existing after we can no longer see them” meaningless. The reply “If we had methods to make observations outside our quantum world, we could see the other quantum worlds” would be correctly rejected by them as an argument from within predictivism; it is an argument from outside predictivism, and presumes that correspondence theories of truth can be defined meaningfully by imagining an account from outside the universe of how the things that we’ve observed have their own causal processes generating those observations, such that having thus identified the causal processes through observation, we may speak of unobservable but fully identified variables with no observable-to-us consequences such as the continued existence of distant galaxies and other quantum worlds.
One minor note is that, among the reasons I haven’t looked especially hard into the origins of “verificationism”(?) as a theory of meaning, is that I do in fact—as I understand it—explicitly deny this theory. The meaning of a statement is not the future experimental predictions that it brings about, nor isomorphic up to those predictions; all meaning about the causal universe derives from causal interactions with us, but you can have meaningful statements with no experimental consequences, for example: “Galaxies continue to exist after the expanding universe carries them over the horizon of observation from us.” For my actual theory of meaning see the “Physics and Causality” subsequence of Highly Advanced Epistemology 101 For Beginners.
That is: among the reasons why I am not more fascinated with the antecedents of my verificationist theory of meaning is that I explicitly reject a verificationist account of meaning.
My point is that plausible scenarios for Aligned AGI give you AGI that remains aligned only when run within power bounds, and this seems to me like one of the largest facts affecting the outcome of arms-race dynamics.
This all assumes that AGI does whatever its supposed operator wants it to do, and that other parties believe as much? I think the first part of this is very false, though the second part alas seems very realistic, so I think this misses the key thing that makes an AGI arms race lethal.
I expect that a dignified apocalypse looks like, “We could do limited things with this software and hope to not destroy the world, but as we ramp up the power and iterate the for-loops more times, the probability of destroying the world goes up along a logistic curve.” In “relatively optimistic” scenarios it will be obvious to operators and programmers that this curve is being ascended—that is, running the for-loops with higher bounds will produce an AGI with visibly greater social sophistication, increasing big-picture knowledge, visible crude attempts at subverting operators or escaping or replicating outside boxes, etc. We can then imagine the higher-ups demanding that crude patches be applied to get rid of the visible problems in order to ramp up the for-loops further, worrying that, if they don’t do this themselves, the Chinese will do that first with their stolen copy of the code. Somebody estimates a risk probability, somebody else tells them too bad, they need to take 5% more risk in order to keep up with the arms race. This resembles a nuclear arms race and deployment scenario where, even though there’s common knowledge that nuclear winter is a thing, you still end up with nuclear winter because people are instructed to incrementally deploy another 50 nuclear warheads at the cost of a 5% increase in triggering nuclear winter, and then the other side does the same. But this is at least a relatively more dignified death by poor Nash equilibrium, where people are taking everything as seriously as they took nuclear war back in the days when Presidents weren’t retired movie actors.
In less optimistic scenarios that realistically reflect the actual levels of understanding being displayed by programmers and managers in the most powerful organizations today, the programmers themselves just patch away the visible signs of impending doom and keep going, thinking that they have “debugged the software” rather than eliminated visible warning signs, being in denial for internal political reasons about how this is climbing a logistic probability curve towards ruin or how fast that curve is being climbed, not really having a lot of mental fun thinking about the doom they’re heading into and warding that off by saying, “But if we slow down, our competitors will catch up, and we don’t trust them to play nice” along of course with “Well, if Yudkowsky was right, we’re all dead anyways, so we may as well assume he was wrong”, and generally skipping straight to the fun part of running the AGI’s for-loops with as much computing power as is available to do the neatest possible things; and so we die in a less dignified fashion.
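As a toy illustration of the logistic-curve picture in this comment, here is a minimal sketch. All of it is assumption: the midpoint, the steepness, the unit of "power", and the names `logistic_risk` and `escalate` are invented for illustration, and nothing here estimates any real probability.

```python
import math

def logistic_risk(power: float, midpoint: float = 10.0, steepness: float = 1.0) -> float:
    """Toy P(catastrophe) as a logistic function of how hard the system is run."""
    return 1.0 / (1.0 + math.exp(-steepness * (power - midpoint)))

def escalate(start_power: float, step: float, rounds: int) -> list[tuple[float, float]]:
    """Each round, the operators raise the power bound by `step` to keep up with a rival."""
    history = []
    power = start_power
    for _ in range(rounds):
        power += step
        history.append((power, logistic_risk(power)))
    return history

if __name__ == "__main__":
    for power, risk in escalate(start_power=6.0, step=1.0, rounds=6):
        print(f"power={power:4.1f}  risk={risk:.3f}")
```

The point of the sketch is only the shape: equal-sized increments in power buy small increases in risk at the bottom of the curve and large ones near the middle, so "just one more step to keep up" becomes more expensive at exactly the point where the competitive pressure is highest.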
My point is that what you depict as multiple organizations worried about what other organizations will successfully do with an AGI being operated at maximum power, which is believed to do whatever its operator wants it to do, reflects a scenario where everybody dies really fast, because they all share a mistaken optimistic belief about what happens when you operate AGIs at increasing capability. The real lethality of the arms race is that blowing past hopefully-visible warning signs or patching them out, and running your AGI at increasing power, creates an increasing risk of the whole world ending immediately. Your scenario is one where people don’t understand that and think that AGIs do whatever the operators want, so it’s a scenario where the outcome of the multipolar tensions is instant death as soon as the computing resources are sufficient for lethality.
Thank you very much! It seems worth distinguishing the concept invention from the name brainstorming, in a case like this one, but I now agree that Rob Miles invented the word itself.
The technical term corrigibility, coined by Robert Miles, was introduced to the AGI safety/alignment community in the 2015 MIRI/FHI paper titled Corrigibility.
E.g., I’d suggest that to avoid confusion this kind of language should be something like “The technical term corrigibility, a name suggested by Robert Miles to denote concepts previously discussed at MIRI, was introduced...” &c.
Seems rather obvious to me that the sort of person who is like, “Oh, well, we can’t possibly work on this until later” will, come Later, be like, “Oh, well, it’s too late to start doing basic research now, we’ll have to work with whatever basic strategies we came up with already.”
Why do you think the term “corrigibility” was coined by Robert Miles? My autobiographical memory tends to be worryingly fallible, but I remember coining this term myself after some brainstorming (possibly at a MIRI workshop). This is a kind of thing that I usually try to avoid enforcing because it would look bad if all of the concepts that I did in fact invent were being cited as traceable to me—the truth about how much of this field I invented does not look good for the field or for humanity’s prospects—but outright errors of this sort should still be avoided, if an error it is.
Agent designs that provably meet more of them have since been developed, for example here.
First I’ve seen this paper, haven’t had a chance to look at it yet, would be very surprised if it fulfilled the claims made in the abstract. Those are very large claims and you should not take them at face value without a lot of careful looking.
Lots of people work for their privileges! I practiced writing for a LONG time—and remain continuously aware that other people cannot be expected to express their ideas clearly, even assuming their ideas to be clear, because I have Writing Privilege and they do not. Does my Writing Privilege have an innate component? Of course it does; my birth lottery placed me in a highly literate household full of actually good books, which combined with genuine genetic talent got me a 670 Verbal score on the pre-restandardized SAT at age eleven; but most teens with 670V SAT scores can’t express themselves at all clearly, and it was a long long time and a lot of practice before I started being able to express myself clearly ever even on special occasions. It remains a case of Privilege, and would be such even if I’d obtained it entirely by hard work starting from an IQ of exactly 100, not that this is possible, but if it were possible it would still be Privilege. People who study hard, work hard, compound their luck, and save up a lot of money, end up with Financial Privilege, and should keep that in mind before expecting less financially privileged friends to come with them on a non-expenses-paid fun friendly trip. We are all locally-Privileged in one aspect or another, even that kid at the center of Omelas, and all we can do is keep it in mind.
Is a bridge falling down the moment you finish building it an extreme and somewhat strange failure mode? In the space of all possible bridge designs, surely not. Most bridge designs fall over. But in the real world, you could win money all day betting that bridges won’t collapse the moment they’re finished.
Yeah, that kiiiinda relies on literally anybody anywhere being able to sketch a bridge that wouldn’t fall over, which is not the situation we are currently in.
But it feels to me like egregious misalignment is an extreme and somewhat strange failure mode and it should be possible to avoid it regardless of how the empirical facts shake out.
Paul, this seems a bizarre way to describe something that we agree is the default result of optimizing for almost anything (eg paperclips). Not only do I not understand what you actually did mean by this, it seems like phrasing that potentially leads astray other readers coming in for the first time. Say, if you imagine somebody at Deepmind coming in without a lot of prior acquaintance with the field—or some hapless innocent ordinary naive LessWrong reader who has a glowing brain, but not a galaxy brain, and who is taking Paul’s words for a lot of stuff about alignment because Paul has such a reassuring moderate tone compared to Eliezer—then they would come away from your paragraph thinking, “Oh, well, this isn’t something that happens if I take a giant model and train it to produce outputs that human raters score highly, because an ‘extreme and somewhat strange failure mode’ must surely require that I add on some unusual extra special code to my model in order to produce it.”
I suspect that you are talking in a way that leads a lot of people to vastly underestimate how difficult you think alignment is, because you’re assuming, in the background, exotic doing-stuff-right technology that does not exist, in order to prevent these “extreme and somewhat strange failure modes” from happening, as we agree they automatically would given any “naive” simple scheme, that you could actually sketch out concretely right now on paper. By which I mean, concretely enough that you could have any ordinary ML person understand in concrete enough detail that they could go write a skeleton of the code, as opposed to that you think you could later sketch out a research approach for doing. It’s not just a buffer overflow that’s the default for bad security, it’s the equivalent of a buffer overflow where nobody can right now exhibit how strange-failure-mode-avoiding code should concretely work in detail. “Strange” is a strange name for a behavior that is so much the default that it is an unsolved research problem to avoid it, even if you think that this research problem should definitely be solvable and it’s just something wrong or stupid about all of the approaches we could currently concretely code that would make them exhibit that behavior.
To answer your research question, in much the same way that in computer security any non-understood behavior of the system which violates our beliefs about how it’s supposed to work is a “bug” and very likely en route to an exploit—in the same way that OpenBSD treats every crash as a security problem, because the system is not supposed to crash and therefore any crash proves that our beliefs about the system are false and therefore our beliefs about its security may also be false because its behavior is not known—in AI safety, you would expect system security to rest on understandable system behaviors. In AGI alignment, I do not expect to be working in an adversarial environment unless things are already far past having been lost, so it’s a moot point. Predictability, stability, and control are the keys to exploit-resistance and this will be as true in AI as it is in computer security, with a few extremely limited exceptions in which randomness is deployed across a constrained and well-understood range of randomized behaviors with numerical parameters, much as memory locations and private keys are randomized in computer security without say randomizing the code. I hope this allows you to lay this research question to rest and move on.