Daniel Kokotajlo
Philosophy PhD student, worked at AI Impacts, now works at the Center on Long-Term Risk. Research interests include acausal trade, timelines, takeoff speeds & scenarios, decision theory, history, and a bunch of other stuff. I subscribe to Crocker’s Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html
I second Tailcalled’s question, and would precisify it: When did you first code up this simulation/experiment and run it (or a preliminary version of it)? A week ago? A month ago? Three months ago? A year ago?
Well then, would you agree that Evan’s position here:
By default, in the case of deception, my expectation is that we won’t get a warning shot at all
is plausible and in particular doesn’t depend on believing in a discontinuity, at least not the kind of discontinuity we should consider unlikely? If so, then we are all on the same page. If not, then we can rehash our argument focusing on this “obvious, real-world harm” definition, which is noticeably broader than my “strong” definition and therefore makes Evan’s claim stronger and less plausible but still, I think, plausible.
(To answer your earlier question, I’ve read and spoken to several people who seem to take the attempted-world-takeover warning shot scenario seriously, i.e. people who think there’s a good chance we’ll get “strong” warning shots. Paul Christiano, for example. Though it’s possible I was misunderstanding him. I originally interpreted you as maybe being one of those people, though now it seems that you are not? At any rate these people exist.)
EDIT: I feel like we’ve been talking past each other for much of this conversation, and in an effort to prevent that from continuing, perhaps instead of answering my questions above we should just get quantitative. Consider a spectrum of warning shots from very minor to very major. Put a few examples on the spectrum for illustration. Then draw a credence distribution for the probability that we’ll have warning shots of each kind. Maybe it’ll turn out that our distributions aren’t that different from each other after all, especially if we conditionalize on slow takeoff.
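To illustrate the kind of exercise I have in mind, here’s a minimal sketch in Python. The severity labels and all the numbers are made-up placeholders for illustration, not my actual credences:

```python
# Hypothetical sketch of the "get quantitative" exercise: put a few example
# warning shots on a severity spectrum, then each of us assigns a credence
# that we get a warning shot at least that severe (conditional on slow
# takeoff). Every number and label below is a placeholder.
spectrum = [
    "customer-service bots willingly scammed",          # very minor
    "AI caught systematically deceiving its operators",
    "AI causes obvious, real-world harm",
    "AI attempts world takeover and fails",             # "strong" warning shot
]
my_credences = [0.95, 0.70, 0.40, 0.05]
your_credences = [0.95, 0.80, 0.60, 0.30]

for event, p_me, p_you in zip(spectrum, my_credences, your_credences):
    print(f"{event:50s}  me: {p_me:.2f}  you: {p_you:.2f}")
```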
It’s been a while since I thought about this, but going back to the beginning of this thread:
“It’s unlikely you’ll get a warning shot for deceptive alignment, since if the first advanced AI system is deceptive and that deception is missed during training, once it’s deployed it’s likely for all the different deceptively aligned systems to be able to relatively easily coordinate with each other to defect simultaneously and ensure that their defection is unrecoverable (e.g. Paul’s “cascading failures”).”
At a high level, you’re claiming that we don’t get a warning shot because there’s a discontinuity in capability of the aggregate of AI systems (the aggregate goes from “can barely do anything deceptive” to “can coordinate to properly execute a treacherous turn”).
I think all the standard arguments against discontinuities can apply just as well to the aggregate of AI systems as they can to individual AI systems, so I don’t find your argument here compelling.
I think the first paragraph (Evan’s) is basically right, and the second two paragraphs (your response) are basically wrong. I don’t think this has anything to do with discontinuities, at least not the kind of discontinuities that are unlikely. (Compare to the mutiny analogy.) The distinction between “strong” and “weak” warning shots matters because “weak” warning shots will probably only provoke a moderate increase in caution on the part of human institutions and AI projects, whereas “strong” warning shots would provoke a large one. I agree that we’ll probably get various “weak” warning shots, but that doesn’t change the overall picture much, precisely because they won’t provoke a major increase in caution.
I’m guessing it’s that last bit that is the crux—perhaps you think that it would actually provoke a major increase in caution, comparable to the increase we’d get if an AI tried and failed to take over, in which case this minor warning shot vs. major warning shot distinction doesn’t matter much.
Yep! I prefer my terminology but it’s basically the same concept I think.
I disagree; I think we go astray by counting things like thermostats as agents. I’m proposing that this particular feedback loop I diagrammed is really important, a much more interesting phenomenon to study than the more general category of feedback loop that includes thermostats.
Years after I first thought of it, I continue to think that this chain reaction is the core of what it means for something to be an agent, AND why agency is such a big deal, the sort of thing we should expect to arise and outcompete non-agents. Here’s a diagram:
Roughly, plans are necessary for generalizing to new situations, for being competitive in contests where there hasn’t been time for natural selection to do lots of optimization of policies. But plans are only as good as the knowledge they are based on. And knowledge doesn’t come a priori; it has to be learned from data. And, crucially, data varies a lot in quality, because most of it is irrelevant/unimportant. High-quality data, the kind that gives you useful knowledge, is hard to come by. Indeed, you may need to make a plan for how to get it. (Or more generally: being better at making plans makes you better at getting higher-quality data, which makes you more knowledgeable, which makes your plans better.)
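Here’s a deliberately simplistic toy model of the loop, just to illustrate why it compounds; the functional forms and constants are invented for illustration and aren’t meant as a serious model:

```python
# Toy illustration of the chain reaction: better plans -> higher-quality data
# -> more knowledge -> better plans. The update rules are arbitrary; the point
# is only that the loop is self-reinforcing.
plan_quality = 1.0
knowledge = 1.0
for step in range(5):
    data_quality = 0.5 * plan_quality   # better plans get you better data
    knowledge += data_quality           # knowledge is learned from data
    plan_quality = knowledge ** 0.5     # plans are only as good as your knowledge
    print(f"step {step}: plan_quality={plan_quality:.2f}, knowledge={knowledge:.2f}")
```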
Thanks for doing this! I’m excited to see this sequence grow, it’s the sort of thing that could serve the function of a journal or textbook.
OK, thanks. YMMV but some people I’ve read / talked to seem to think that before we have successful world-takeover attempts, we’ll have unsuccessful ones—”sordid stumbles.” If this is true, it’s good news, because it makes it a LOT easier to prevent successful attempts. Alas it is not true.
A much weaker version of something like this may be true, e.g. the warning shot story you proposed a while back about customer service bots being willingly scammed. It’s plausible to me that we’ll get stuff like that before it’s too late.
If you think there’s something we are not on the same page about here—perhaps what you were hinting at with your final sentence—I’d be interested to hear it.
I’m probably just being mathematically confused myself; at any rate, I’ll proceed with the p[Tk & e+] : p[Tk & e-] version since that comes more naturally to me. (I think of it like: your credence in Tk is split between two buckets, the Tk&e+ bucket and the Tk&e- bucket, and then when you update you rule out the e- bucket. So what matters is the ratio between the buckets; if it’s relatively high (compared to the ratio for other Tx’s) your credence in Tk goes up, and if it’s relatively low it goes down.)
Anyhow, I totally agree that this ratio matters and that it varies with k. In particular here’s how I think it should vary for most readers of my post:
for k>12, the ratio should be low, like 0.1.
for low k, the ratio should be higher.
for middling k, say 6<k<13, the ratio should be in between.
Thus, the update should actually shift probability mass disproportionately to the lower k hypotheses.
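Here’s a minimal numerical sketch of the bucket picture. The priors and the ratios for low and middling k are invented placeholders (the text above only pins down ~0.1 for k>12):

```python
# Toy version of the two-bucket update: credence in each Tk is split between
# a Tk&e+ bucket and a Tk&e- bucket; updating on e+ deletes the e- bucket and
# renormalizes. Priors and the first two ratios are made-up placeholders.
priors = {"low k": 0.30, "middling k (6<k<13)": 0.40, "k>12": 0.30}
ratios = {"low k": 1.00, "middling k (6<k<13)": 0.50, "k>12": 0.10}  # p[Tk&e+] : p[Tk&e-]

# Mass surviving the update is the e+ bucket: p[Tk&e+] = p[Tk] * r/(1+r).
surviving = {k: priors[k] * ratios[k] / (1 + ratios[k]) for k in priors}
total = sum(surviving.values())

for k in priors:
    print(f"{k:22s} prior {priors[k]:.2f} -> posterior {surviving[k] / total:.2f}")
# Output: low k 0.30 -> 0.48, middling 0.40 -> 0.43, k>12 0.30 -> 0.09, i.e.
# probability mass shifts disproportionately to the lower-k hypotheses.
```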
I realize we are sort of arguing in circles now. I feel like we are making progress though. Also, separately, want to hop on a call with me sometime to sort this out? I’ve got some more arguments to show you...
Thanks for this post!
If some of the more pessimistic projections about the timelines to TAI are realized, my efforts in this field will have no effect. It is going to take at least 30 years for dramatically more capable humans to be able to meaningfully contribute to work in this field. Using Ajeya Cotra’s estimate of the timeline to TAI, which estimates a 50% chance of TAI by 2052, I estimate that there is at most a 50% probability that these efforts will have an impact, and a ~25% chance that they will have a large impact.
Those odds are good enough for me.
How low would the odds have to be before you would switch to doing something else? Would you continue with your current plan if the odds were 20-10 instead of 50-25?
So what ends up mattering is the ratio p[Tk | e+] : p[Tk | e-]
I’m claiming that this ratio is likely to vary with k.
Wait, shouldn’t it be the ratio p[Tk & e+] : p[Tk & e-]? Maybe both ratios work fine for our purposes, but I certainly find it more natural to think in terms of &.
Thanks for this, this is awesome! I’m hopeful in the next few years for there to be a collection of stories like this.
This is a story where the alignment problem is somewhat harder than I expect, society handles AI more competently than I expect, and the outcome is worse than I expect. It also involves inner alignment turning out to be a surprisingly small problem. Maybe the story is 10-20th percentile on each of those axes.
I’m a bit surprised that the outcome is worse than you expect, considering that this scenario is “easy mode” for societal competence and inner alignment, which seem to me to be very important parts of the overall problem. Am I right to infer that you think outer alignment is the bulk of the alignment problem, more difficult than inner alignment and societal competence?
Some other threads to pull on:
--In this story, there aren’t any major actual wars, just simulated wars / war games. Right? Why is that? I look at the historical base rate of wars, and my intuitive model adds to that by saying that during times of rapid technological change it’s more likely that various factions will get various advantages (or even just think they have advantages) that make them want to try something risky. OTOH we haven’t had a major war for seventy years, and maybe that’s because of nukes + other factors, and maybe nukes + other factors will persist through the period of takeoff? IDK. I worry that the reasons we haven’t had a major war for seventy years may be largely luck / observer selection effects, and separately, even if that’s wrong, I worry that those reasons won’t persist through takeoff (e.g. some factions may develop ways to shoot down ICBMs, or prevent their launch in the first place, or may not care so much if there is nuclear winter).
--Relatedly, in this story the AIs seem to be mostly on the same team? What do you think is going on “under the hood” so to speak: Have they all coordinated (perhaps without even causally communicating) to cut the humans out of control of the future? Why aren’t they fighting each other as well as the humans? Or maybe they do fight each other but you didn’t focus on that aspect of the story because it’s less relevant to us?
--Yeah, society will very likely not be that competent IMO. I think that’s the biggest implausibility of this story so far.
--(Perhaps relatedly) I feel like when takeoff is that distributed, there will be at least some people/factions who create agenty AI systems that aren’t even as superficially aligned as the unaligned benchmark. They won’t even be trying to make things look good according to human judgment, much less augmented human judgment! For example, some AI scientists today seem to think that all we need to do is make our AI curious and then everything will work out fine. Others seem to think that it’s right and proper for humans to be killed and replaced by machines. Others will try strategies even more naive than the unaligned benchmark, such as putting their AI through some “ethics training” dataset, or warning their AI “If you try anything I’ll unplug you.” (I’m optimistic that these particular failure modes will have been mostly prevented via awareness-raising before takeoff, but I do a pessimistic meta-induction and infer there will be other failure modes that are not prevented in time.)
--Can you say more about how “the failure modes in this story are an important input into treachery?”
On the contrary, the graph of launch costs you link seems to depict Falcon 9 as a 15-ish-year discontinuity in cost to orbit; I think you are misled by the projection, which is based on hypothetical future systems rather than on extrapolating from actual existing systems.
I’m betting that a little buzz on my phone which I can dismiss with a tap won’t kill my focus. We’ll see.
Productivity app idea:
You set a schedule of times when you want to be productive, and a frequency, and then it pings you at random times (but at that average frequency) to bug you with questions like:
--Are you “in the zone” right now? [Y] [N]
--(if no) What are you doing? [text box] [common answer] [common answer] [...]
The point is to cheaply collect data about when you are most productive and what your main time-wasters are, while also giving you gentle nudges to stop procrastinating/browsing/daydreaming/doomscrolling/working-sluggishly, take a deep breath, reconsider your priorities for the day, and start afresh.
Probably wouldn’t work for most people but it feels like it might for me.
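Here’s a minimal sketch of the core scheduling logic, assuming a simple command-line stand-in for the phone notifications; the window length, ping frequency, and log format are all placeholders:

```python
# Minimal sketch: during a scheduled productive window, fire prompts at random
# times with a chosen average frequency (Poisson-style), and log the answers
# so you can later see when you're most productive and what your time-wasters are.
import random
import time
from datetime import datetime

PINGS_PER_HOUR = 2        # placeholder average frequency
SESSION_HOURS = 1         # placeholder scheduled window

def run_session(log_path="ping_log.tsv"):
    end = time.time() + SESSION_HOURS * 3600
    while time.time() < end:
        # Exponential inter-arrival times -> pings at random, but with the
        # chosen average frequency.
        time.sleep(random.expovariate(PINGS_PER_HOUR / 3600))
        in_zone = input("Are you 'in the zone' right now? [y/n] ").strip().lower()
        doing = "" if in_zone.startswith("y") else input("What are you doing? ")
        with open(log_path, "a") as f:
            f.write(f"{datetime.now().isoformat()}\t{in_zone}\t{doing}\n")

if __name__ == "__main__":
    run_session()
```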
No judgment; quite the opposite! You are displaying far above-average concern for the truth and for others by admitting your mistake and seeking to correct it. (Arguably, even noticing your mistake already makes you above-average, since most would rationalize it away.)
That seems plausible, but AIs can have pointers to learning from other humans too. E.g. GPT-3 read the Internet; if we were making some more complicated system, it could evolve pointers analogous to the human ones. I think.
Yeah, somewhere along that spectrum. Generally speaking, I’m skeptical of claims that we know a lot about the brain.
“(And I think the evidence against it is mounting, this being one of the key pieces.)”
(I still don’t see why.)
--I wouldn’t characterize my own position as “we know a lot about the brain.” I think we should taboo “a lot.”
--We are at an impasse here, I guess—I think there’s mounting evidence that brains use predictive coding, and mounting evidence that predictive coding is like backprop. I agree it’s not conclusive, but this paper seems to be pushing in that direction, and there are others like it IIRC. I’m guessing you’re just significantly more skeptical than I am of both predictive coding and the predictive coding --> backprop link… perhaps because the other hypotheses on my list are less plausible to you?
I guess I was thinking: Brains use predictive coding, and predictive coding is basically backprop, so brains can’t be using something dramatically better than backprop. You are objecting to the “brains use predictive coding” step? Or are you objecting that only one particular version of predictive coding is basically backprop?
But we also know there are algorithms which are way more data-efficient than NNs (while being more processing-power intensive). So wouldn’t the obvious conclusion from our observations be: humans don’t use backprop, but rather, use more data-efficient algorithms?
Are you referring to Solomonoff Induction and the like? I think the “brains use more data-efficient algorithms” is an obvious hypothesis but not an obvious conclusion—there are several competing hypotheses, outlined above. (And I think the evidence against it is mounting, this being one of the key pieces.)
I’ll grant, I’m now quite curious how the scaling argument works out. Is it plausible that human-brain-sized NNs are as data-efficient as humans?
In terms of bits/pixels/etc., humans see plenty of data in their lifetime, a bit more than the scaling laws would predict IIRC. But the scaling laws (as interpreted by Ajeya, Rohin, etc.) are about the amount of subjective time the model needs to run before you can evaluate the result. If we assume for humans it’s something like 1 second on average (because our brains are evaluating-and-updating weights etc. on about that timescale) then we have a mere 10^9 data points, which is something like 4 OOMs less than the scaling laws would predict. If instead we think it’s longer, then the gap in data-efficiency grows.
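To make the back-of-envelope explicit (the ~30-year lifetime and the exact scaling-law requirement are my rough placeholders; the 4-OOM gap is the number doing the work):

```python
# Rough arithmetic behind the "mere 10^9 data points" claim, assuming ~one
# evaluate-and-update "data point" per subjective second over ~30 years.
import math

seconds_per_year = 365 * 24 * 3600                    # ~3.15e7
human_data_points = 30 * seconds_per_year             # ~1e9
print(f"human data points ~ 10^{math.log10(human_data_points):.1f}")

# If that's ~4 OOMs short of what the scaling laws predict for a
# human-brain-sized model, the prediction is on the order of 1e13 points.
scaling_law_points = human_data_points * 10**4
print(f"scaling-law prediction ~ 10^{math.log10(scaling_law_points):.1f}")
```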
Some issues though. One, the scaling laws might not be the same for all architectures. Maybe if your context window is bigger, or you use recurrence, or whatever, the laws are different. Too early to tell, at least for me (maybe others have more confident opinions; I’d love to hear them!). Two, some data is higher-quality than other data, and plausibly human data is higher-quality than the stuff GPT-3 was fed—e.g. humans deliberately seek out data that teaches them stuff they want to know, instead of just dully staring at a firehose of random stuff. Three, it’s not clear how to apply this to humans anyway. Maybe our neurons are updating a hundred times a second or something.
I’d be pretty surprised if a human-brain-sized Transformer were able to get as good as a human at most important human tasks simply by seeing a firehose of 10^9 images or context windows of internet data. But I’d also be pretty surprised (10%) if the scaling laws turn out to be so universal that we can’t get around them, i.e. if it turns out that transformative tasks really do require a NN at least the size of a human brain, trained for at least 10^14 steps or so, where each step involves running the NN for at least a subjective week. (A subjective second I’d find more plausible. Or a subjective week (or longer) but with fewer than 10^14 steps.)
Update: After talking to various people, it appears that (contrary to what the poll would suggest) there are at least a few people who answer Question 2 (all three variants) with less than 80%. In light of those conversations, and more thinking on my own, here is my current hot take on how +12 OOMs could turn out to not be enough:
1. Maybe the scaling laws will break. Just because GPT performance has fit a steady line across 5 orders of magnitude so far (or whatever) doesn’t mean it will continue for another 5. Maybe it’ll level off for some reason we don’t yet understand. Arguably this is what happened with LSTMs? Anyhow, for timelines purposes what matters is not whether it’ll level off by the time we are spending +12 OOMs of compute, but rather whether it will level off by the time we are spending +6 OOMs of compute. I think it’s rather unlikely to level off that soon, but it might. Maybe 20% chance. If this happens, then probably Amp(GPT-7) and the like wouldn’t work (80%?). The others are less impacted, but maybe we can assume OmegaStar probably won’t work either. Crystal Nights, SkunkWorks, and Neuromorph… don’t seem to be affected by scaling laws though. If this were the only consideration, my credence would be something like a 15% chance that Crystal Nights and OmegaStar don’t work, and then, independently, maybe a 30% chance that none of the others work either, for a total answer to Question Two of something like 95%… :/ I could fairly easily be convinced that it’s more like a 40% chance instead of 15%, in which case my answer is still something like 85%… :(
2. Maybe the horizon-length framework plus scaling laws really will turn out to be a lot more solid than I think. In other words, maybe +12 OOMs is enough to get us some really cool chatbots and whatnot but not anything transformative or PONR-inducing; for those tasks we need long-horizon training… (Medium horizons can be handled by +12 OOMs.) Unsurprisingly to those who’ve read my sequence on takeoff and takeover, I do not think this is very plausible; I’m gonna say something like 10%. (Remember, it has to apply not just to standard ML stuff like OmegaStar, but also to amplified GPT-7 and also to Crystal Nights and whatnot. It has to be basically an Iron Law of Learning.) Happily this is independent of point 1, so that makes for a total answer to Q2 of something more like 85%.
3. There’s always unknown unknowns. I include “maybe we are data-limited” in this category. Or maybe it turns out that +12 OOMs is enough, and actually +8 OOMs is enough, but we just don’t have the industrial capacity or energy production capacity to scale up nearly that far in the next 20 years or so. I prefer to think of these things as add-ons to the model that shift our timelines back by a couple years, rather than as things that change our answer to Question Two. Unknown unknowns that change our answer to Question Two seem like, well, the thing I mentioned in the text—maybe there’s some super special special sauce that not even Crystal Nights or Neuromorph can find etc. etc. and also Skunkworks turns out to be useless. Yeah… I’m gonna put 5% in this category. Total answer to Question Two is 80% and I’m feeling pretty reasonable about it.
4. There’s biases. Some people I talked to basically said “Yeah, 80%+ seems right to me too, but I think we should correct for biases and assume everything is more difficult than it appears; if it seems to us that it’s very likely to be enough, that means it’s 50% likely to be enough.” I don’t currently endorse this, because I think that the biases pushing in the opposite direction—biases of respectability, anti-weirdness, optimism, etc.--are probably on the whole stronger. Also, the people in the past who used the human-brain milestone to forecast AI seem to have been surprisingly right; of course it’s too early to say, but reality really is looking exactly like it should look if they were right...
5. There’s deference to the opinions of others, e.g. AI scientists in academia, economists forecasting GWP trends, the financial markets… My general response is “fuck that.” If you are interested I can say more about why I feel this way; ultimately I do in fact make a mild longer-timelines update as a result of this but I do so grudgingly. Also, Roodman’s model actually predicts 2037, not 2047. And that’s not even taking into account how AI-PONR will probably be a few years beforehand!
So all in all my credence has gone down from 90% to 80% for Question Two, but I’ve also become more confident that I’m basically correct, that I’m not the crazy one here. Because now I understand the arguments people gave, the models people have, for why the number might be less than 80%, and I have evaluated them and they don’t seem that strong.
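For concreteness, here’s a rough reconstruction of how the numbers in points 1-3 combine, treating the three failure modes as roughly independent (as the estimates above implicitly do):

```python
# Rough reconstruction of the arithmetic in points 1-3 above.
p_fail_scaling  = 0.15 * 0.30  # point 1: Crystal Nights & OmegaStar fail AND the remaining paths fail
p_fail_horizon  = 0.10         # point 2: long-horizon training is an Iron Law of Learning
p_fail_unknowns = 0.05         # point 3: unknown unknowns / missing special sauce

p_enough = (1 - p_fail_scaling) * (1 - p_fail_horizon) * (1 - p_fail_unknowns)
print(f"P(+12 OOMs is enough) ~ {p_enough:.0%}")  # ~82%, i.e. roughly the 80% above
```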
I’d love to hear more thoughts and takes by the way, if you have any please comment!