Dwarkesh’s summary for part 1:
In terms of the depth and range of topics, this episode is the best I’ve done.
No part of my worldview is the same after talking with Carl Shulman. He’s the most interesting intellectual you’ve never heard of.
We ended up talking for 8 hours, so I’m splitting this episode into 2 parts.
This part is about Carl’s model of an intelligence explosion, which integrates everything from:
how fast algorithmic progress & hardware improvements in AI are happening,
what primate evolution suggests about the scaling hypothesis,
how soon before AIs could do large parts of AI research themselves, and whether there would be faster and faster doublings of AI researchers,
how quickly robots produced from existing factories could take over the economy.
We also discuss the odds of a takeover based on whether the AI is aligned before the intelligence explosion happens, and Carl explains why he’s more optimistic than Eliezer.
The next part, which I’ll release next week, is about all the specific mechanisms of an AI takeover, plus a whole bunch of other galaxy brain stuff.
Maybe 3 people in the world have thought as rigorously as Carl about so many interesting topics. So this was a huge pleasure.
Dwarkesh’s summary for part 2:
The second half of my 7-hour conversation with Carl Shulman is out!
My favorite part! And the one that had the biggest impact on my worldview.
Here, Carl lays out how an AI takeover might happen:
AI can threaten mutually assured destruction from bioweapons,
use cyber attacks to take over physical infrastructure,
build mechanical armies,
spread seed AIs we can never exterminate,
offer tech and other advantages to collaborating countries, etc.
Plus we talk about a whole bunch of weird and interesting topics which Carl has thought about:
what is the far future best case scenario for humanity
what it would look like to have AI make thousands of years of intellectual progress in a month
how do we detect deception in superhuman models
does space warfare favor defense or offense
is a Malthusian state inevitable in the long run
why markets haven’t priced in explosive economic growth
& much more
Carl also explains how he developed such a rigorous, thoughtful, and interdisciplinary model of the biggest problems in the world.
(Also discussed on the EA Forum here.)
Carl, your vision (which I guess is fairly mainstream in AI alignment these days) involves having a powerful civilization be controlled by something weak and fragile (humans), but I worry that humans are both prone to responding in crazy ways to many kinds of inputs[1] and being optimized against by superintelligences (since they are rewarded/reinforced by evaluations from humans). I feel like a lot of the stuff you talk about in this podcast (mind reading, adversarial training) as being responsible for your optimism mostly just shifts probability mass from the more obvious bad outcomes (deception, takeover) to less obvious bad outcomes (humans going crazy, failing to converge to correct values). It’s hard for me to get nearly as hopeful as you seem, without being able to see some kind of solution that has the shape of “here are all the ways that humans are defective/unsafe, and ways to patch those holes or avoid them in human-AI interactions” (or some other kind of solution that avoids depending on humans for x-safety at all), unless “hopeful” is defined narrowly as solving intent-alignment or avoiding takeover.
To put it another way, being easily deceived is just one way that humans are defective/unsafe, but we (i.e., people in the AI alignment/x-risk space) seem to be focused on it almost to the complete exclusion of other human-AI safety problems, which seems bad from both a technical perspective (how do we solve the other problems if nobody is working on them) and social/governance perspective (society is liable to be too optimistic about the AI transition going well because nobody hears about the less obvious problems). As evidence for the neglect, I note that interstice’s comment to Paul Christiano’s My views on “doom” remains unanswered:
[1] For example, getting addicted, believing in obviously crazy conspiracy theories, refusing to speak or act in accordance with obvious truths because they’re low status / outside the Overton window, getting sucked into non-effective or counterproductive causes.
As we’ve discussed and in short, I think aligned AI permits dialing up many of the processes that make science or prediction markets imperfectly self-correcting: tremendously cheaper, in parallel, on the full panoply of questions (including philosophy and the social sciences), with robust consistency, cross-examination, test sets, and forecasting. These sorts of things are an important part of scalable supervision for alignment, but if they can be made to work I expect them to drive strong epistemic convergence.
You’ve described some of these ideas to me before, but not in enough detail for me to form a judgement on the actual merits of the ideas and arguments. So I’m having to rely on my usual skeptical prior for new untested ideas in the philosophical or AI safety areas (because a lot of new ideas end up not working out, and people tend to be overconfident about their own original ideas), along with:
We seem to understand the philosophy/epistemology of science much better than that of philosophy (i.e. metaphilosophy), and at least superficially the methods humans use to make progress in them don’t look very similar, so it seems suspicious that the same AI-based methods happen to work equally well for science and for philosophy. (I tried to understand/explain the difference between the two and why philosophy might be especially difficult or problematic for ML in Some Thoughts on Metaphilosophy. I’m not necessarily putting a lot of stock in my specific explanations but it would be a pretty big surprise to learn that it turns out they’re really the same.)
So to the extent that your overall optimism is based on optimism about these new ideas, I think I have to discount much of that, at least until I see a reasonably detailed write-up.
Per Feynman, understanding is the ability to practice. Or the ability to construct a practicing AI?
Maybe we understand the philosophy of science/epistemology reasonably well, but we don’t yet know how to construct a universal AI scientist.
Philosophy is a nebulous concept with multiple overlapping meanings, and as a practice, doesn’t have a single crisp functional conceptualisation, like those you enumerated in your post on metaphilosophy. It’s rather some superposition of these conceptualisations. But the most “important” conceptualisation, it seems to me, is that philosophy are coherent stories that help people craft shared meaning and motivation. This is pretty evident in religion-as-philosophy, moral philosophy, and most “applied” philosophies, from philosophy of physics to philosophy of art, and metaphilosophy itself. We discussed this point here: https://www.lesswrong.com/posts/k93NEoXZq6CdXegdx/philosophical-cyborg-part-1. So, I actually disagree that we are as much “in the dark” on the nature of philosophy as you present. I feel that I understand philosophy not much worse than I understand epistemology/science (albeit still somewhat worse).
AI is already very good at fulfilling this functional role of philosophy in many cases, although if we task it with coming up with a completely novel interpretation of quantum mechanics, it will probably still struggle.
I didn’t understand why the question of whether AI can practice good philosophy is very relevant to p(doom), though.
I would agree with “philosophy are coherent stories that help people craft shared meaning and motivation” but I think it’s not merely this. The position that it is merely this (and therefore philosophical questions do not have answers that can be true or false) would be self-undermining, because if there is no true or false in metaphilosophy, then your own metaphilosophical position can’t be true (providing me with no reason to accept it on that basis), and thinking that all or most philosophical questions that I care about have no truth value would be rather demotivating.
Consider instead the following alternative metaphilosophical position (that I tentatively hold): Philosophy may be many things (including “coherent stories that help people craft shared meaning and motivation”) but it is also a method of answering confusing questions, that (at least some) humans seem to possess but can’t yet understand or explain, and many philosophical questions do seem to have answers that can be true or false. What is incoherent or not motivating about this? What is wrong with this in general?
This is exactly what I meant as well, in “Philosophy is a nebulous concept with multiple overlapping meanings, and as a practice, doesn’t have a single crisp functional conceptualisation, like those you enumerated in your post on metaphilosophy. It’s rather some superposition of these conceptualisations.”
Let’s first disentangle two questions: “Why do people practice philosophy?” and “What is the (truth) status of philosophical statements?”.
The first is a question for anthropology and (social) psychology, which, I could argue, are themselves currently closer to “philosophies” than to “sciences”: they attempt to explain existing evidence (with “coherent stories”) but don’t produce mathematical models with good predictive power. Nevertheless, I agree that there are multiple explanations we can provide, from “finding answers to difficult questions” and “by practicing philosophy, people try to increase the coherence of their world models, which is a deep ‘motivation’ of conscious biological agents, and which was imported into the domain of words 80k years ago or whenever people acquired compositional language”, to “showing off”.
These are retrospective explanations, which don’t tell us much about why AI would be motivated to practice philosophy. As an algorithm for finding more coherent and parsimonious sets of symbols/concepts (ontologies), philosophy may perhaps be implemented in LM-like AI, or there may be other, more efficient algorithms for this.
To the second question, about the status of philosophical questions, I think the answer is the following: while the essence of a scientific theory is a mathematical model, judged by how well it matches evidence and coheres with adjacent scientific (mathematical) models, philosophy is text judged by its coherence with scientific models, its internal (linguistic) coherence, and its coherence with other philosophical texts (arbitrarily selected by the philosopher, just as a scientist arbitrarily selects the theories they want their own theory to match well with, thus advancing this or that scientific paradigm).
Internal coherence of a text and its coherence with other texts is a question of (neuro)semiotics and linguistics/philosophy of language, both of which, in my mind, are branches of cognitive science. If something other than coherence makes texts convincing to people (setting aside external factors such as the likability and authority of the text’s author or the orator), then the “quality” of philosophical texts also becomes a question of neuropsychology more generally.
The above story about coherence applies to most kinds of philosophy except “foundational”, such as foundations of physics or “philosophical paradigms” such as pragmatism, which perhaps serve as capstones for large sets of other scientific and philosophical theories, and the merit of these “foundational” philosophies is judged by the overall coherence of these sets of other theories being capstoned.
I don’t like using the word “truth” outside of logic, but if I’m forced to, the above implies that I go with some version of the coherence theory of truth.
Before the invention of logic, someone might have said the same thing about math, that nothing determines the “quality” of a proof, aside from how convincing human neuropsychology happens to find it. I’m not saying that for sure philosophy is the same or analogous, that we’ll definitely find deeper reasons than neuropsychology for why a philosophical text is correct or convincing, but neither do I know how to rule that out, which makes me uncertain.
Plus, intuitively it seems like when trying to answer philosophical questions, I’m often aiming for some truth more “real” or “objective” than merely coherence with scientific models and arbitrarily selected other texts. For example, it seems either objectively true or objectively false that nothing determines the quality of a philosophical text aside from coherence and neuropsychology. The truth value of this statement doesn’t seem to depend on what other texts I happen to select to try to make it cohere with, or other subjective factors.
Does it seem to you that the kinds of people who are good at science vs good at philosophy (or the kinds of reasoning processes they use) are especially different?
In your own case, it seems to me like you’re someone who’s good at philosophy, but you’re also good at more “mundane” technical tasks like programming and cryptography. Do you think this is a coincidence?
I would guess that there’s a common factor of intelligence + being a careful thinker. Would you guess that we can mechanize the intelligence part but not the careful thinking part?
A lot of people are way better than me at technical tasks (at some point I wanted to go into cryptography research as a career but had to stick with applied cryptography, which is less technically demanding), but way worse at philosophy (or at least have shown little interest in philosophy, which itself seems like a major philosophical error). I don’t know how to explain this if science and philosophy are really the same thing or use the same methods.
It probably has to be more than that, because lots of people in cryptography (and security in general) are highly intelligent and careful thinkers (how else can they survive in those fields), but again AFAICT most people in those fields are not particularly good philosophers. Maybe at least one necessary additional ingredient is “good philosophical intuitions” (for example you have to have an intuition that philosophy is important before you would even start thinking about it) but I have little idea how to break that down further.
Here is how you can talk about bombing unlicensed datacenters without using “strike” and “bombing”.
Here is an alternative to dangerous experiments to develop enhanced cognition in humans. It sounds less extreme and a little more doable.
Also pretty bizarre that in response to
The answer was: yes, but we are doing it anyway, with twists like adversarial examples, adversarial training, and simulations. If Shulman had THE ANSWER to the alignment problem, he would not have kept it secret, but I can’t help but feel some disappointment, because he sounds SO hopeful and confident. I somehow expected something different from a variation of “we are going to use weaker AIs to help us align stronger AIs while trying to outrun capabilities research teams”, even if this variation (in his description) seems very sophisticated, with mind reading and inducing hallucinations.
The thing was already an obscene 7 hours with a focus on intelligence explosion and mechanics of AI takeover (which are under-discussed in the discourse and easy to improve on, so I wanted to get concrete details out). More detail on alignment plans and human-AI joint societies are planned focus areas for the next times I do podcasts.
I think it’s pretty easy to talk about bombing without saying “bombing”, it’s just… less clear. (Depending on how you do it and how sustained it is, it feels Orwellian and dishonest. I think Carl’s phrasing here is fine, but I do want someone somewhere to be clear about what’d be required.)
(It seems plausibly an actually-good strategy to have Eliezer off saying extreme/clear things and moving the overton window while other people say reasonable-at-first-glance sounding things)
I wanted to read through the transcript but found it a bit overwhelming. I made two Google Docs for the Pt 1 and Pt 2 transcripts, where people can comment on bits that seem particularly interesting. This seemed maybe helpful for an impromptu reading group.
Part 1
Part 2
I’ve only listened to part 1 so far, and I found the discussion of intelligence explosion to be especially fresh. (That’s hard to do given the flood of AI takes!) In particular (from memory, so I apologize for errors):
By analogy to chip compute scaling as a function of researcher population, it makes super-exponential growth seem possible if AI-compute-increase is substituted for researcher-population-increase. A particularly interesting aspect of this is that the answer could have come out the other way if the numbers had worked out differently as Moore’s law progressed. (It’s always nice to give reality a chance to prove you wrong.)
The intelligence explosion starts before human-level AI. But I was left wanting to know more: if so, how do we know when we’ve crossed the inflection point into the intelligence explosion? Is it possible that we’re already in an intelligence explosion, since AlexNet, or Google’s founding, or the creation of the internet, or even the invention of digital computers? And I thought Patel’s point about the difficulty of automating a “portfolio of tasks” was great and not entirely addressed.
The view of intelligence explosion as consisting concretely of increases in AI researcher productivity, though I’ve seen it observed elsewhere, was good to hear again. It helps connect the abstract concept of intelligence explosion to how it could play out in the real world.
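The first point above (super-exponential growth once AI compute substitutes for researcher population) can be made concrete with a toy model. This is an illustrative sketch assuming a simple semi-endogenous growth model, not Shulman’s actual model: suppose AI capability A grows as dA/dt = δ·A^φ·R^λ, and AI compute stands in for researchers so that research input R is proportional to A. Then dA/dt = δ·A^(φ+λ), and growth is super-exponential when φ + λ > 1, i.e. roughly when a doubling of research input buys more than one doubling of output. The function name doubling_times, its parameters, and their values are hypothetical and illustrative, not estimates.

```python
def doubling_times(exponent, a0=1.0, delta=0.1, dt=1e-3, n_doublings=5):
    """Euler-integrate the toy model dA/dt = delta * A**exponent and
    return how long each successive doubling of A takes.

    `exponent` plays the role of (phi + lambda): > 1 gives super-exponential
    growth, == 1 exponential, < 1 sub-exponential. All defaults are
    illustrative assumptions, not empirical estimates.
    """
    a, t = a0, 0.0
    target, last_crossing = 2 * a0, 0.0
    times = []
    while len(times) < n_doublings:
        a += delta * a ** exponent * dt   # one Euler step
        t += dt
        if a >= target:                   # A just passed its next doubling
            times.append(t - last_crossing)
            last_crossing = t
            target *= 2
    return times


for exponent in (0.8, 1.0, 1.2):
    print(exponent, [round(x, 2) for x in doubling_times(exponent)])
```

Analytically, each doubling takes 2^(1 − exponent) times as long as the previous one. With exponent 1.2 the doubling times shrink by roughly 13% each time (and the continuous model blows up in finite time), with 1.0 they stay near ln 2 / δ ≈ 6.93, and with 0.8 they lengthen. Which regime reality is in depends on the empirically estimated returns, which is why the answer could have come out the other way.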
Are there any recommended readings for this point in particular? I tried searching for Shulman’s writing on the topic but came up empty. (Sorry if I missed some!)
This seems to me a key point that most discourse on AI/AGI overlooks. For example, LeCun argues that, at current rates of progress, human-level AI is 30+ years away (if I remember him correctly). He could be right about the technological distance yet wrong about the temporal distance if AI R&D is dramatically sped up by an intelligence explosion ahead of the HLAI milestone.
It also seems like a non-obvious point. For example, when I. J. Good coined the term “intelligence explosion”, it was conceived as the result of designing an ultraintelligent machine. So for explosion to precede superintelligence flips the original concept on its head.
That’s not quite right. What Good (a mathematician) is actually arguing is an existence/upper-bound argument:
That is, he is giving an upper-bound condition which “unquestionably” leads to an intelligence explosion: assume an AI superior to “any man however clever”; designing the ultraintelligent machine, if it exists, must have been done by either a less intelligent machine (in which case the argument is already over) or the cleverest man or some less clever man; therefore, the ultraintelligent machine must by definition be able to design a better ultraintelligent machine (since it is superior to even the cleverest man). If you did not argue that way by invoking an ultraintelligent machine, you would have only a weaker argument: perhaps the AI is as clever as the second-most clever man, but it actually needs to be as clever as the most clever man—then it would be questionable whether the intelligence explosion is possible. (And people do in fact make this kind of argument routinely, often with a computational-complexity coating.) So he simply makes a stronger assumption to close the loophole and get on with more interesting things than debating the bare possibility.
His immediately following discussion of economics and human-AI hybrid systems suggests that he doesn’t deny that merely-ordinary-human level intelligences could potentially do an intelligence explosion, and indeed, probably strongly suspects it is possible, but he just thinks that the question is irrelevant because it is too unstable an equilibrium: the payoff from merely human-level AIs is too low, and the payoff from a true ultraintelligence is so absurdly high that anyone with a human-level AI would intend to simply spend the extra money to increase its compute power to ultraintelligence* and then kick off an intelligence explosion for sure, per his previous proof. There would be little waiting around for inefficiently-human-level AIs to dick around long enough improving themselves autonomously to reach ultraintelligence. See his earlier paper where he also puts emphasis on the ‘unquestionability’ of the argument, but phrased a bit differently in terms of matching a ‘Newton’, or this part where he makes it explicitly and also makes a very familiar-sounding compute-overhang argument (I believe this is the first time this particular I. J. Good quote has been highlighted, given how hard it is to find these papers):
* interesting difference here: he seems to think that it would only require, say, 2-3x more computing power, and thus only 2-3x more budget, to go from human-level intelligence to ultraintelligence, noting that the human brain is about that much of a factor larger than a chimpanzee brain. This is a reasonable claim: if an AI company like OA could spend $300m instead of $100m to make a GPT-5-scale model strictly superhuman rather than GPT-4-performance level, don’t you think at least one of them would do so in a heartbeat? But people today would probably argue, on the basis of all the power-laws, that it would require much more than that because all your budget is going into training (as opposed to the 2-3x greater computing power to simply run it) and you would need more like 100x the budget. This is a disagreement worth pondering.
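A rough illustration of why the “cost to run” and “cost to train” answers diverge, assuming (as an approximation that neither Good nor the comment above states) the common rules of thumb that running a model costs about 2N FLOPs per token and training costs about 6ND FLOPs for N parameters and D training tokens, with Chinchilla-style compute-optimal training scaling D in proportion to N. Under those assumptions, a model k times larger costs k times as much to run but roughly k² times as much to train. The helper cost_multipliers is hypothetical.

```python
def cost_multipliers(k):
    """Return (inference_cost, training_cost) multipliers when a model's
    parameter count N is scaled by a factor k.

    Assumes (illustratively) inference FLOPs per token ~ 2*N, and
    compute-optimal training with tokens D scaled in proportion to N,
    so training FLOPs ~ 6*N*D grow as k**2.
    """
    inference = k        # 2*N per token: linear in N
    training = k * k     # both N and D grow by k
    return inference, training


for k in (2, 3, 10):
    run, train = cost_multipliers(k)
    print(f"{k}x params -> {run}x to run, {train}x to train")
```

So under these assumptions a 100x training budget buys only about a 10x larger model. This does not settle the disagreement: it says nothing about whether brain-size ratios map onto parameter counts, or how much larger a model “ultraintelligence” would actually require.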
“interesting difference here: he seems to think that it would only require, say, 2-3x more computing power, and thus only 2-3x more budget, to go from human-level intelligence to ultraintelligence, noting that the human brain is about that much of a factor larger than a chimpanzee brain”
That seems obviously unjustified. I expect you can reduce a human brain by 2x and still have them be essentially human. The metric could be the max intelligence you can get with a chimp number of neurons/synapses which I expect is probably an IQ 80+ human. There was an “overhang” situation with chimp brains where they could have been optimized to be much better at abstract thought with the same brainpower but it didn’t happen.