To me there seem to be many examples of good impactful strategy research that don’t introduce big new ontologies or go via illegibility:
initial case for AI takeover risk
simulation argument
vulnerable world hypothesis
standard arguments for longtermism
argument that avoiding extinction or existential risk is a tractable way to impact the long term
astronomical waste
highlighting the risk of human power grabs
importance of using AI to upgrade societal epistemics / coordination
risk from a software-only intelligence explosion
I do also see examples of big contributions that come in the form of new ontologies, like Reframing Superintelligence, but these seem less common to me.
Thanks for making this more concrete! I still disagree with you, but this list provides a good way of articulating why.
Of these, I think the initial case for AI takeover risk was the most impactful by far, and also a great example of introducing a new ontology (which includes concepts like AGI and superintelligence, the orthogonality thesis, instrumental convergence, recursive self-improvement, the nearest unblocked strategy problem, corrigibility, alignment, reward tampering, etc).
The simulation argument, by contrast, is an interesting example of an idea that is nominally very “big if true” but has actually had very little impact. One reason, I think, is that it doesn’t really change our ontologies much. We add the concept of the universe being a simulation, but in order to understand what’s inside the simulation we still use all our old concepts, and we have almost no new concepts that help us think about what’s outside the simulation. (Off the top of my head, “ancestor simulation” is the only novel concept I can recall related to this hypothesis.)
Existential risk and longtermism fall somewhere in the middle. They are also “big if true”, and they have been fleshed out to some degree (e.g. with concepts like the vulnerable world hypothesis, astronomical waste, value lock-in, etc). But I don’t think it’s a coincidence that the vast majority of work under these headings has been on preventing AI takeover. Outside that bucket, I’d say these ontologies are kinda “thin”, which has made it difficult to use them to do substantive work without funneling that work through better-developed ontologies (like AI risk).
What do I mean by “thin”? Roughly two things. Firstly, it’s unclear if the concepts in them are well-defined or well-grounded. For people on LessWrong, a good way to get a sense of such concepts is to read continental philosophy, where people often use words in vague ways that feel like they’re mostly trying to convey a vibe rather than a precise meaning. Now, analytic philosophers try to be more precise than continentals. But when reasoning about weird futures, conceptual clarity is extremely difficult to achieve; from a future perspective most strategy/futurism researchers will likely seem as confused as continental philosophers seem to LWers. For example, a concept like “extinction” seems very easy to define. But when you start to think about possible edge cases (like: are humans extinct if we’ve all uploaded? Are we extinct if we’ve all genetically modified ourselves into transhumans? Are we extinct if we’re all in cryonics and will wake up later?) it starts to seem possible that “almost all of the action” is in the parts of the concept that we haven’t pinned down.
And that becomes much more true when working with far more abstract concepts like “lock-in”. If you ask me “will the values that rule the future be locked in?” the majority of my probability mass is on “something will happen which is a very non-central example of either ‘yes’ or ‘no’ in your current ontology—in other words, your conception of lock-in is too vague for the question to be meaningful”. For more on this, see my post on why I’m not a Bayesian. Other concepts which I think are too vague or confused to be very useful for doing strategy research: AGI, LLMs (at what point have we added so much image/video/RL data that summarizing them as “language models” is actively misleading?), “human power grabs” (I expect that there will be strong ambiguity about the extent to which AIs are ‘responsible’), “societal epistemics”, “alignment”, “metaphilosophy”, and a range of others.
Secondly, though, even when concepts are well-defined, are they useful? In mathematics you can easily generate concepts which are well-defined on a technical level but totally uninteresting. One intuition pump here is personality tests: it’s very easy to score people’s personalities on a bunch of different axes, but most of the time those axes are so arbitrary that it’s hard to “hook them in” to the rest of our knowledge about how people think. Similarly, even insofar as DeepMind’s “levels of AGI” are well-defined, they’re just so arbitrary that we shouldn’t expect these concepts to carve reality at its joints. But these kinds of frameworks and taxonomies abound in futurism/strategy research.
And so, instead of trying to produce “strategy research”, I claim people should try to do the kinds of work that produced powerful ontologies in the past—whether that’s inventing theoretical frameworks like probability theory/information theory/computation theory, doing empirical research to produce new scientific fields, doing the kinds of philosophy that produced our current political ontology, etc.
Thanks for articulating your view in such detail. (This was written with transcription software. Sorry if there are mistakes!)
AI risk:
When I articulate the case for AI takeover risk to people I know, I don’t find the need to introduce them to new ontologies. I can just say that AI will be way smarter than humans. It will want things different from what humans want, and so it will want to seize power from us.
But I think I agree that if you want to actually do technical work to reduce the risks, it is useful to have new concepts that point out why the risk might arise. I think reward hacking, instrumental convergence, and corrigibility are good examples.
To me, this seems like a case where you can identify a new risk without inventing a new ontology, but it’s plausible that you need to make ontological progress to solve the problem.
Simulations:
On the simulation argument, I think that people do in fact reason about the implications of simulations, for example thinking about acausal trade or threat dynamics. So I don’t think that it hasn’t gone anywhere. It obviously hasn’t become very practical yet, but I’d attribute that to the inherent subject matter rather than to the nature of the concept.
I don’t really understand why we would need new concepts to think about what’s outside a simulation, rather than just applying the existing concepts we use to describe the physical world outside of simulations within our universe, and to describe other ways that the universe could have been.
Longtermism:
Okay, it’s helpful to know that you see these as providing new valuable ontologies to some extent.
In my mind, there is not much ontological innovation going on in these concepts, because they can be stated in one sentence using pre-existing concepts. The vulnerable world hypothesis is the idea that, at some point, we will have developed so many technologies that one of them will allow whoever develops it to easily destroy everyone else. Astronomical waste is the idea that there is a massive amount of stuff in space, but that if we wait a hundred years before grabbing it all, we will still be able to grab pretty much just as much stuff, so there is no need to rush.
To be clear, I think that this work is great. I just thought you had something more illegible in mind by what you consider to be ontological progress. So maybe we’re closer to each other than I thought.
Extinction:
It sometimes seems to me like you jump to the conclusion that all the action is in the edge cases without actually arguing for it. According to most of the traditional stories about AI risk, everyone does literally die. And in worlds where we align AI, I do expect that people will be able to stay in their biological forms if they want to.
Lock-in:
I’m sympathetic to the idea that there’s useful work to do in finding a better ontology here.
Human power grabs:
I’ve seen you say a lot that there will be strong ambiguity about the extent to which AIs are responsible, but I still haven’t seen you actually argue for it convincingly. It seems totally possible that alignment will be easy, and that the only force behind the power grab will come from humans, with AIs only doing it because humans train them to. It also seems plausible that the humans who develop superintelligence don’t try a power grab, but that the AI is misaligned and does so itself. In my mind, both of the pure-case scenarios are very plausible. Again, it seems to me like you’re jumping to the conclusion that all the action is in the edge cases, without arguing for it convincingly.
Separating out the two is useful for thinking about mitigations, because there are certain technical mitigations you would do for misaligned AI that don’t help with human motivation to seek power, and certain technical and governance mitigations you would do if you’re worried about humans seeking power that would not help with misaligned AIs.
Epistemics:
It seems pretty plausible to me that if you improved our fundamental understanding of how societal epistemics works, that would really help with improving it. At the same time, I think identifying that this is a massive lever over the future is important strategy work even if you haven’t yet developed the new ontology. This might be like identifying that AI takeover risk is a big risk without developing the ontology needed to, say, solve it.
Zooming out:
In general, a theme here is that I find myself more sympathetic to your claims when we need to fully solve a very complex problem like alignment, but I disagree that you need new ontologies to identify new, important problems.
I like the idea that you could play a role translating between the pro-illegibility camp and the people more sympathetic to legibility, because I think you are a clear writer but certainly seem drawn to illegible things.
It seems like one crux might be that I place much less value on “articulating a case” than you do. Or maybe another way of putting it is that you draw a cleaner boundary between “articulating a case” and “actually do technical work”, whereas I think of them as pretty continuous when done well.
(Note that this also puts me in disagreement with many rationalists. Many rationalists treat “the case for high P(doom)” as a pretty reliable set of ideas, and then “alignment research” as something we’re very confused about. Whereas from my perspective, these two things are intimately related—developing the concepts and ontology required for a robust case for high P(doom) would actually get us most of the way towards solving the alignment problem.)
Not quite sure how to justify my position here, but one intuition is something like “taking crucial considerations seriously”. It really does seem that people’s overall conclusions about what’s good or bad can be flipped by things they haven’t thought of. For example, I notice that many people pay lip service to the idea that the AI safety movement has significantly accelerated AI capabilities (and that AI governance has polarized the current administration against AI safety), but almost nobody is actually trying to figure out how to systematically avoid making additional mistakes like that.
So how do you make your conclusions and strategies less fragile? Well, here are some domains in which it’s possible to draw robust conclusions: physics, statistics, computer science, etc. Each of these fields has deep ontologies that have been tested against reality in many different ways, such that it’s very hard to imagine their concepts and arguments being totally wrongheaded. Even then, you still have “crucial considerations” like relativity which totally change how you interpret fundamental concepts in those fields—but importantly, Newtonian mechanics was robust enough that even a reconceptualization of core concepts like “space” and “time” actually changed very few of its practical conclusions.
I think it’s also worth mentioning the social element. Making a case of the form “people should be worried about X/people should work on X” is a satisfying thing which gains people (short-term) clout and influence. Conversely, trying to deeply understand X is harder and riskier. This is part of why it seems like there’s a misallocation of effort towards raising awareness of risks rather than trying to solve them (even within what’s usually called “AI safety research”—e.g. evals are mostly in the former category).
To be fair, you also get short-term clout for being a grumpy contrarian, like I’m being now. So I think that if I spend too much time or effort doing so, there’s probably something suspicious about that. Looking back, it seems like I started my current grumpy contrarian arc around mid-2024, when I published this post and this post. So it’s been a year and a half, which is quite a long time! Ideally I’ll stop making posts about which research I dislike within the next 6 months or so, and limit my focus to discussing research I like (or ideally just producing research that speaks for me).
You say there is not much ontological innovation in these concepts because they can be stated in one sentence using pre-existing concepts. But so can the concept of evolutionary fitness, or Galileo’s laws of motion. Yet those are huge ontological innovations—and in general a lot of ontological innovations are very simple in hindsight. (In Galileo’s case, even the realization “stuff on earth follows the same rules as stuff in space” was a huge step forward.) The important part is not that each individual concept is novel or complex, but rather that you get a set of interlocking concepts which bind together to allow you to generate good explanations and predictions.
To be clear, I did say “these ontologies are kinda ‘thin’, which has made it difficult to use them to do substantive work”. So yeah, not useless but also not central examples of valuable ontological progress.
In the case of the extinction example I only said it was “possible” that all the action is in the edge cases. I use this as an example to illustrate that even concepts which seem really solid are still more flimsy than you’d think, not as an example of a concept that we should discard because it’s near-useless. Sorry for the lack of clarity.
Conversely, in the case of human power grabs you’re right that I’m making a claim which I haven’t fully argued for. I think the best way to make this argument is just to continue developing my own theories for understanding power grabs (e.g. this kind of thinking, but hopefully much more rigorous). Might take a while unfortunately.
One other thing is that I’d have guessed that the sign uncertainty of historical work on AI safety and AI governance is much more related to the inherent chaotic nature of social and political processes than to a particular deficiency in our concepts for understanding them.
I’m sceptical that pure strategy research could remove that sign uncertainty, and I wonder if it would take something like the ability to run loads of simulations of societies like ours.
Everything seems inherently chaotic until you understand it well! The motion of the planets across the sky seems very arbitrary until you understand Kepler’s laws, and so on.
Re “simulations”, the easiest way to build a simulation of something is to have a principled model/theory of it.
But things can be inherently chaotic too!
Agree it’s unclear how much is inherent.
Thanks for this!
I do agree that the history of crucial considerations provides a good reason to favour ‘deep understanding’.
I also agree that you plausibly need a much deeper understanding to get to above 90% on P(doom). But I don’t think you need that to get to the action-relevant thresholds, which are much lower.
I’d be interested in learning more about your power grab threat models, so let me know if and when you have something you want to share. And TBC I think you’re right that in many scenarios it will not be clear to other people whether the entity seeking power is ultimately humans or AIs—my current view is that the two possibilities are distinct, and it is plausible that just one of them obtains pretty cleanly.