Scientific-style risk assessments should be positioned within the frame of some science. For all the examples you gave (atomic bomb, vacuum decay, etc.), that science was obviously physics.
To formally assess and quantify the risks of misaligned AI, the calculations should be done within the frame of a science of intelligence (i.e., “abstract” cognitive science, not tied to the specifics of the human brain) and agency.
No such general science of intelligence exists at the moment. There are frameworks that aspire to describe intelligence, agency, and sentience in general, such as Active Inference (Friston et al., 2022), thermodynamic ML (Boyd et al., 2022), MCR^2 (Ma et al., 2022), and, perhaps, Vanchurin’s “neural network toolbox for describing the world and agents within it”. Of these, I’m somewhat familiar only with Active Inference and Vanchurin’s work, and it seems to me that both are too general to serve as the basis of concrete risk calculations. I’m not familiar with the thermodynamic ML and MCR^2 work at all (a recurrent note: it would be very useful if the alignment community had people who are deeply familiar with and conversant in these frameworks).
Then, there are more concrete engineering proposals for A(G)I architectures, such as Transformer-based LLMs, RL agents of specific architectures, or LeCun’s H-JEPA architecture. Empirical and more general theoretical papers have been published about the risks of goal misgeneralisation in RL agents (I believe these works are referenced in Ngo et al.’s “The Alignment Problem from a Deep Learning Perspective”). I hope it is at least in part due to this (although he doesn’t explicitly acknowledge it, as far as I know) that LeCun is sceptical about RL and advocates for minimising its role in intelligence architectures.
Theoretical analysis of the risks of the H-JEPA architecture hasn’t been done yet (let alone empirical analysis, because H-JEPA agents haven’t been constructed yet); however, H-JEPA belongs to a wider class of intelligence architectures that employ energy-based modelling (EBM) of the world. EBM agents have been constructed and could already be analysed empirically for their alignment properties. It’s worth noting that energy-based modelling is conceptually rather similar to Active Inference, and Active Inference agents have been constructed too (Fountas et al., 2020), and thus could be probed empirically for their alignment properties as well.
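To make the core idea of energy-based modelling concrete, here is a minimal sketch: the model assigns a scalar energy to each candidate state, and lower energy means “more plausible”. The quadratic energy function and the candidate states below are purely illustrative assumptions, not H-JEPA or any published agent.

```python
import numpy as np

rng = np.random.default_rng(0)

def energy(x: np.ndarray) -> np.ndarray:
    # Hypothetical energy function: states nearer the origin are preferred.
    return 0.5 * np.sum(x**2, axis=-1)

# Candidate states the model is scoring.
states = rng.normal(size=(5, 3))

# Boltzmann weighting: p(x) is proportional to exp(-E(x)).
e = energy(states)
p = np.exp(-e) / np.sum(np.exp(-e))

# The lowest-energy state receives the highest probability.
assert np.argmin(e) == np.argmax(p)
```

Probing such a model empirically then amounts to inspecting which states it assigns low energy to, which is the kind of alignment-relevant analysis the paragraph above has in mind.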
LLMs are also tested empirically for alignment (e.g., OpenAI’s evals, the Machiavelli benchmark). Theoretical analysis of alignment in future LLMs is probably bottlenecked on the maturation of theories of Transformers and auto-regressive LLMs in general, such as Roberts, Yaida, and Hanin’s deep learning theory (2021), Anthropic’s mathematical framework for transformers (2021), Edelman et al.’s theory of “sparse variable creation” in transformers (2022), Marciano’s theory of DNNs as a semi-classical limit of topological quantum NNs (2022), Bahri et al.’s review of statistical mechanics of deep learning (2022), and other general theories of mechanistic interpretability and representation in DNNs (Räukur et al., 2022). See here for a more detailed discussion.
The fundamental limitation of analysing the risks of any specific architecture is that once an AGI starts to rewrite itself, it could probably easily depart from its original architecture. Then the question is whether we can organisationally, institutionally, and economically prevent AGIs from taking control of their own self-improvement. This question doesn’t have a scientific answer, and it’s impossible to rigorously put a probability on this event. I would only say that if AGI is available as open source, I think the probability of this is approximately 100%: surely, some open-source hacker (or, more likely, dozens of them) will task the AGI with improving itself, just out of curiosity or due to other motivations.
Likewise, even if AGI is developed strictly within tightly controlled and regulated labs, it’s probably impossible to rigorously assess the probability that the labs will drift into failure (e.g., will accidentally release an unaligned version of AGI through an API), or will “leak” the AGI code or parameter weights. I suspect the majority of AI x-risk in the assessments of MIRI people and others is not even purely technical (“as if we had a perfect plan and executed it perfectly”), but comes from these kinds of execution failures, lapses in organisational resilience, theft, governmental or military coercion, and other forms of not strictly technical, and therefore not formally assessable, risk.
Scientific-style risk assessments should be positioned in the frame of some science. For all the examples that you gave (atomic bomb, vacuum decay, etc.) that was obviously physics...No such general science of intelligence exists at the moment.

This is a very good point. IDK how much less true this is becoming, but I agree that one reason it’s hard to make a comprehensive quantitative case rather than a scattered (though related) set of less rigorous arguments is that there are so many unknown interactions. The physicists inventing atomic weapons knew they needed to be sure they wouldn’t ignite the atmosphere, but at least they knew what it meant to calculate that. Biologists and epidemiologists may not be able to immediately tell how dangerous a new pathogen will be, but at least they have a lot of data and research on existing pathogens and past outbreaks to draw from.
Instead, we’re left relying on more abstract forms of reasoning, which feel less convincing to most people and are much more open to disagreement about reference classes and background assumptions.
Also, AFAICT, TC’s base rate for AI being dangerous seems to be something analogous to, “Well, past technologies have been good on net, even dangerous ones, so [by some intuitive analog of Laplace’s rule of succession] we should expect this to turn out fine.” Whereas mine is more like, “Well, evolution of new hominins has never gone well for past hominins (or great apes) so why should this new increase in intelligence go better for us?” combined with, “Well, we’ve never yet been able to write software that does exactly what we want it to for anything complex, or been able to prevent other humans from misusing and breaking it, so why should this be different?”
Yup, this is the point of Scott Alexander’s “MR Tries The Safe Uncertainty Fallacy”.
BTW, what is “TC”?
Tyler Cohen. I only used initials because the OP did the same.
And yes, I read that post, and I’ve seen similar arguments a number of times, and not just recently. They’re getting a lot sharper recently for obvious reasons, though.
TC is Tyler Cowen.
I don’t think the base rates are crazy—the new evolution of hominins one is only wrong if you forget who ‘you’ is. TC and many other people are assuming that ‘we’ will be the ‘you’ that are evolving. (The worry among people here is that ‘they’ will have their own ‘you’.)
And the second example, writing new software that breaks: that is the same as making any new technology. We have done this before, and we were fine last time. Yes, there were computer viruses; yes, some people lost fingers in looms back in the day. But it was okay in the long run.
I think people arguing against these base rates need to do more work. The base rates are reasonable; it is the lack of updating that makes the difference. So let’s help them update!
I think updating against these base rates is the critical thing.
But it’s not really an update. The key difference between optimists and pessimists in this area is the recognition that there are no base rates for something like AGI. We have developed new technologies before, but we have never developed a new species before.
New forms of intelligence and agency are a completely new phenomenon. So if you wanted to ascribe a base rate to our surviving this with zero previous examples, you’d put it at 0.5. If you counted all of the previous hominin extinctions as relevant, you’d actually put the base rate much lower.
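For concreteness, the arithmetic behind the “intuitive analog of Laplace’s rule of succession” mentioned above can be sketched in a few lines. The count of extinct hominin lineages used below is purely illustrative, not a real estimate.

```python
def laplace_rule(successes: int, trials: int) -> float:
    """Laplace's rule of succession: (s + 1) / (n + 2), the estimated
    probability that the next trial succeeds after s successes in n trials."""
    return (successes + 1) / (trials + 2)

# With zero previous examples, the rule gives the uninformative prior:
print(laplace_rule(0, 0))  # 0.5

# If (purely illustratively) we treat, say, 8 extinct hominin lineages
# as 8 "failures" in 8 trials, the estimated survival probability drops:
print(laplace_rule(0, 8))  # 0.1
```

Which of these two calculations is the right one depends entirely on the reference class you choose, which is exactly the disagreement at issue in this thread.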
This really seems like the relevant comparison. Tools don’t kill you, but strange creatures do. AGI will be a creature, not a tool.
Instead, we’re left relying on more abstract forms of reasoning

See, the frustrating thing is, I really don’t think we are! There are loads of clear, concrete things that can be picked out and expanded upon. (See my sibling comment also.)
Honestly not sure if I agree or not, but even if true, it’s very hard to convince most people even with lots of real-world examples and data. Just ask anyone with an interest in the comparative quantitative risk assessment of different electricity sources, or ways of handling waste, and then ask them about the process of getting that permitted and built. And really, could you imagine if we subjected AI labs even to just 10% of the regulation we put in the way of letting people add a bedroom or bathroom to their houses?
Also, it doesn’t take a whole lot of abstraction to be more abstract than the physics examples I was responding to, and even then I don’t think we had nearly as much concrete data as we probably should have had about the atmospheric nitrogen question. (Note that the H-bomb developers also did the math that made them think lithium-7 wouldn’t contribute to yield, and were wrong. Not nearly as high stakes, so maybe they weren’t as careful? But still disconcerting.)
Thanks very much for this thorough response!
One thing though—in contrast to the other reply, I’m not so convinced by the problem that

No such general science of intelligence exists at the moment.
This would be like the folks at Los Alamos saying, ‘well, we need to model the socioeconomic impacts of the bomb; plus, we don’t even know what happens to a human subjected to such high pressures and temperatures, so we need a medical model and a biological model’, etc.
They didn’t have a complete science of socioeconomics. Similarly, we don’t have a complete science of intelligence. But I think we should be able to put together a model of some core step of the process (maybe within the realm of physics as you suggest) that can be brought to a discussion.
But thanks again for all the pointers, I will follow some of these threads.