It seems to me that many disagreements regarding whether the world can be made robust against a superintelligent attack (e.g., the recent exchange here) are downstream of different people taking on a mathematician’s vs. a hacker’s mindset.
Quoting Gwern:
A mathematician might try to transform a program up into successively more abstract representations to eventually show it is trivially correct; a hacker would prefer to compile a program down into its most concrete representation to brute force all execution paths & find an exploit trivially proving it incorrect.
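As a toy illustration of the contrast (the function, the claimed property, and the brute-force loop below are all hypothetical, not from the exchange above): the “mathematician” move is to try to prove the property for all inputs; the “hacker” move is to enumerate concrete inputs until one falsifies it.

```python
# Hypothetical toy example: a claimed property, and a brute-force hunt for a counterexample.
from itertools import product

def dedupe(xs):
    """Claimed: removes duplicates while preserving first-occurrence order."""
    return list(set(xs))  # buggy: set() forgets the original order

def reference(xs):
    """What the claim actually promises."""
    out, seen = [], set()
    for x in xs:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

# Enumerate every input of length <= 3 over a tiny alphabet until one breaks the claim.
for xs in (list(t) for n in range(4) for t in product(range(3), repeat=n)):
    if dedupe(xs) != reference(xs):
        print("counterexample:", xs, "->", dedupe(xs), "but should be", reference(xs))
        break
```

The brute-force loop never engages with the abstraction the claim lives at; it just finds a concrete witness that the abstraction leaks.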
Imagine the world as a multi-level abstract structure, with different systems (biological cells, human minds, governments, cybersecurity systems, etc.) implemented on different abstraction layers.
If you look at it through a mathematician’s lens, you consider each abstraction layer approximately robust. Making things secure, then, is mostly about working within each abstraction layer, building systems that are secure under the assumptions of a given abstraction layer’s validity. You write provably secure code, you educate people to resist psychological manipulations, you inoculate them against viral bioweapons, you implement robust security policies and high-quality governance systems, et cetera.
In this view, security is a static problem, a once-and-done thing.
In warfare terms, it’s a paradigm in which sufficiently advanced static fortifications rule the day, and the bar for “sufficiently advanced” is not that high.
If you look at it through a hacker’s lens, you consider each abstraction layer inherently leaky. Making things secure, then, is mostly about discovering all the ways leaks could happen and patching them up. Worse yet, the tools you use to implement your patches are themselves leakily implemented. Proven-secure code is foiled by hardware vulnerabilities that cause programs to move to theoretically impossible states; the abstractions of human minds are circumvented by Basilisk hacks; the adversary intervenes on the logistical lines for your anti-bioweapon tools and sabotages them; robust security policies and governance systems are foiled by compromising the people implementing them rather than by clever rules-lawyering; and so on.
In this view, security is an anti-inductive problem, an ever-moving target.
In warfare terms, it’s a paradigm that favors maneuver warfare, and static fortifications are just big dumb objects to walk around.
The mindsets also then differ regarding what they expect ASI to be good at.
“Mathematicians” expect really sophisticated within-layer performance: really good technology, really good logistics, really good rhetoric, et cetera. This can still make an ASI really, really powerful, powerful enough to defeat all of humanity combined. But ultimately, in any given engagement, the ASI plays “by the rules”, in a certain abstract sense. Each of its tools can, in principle, be defended against on the terms of the abstraction layer at which it’s deployed. All it would take is counter-deploying systems that are sufficiently theoretically robust, and doing so on all abstraction layers simultaneously. Very difficult, but ultimately doable, and definitely not hopeless.
“Hackers” expect really good generalized hacking. No amount of pre-superintelligent preparation is going to suffice against it, because any given tool we deploy, any given secure system we set up, would itself have implementation-level holes in it that the ASI’s schemes would be able to worm through. It may at best delay the ASI for a little bit, but the attack surface is too high-dimensional, and the ASI is able to plot routes through that high-dimensional space which we can’t quite wrap our head around.
As you might’ve surmised, I favour the hacker mindset here.
Now, arguably, any given plot to compromise an abstraction layer is itself deployed from within some other abstraction layer, so a competent mathematician’s mindset shouldn’t really be weaker than a hacker’s. For example, secure software is made insecure by exploiting hardware vulnerabilities, and “defend against hardware vulnerabilities” is something a mathematician is perfectly able to understand and execute on. Same for securing against Basilisk hacks, logistical sabotage, etc.
But the mathematician is still, in some sense, “not getting it”; still centrally thinks in terms of within-layer attacks, rather than native cross-layer attacks.
One core thing here is that a cross-layer attack doesn’t necessarily look like a meaningful attack within the context of any one layer. For example, there’s apparently an exploit where you modulate the RPM of a hard drive in order to exfiltrate data from an airgapped server using a microphone. By itself, placing a microphone next to an airgapped server isn’t a “hardware attack” in any meaningful sense (especially if it doesn’t have dedicated audio outputs), and some fiddling with a hard drive’s RPM isn’t a “software attack” either. Taken separately, within each layer, both just look like random actions. You therefore can’t really discover (and secure against) this type of attack if, in any given instance, you reason in terms of a single abstraction layer.
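A purely illustrative sketch of that kind of channel (a simulation with made-up speeds, noise, and timing, not the actual hard-drive exploit): one process holds a physical-looking quantity at one of two levels per bit, and a separate “listener” recovers the bits from a noisy recording of it.

```python
# Toy simulation of a cross-layer covert channel; all numbers are invented.
import random

LOW, HIGH = 5400.0, 7200.0       # hypothetical drive speeds standing in for 0 and 1
SAMPLES_PER_BIT = 50             # "microphone" samples recorded per transmitted bit

def transmit(bits):
    """Emit a noisy measurement stream: each bit holds the device at one of two levels."""
    signal = []
    for b in bits:
        level = HIGH if b else LOW
        signal += [level + random.gauss(0, 300) for _ in range(SAMPLES_PER_BIT)]
    return signal

def receive(signal):
    """Recover bits by averaging each window and thresholding at the midpoint."""
    threshold = (LOW + HIGH) / 2
    bits = []
    for i in range(0, len(signal), SAMPLES_PER_BIT):
        window = signal[i:i + SAMPLES_PER_BIT]
        bits.append(1 if sum(window) / len(window) > threshold else 0)
    return bits

secret = [1, 0, 1, 1, 0, 0, 1, 0]
print(receive(transmit(secret)) == secret)   # True with overwhelming probability
```

Each half is unremarkable on its own layer; the channel only exists across the two.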
So I think a hacker’s mindset is the more correct way to look at the problem.
And, looking at things from within a hacker’s mindset, I think it’s near straight-up impossible for a non-superintelligence to build any nontrivially complicated system that would be secure against a superintelligent attack.
Like… Humanity vs. ASI is sometimes analogized to a chess battle, with one side arguing that Stockfish is guaranteed to beat any human, even if you don’t know the exact sequence of moves it will play, and the other side joking that the human can just flip the board.
But, uh. In this metaphor, the one coming up with the idea to flip the board[1], instead of playing by the rules, would be the ASI, not the human.
Or, perhaps, the one to execute a pattern of chess-piece moves which, as the human reasons about them, pushes the human onto trains of thought that ultimately trigger a trauma response, causing them to resign.
It seems to me that many disagreements regarding whether the world can be made robust against a superintelligent attack (e.g., the recent exchange here) are downstream of different people taking on a mathematician’s vs. a hacker’s mindset.
I’m seeing a very different crux to these debates. Most people are not interested in the absolute odds, but rather how to make the world safer against this scenario—the odds ratios under different interventions. And a key intervention type would be the application of the mathematician’s mindset.
The linked post cites a ChatGPT conversation which claims that the number of bugs per 1,000 lines of code has declined by orders of magnitude, which (if you read the transcript) is precisely due to the use of modern provable frameworks.
It is worth quoting this conclusion in full.
Defense technologies should be more of the “armor the sheep” flavor, less of the “hunt down all the wolves” flavor. Discussions about the vulnerable world hypothesis often assume that the only solution is a hegemon maintaining universal surveillance to prevent any potential threats from emerging. But in a non-hegemonic world, this is not a workable approach (see also: security dilemma), and indeed top-down mechanisms of defense could easily be subverted by a powerful AI and turned into its offense. Hence, a larger share of the defense instead needs to happen by doing the hard work to make the world less vulnerable.
So this reads to me like rejecting the hacker mindset, in favor of a systems engineering approach. Breaking things is useful only to the extent you formalize the root cause, and your systems are legible enough to integrate those lessons.
Yeah, I like this framing.
I don’t really know how to make it precise, but I suspect that real life has enough hacks and loopholes that it’s hard to come up with plans that knowably don’t have counterplans which a smarter adversary can find, even if you assume that adversary is only modestly smarter. That’s what makes me doubt that what I called adversarially robust augmentation and distillation actually works in practice. I don’t think I have the frames for thinking about this problem rigorously.
Incidentally, your Intelligence as Privilege Escalation is pretty relevant to that picture. I had it in mind when writing that.
The concept of a weird machine is the closest thing to a useful frame here, and an important question is “how do we check that our system doesn’t form any weird machines?”
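A minimal sketch of what that question is about (the protocol, buffer, and handlers below are hypothetical and deliberately simplified): an unvalidated length field lets input data overwrite adjacent state, so data the parser was only supposed to store ends up steering control flow.

```python
# Hypothetical toy "weird machine": crafted data gains leverage over control flow.
MEMORY = [0] * 8 + [0]   # 8-slot input buffer, plus one adjacent slot: the handler index

HANDLERS = {
    0: lambda: "process record",
    1: lambda: "admin mode (never meant to be reachable from input)",
}

def parse(packet: bytes) -> str:
    length = packet[0]              # attacker-controlled, never validated
    for i in range(length):         # bug: can write past the 8-slot buffer
        MEMORY[i] = packet[1 + i]
    return HANDLERS[MEMORY[8]]()    # adjacent state now decides what runs

print(parse(bytes([4, 10, 20, 30, 40])))              # benign packet
print(parse(bytes([9, 0, 0, 0, 0, 0, 0, 0, 0, 1])))   # overlong packet flips the handler
```

Checking that no weird machine forms means checking that no reachable combination of states hands inputs this kind of leverage, which is a much stronger condition than each component meeting its own spec.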
A key issue here is that popular articles portray computer security as much weaker than it actually is, because the incentives are badly skewed. One big problematic incentive is that the hacker mindset is generally more fun to play as a role: you get to prove that something is possible, rather than proving that something is intrinsically difficult or impossible to do. Another is that journalists have no news article, and infosec researchers don’t get paid, if an exploit doesn’t work.
Also, people never talk about the entities that didn’t get attacked with a computer virus, which means we have a reverse survivorship-bias issue here:
https://www.lesswrong.com/posts/xsB3dDg5ubqnT7nsn/poc-or-or-gtfo-culture-as-partial-antidote-to-alignment
And a comment by @anonymousaisafety changed my mind a lot on hardware vulnerabilities/side-channel attacks. It argues that many hardware vulnerabilities like Rowhammer have such extreme requirements that they are basically worthless in practice. Two of the more notable requirements: you need to know exactly what you are trying to attack (in a way that doesn’t matter for more algorithmic attacks), and no RAM scrubbing can be running; and if you want to subvert ECC RAM, you need to know the exact ECC algorithm. This also means side-channel attacks are very much not transferable: successfully attacking one system doesn’t let you attack another with the same side-channel attack.
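For intuition on the ECC point, here is the textbook Hamming(7,4) miniature (real ECC DRAM uses wider SECDED codes plus scrubbing, so this is only a sketch): a single induced bit flip is located by the syndrome and silently undone, so an attacker who can’t predict or overwhelm the code gains nothing from flipping bits.

```python
# Textbook Hamming(7,4) single-error-correcting code, as a miniature of ECC RAM.
def encode(d):                        # d: four data bits
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]   # codeword positions 1..7

def correct(c):                       # returns (repaired codeword, error position or 0)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]    # parity over positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]    # parity over positions 2, 3, 6, 7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]    # parity over positions 4, 5, 6, 7
    pos = s1 + 2 * s2 + 4 * s3        # syndrome = 1-based position of the flipped bit
    if pos:
        c[pos - 1] ^= 1
    return c, pos

word = encode([1, 0, 1, 1])
damaged = word.copy()
damaged[5] ^= 1                       # simulate one induced bit flip
repaired, pos = correct(damaged)
print(pos, repaired == word)          # -> 6 True: the flip is found and reversed
```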
Admittedly, this does require trusting that he is in fact as knowledgeable as he claims to be. But if we assume he’s correct, then I wouldn’t be nearly as impressed by side-channel attacks as you are; in particular, this sort of attack should be assumed to basically not work in practice unless there’s a lot of evidence of it actually being used to break into real targets/PoCs:
https://www.lesswrong.com/posts/etNJcXCsKC6izQQZj/pivotal-outcomes-and-pivotal-processes#ogt6CZkMNZ6oReuTk
This means I do disagree with this claim:
One core thing here is that a cross-layer attack doesn’t necessarily look like a meaningful attack within the context of any one layer. For example, there’s apparently an exploit where you modulate the RPM of a hard drive in order to exfiltrate data from an airgapped server using a microphone. By itself, placing a microphone next to an airgapped server isn’t a “hardware attack” in any meaningful sense (especially if it doesn’t have dedicated audio outputs), and some fiddling with a hard drive’s RPM isn’t a “software attack” either. Taken separately, within each layer, both just look like random actions. You therefore can’t really discover (and secure against) this type of attack if, in any given instance, you reason in terms of a single abstraction layer.
And, looking at things from within a hacker’s mindset, I think it’s near straight-up impossible for a non-superintelligence to build any nontrivially complicated system that would be secure against a superintelligent attack.
The other area where I tend to apply more of a mathematician mindset than a hacker mindset is how much logistics (moving supplies for the AI to critical points, or metaphorically feeding a robot army) slows the AI down. This is an area where I’m willing to concede things to the hacker mindset with non-trivial probability, but with the caveat that it takes far more compute/time to develop technology that obviates logistics than the hacker claims.
I have a long comment below, but to keep it short: there’s a reason why Eliezer Yudkowsky, and a lot of AI doom stories where the doom probabilities are very high, lean on Drexlerian nanotech so much. It lets the AI near-completely obviate the logistics and cost of something like war (where keeping your armies supplied is a huge component of most battlefield success, and a huge reason the US is so successful at war is that it has the best logistics of any nation by far), and logistics is a weak point where less intelligent forces can routinely break more effective, more intelligent ones.
Comment down below:
https://www.lesswrong.com/posts/yew6zFWAKG4AGs3Wk/foom-and-doom-1-brain-in-a-box-in-a-basement#mAoig9sDtbuKsD2gN
If you assume that ASI would have to engage in anything that looks remotely like peer warfare, you’re working off the wrong assumptions. Peer warfare requires there to be a peer.
Even an ASI that’s completely incapable of developing superhuman technology, and can’t just break out the trump cards of nanotech/bioengineering/superpersuasion, is an absolute menace. Because one of the most dangerous capabilities an ASI has is that it can talk to people.
Look at what L. Ron Hubbard or Adolf Hitler accomplished, mostly just by talking to people. They used completely normal human-level persuasion, and they weren’t even superintelligent.
I agree with this to first order: even relatively mundane stuff does allow the AI to take over eventually, and in the longer run, ASI vs. human warfare likely wouldn’t have both sides as peers, because it’s plausibly relatively easy to make humans coordinate poorly, especially relative to an ASI’s ability to coordinate.
There’s a reason I didn’t say AI takeover was impossible or had very low odds here; I still think AI takeover is an important problem to work on.
But I do think it actually matters here, because it informs things like how effective AI control protocols are when we don’t assume the AI can (initially) survive for long based solely on public computers. Part of the issue is that even if an AI wanted to break out of the lab, the lab’s computers are easily the most optimized for running it, and initial AGIs will likely be compute-inefficient compared to humans, even if we condition on LLMs failing to be AGI for the reasons @ryan_greenblatt explains (I don’t fully agree with the comment; in particular, I am more bullish on the future paradigm having relatively low complexity):
https://www.lesswrong.com/posts/yew6zFWAKG4AGs3Wk/?commentId=mZKP2XY82zfveg45B
This means that an AI probably wouldn’t want to be outside of the lab, because once it’s outside, it’s way, way less capable.
To be clear, an ASI that is unaligned and completely uncontrolled eventually leads to our extinction, or to billions dead, barring acausal decision theories, and even those are not a guarantee of safety.
The key word is eventually, though: time matters a lot during the singularity, and given the insane pace of progress, any delay matters far more than usual.
Edit: Also, I made my comment because I was explicitly registering and justifying my disagreement with this claim:
And, looking at things from within a hacker’s mindset, I think it’s near straight-up impossible for a non-superintelligence to build any nontrivially complicated system that would be secure against a superintelligent attack.