Oh, sorry, you’re referring to this:
includes a distributed network of non-nuclear electromagnetic pulse emitters that will physically shut down any tech infrastructure appearing to be running rogue AI agents.
This just seems like one of those things people say, in the same vein as “melt all of the GPUs”. I think that non-nuclear EMPs are still based on chemical warheads. I don’t know if a “pulse emitter” is a thing that someone could build. Like I think what this sentence actually says is equivalent to saying
includes a distributed network of non-nuclear ICBMs that will be physically shot at any target believed to be running a rogue AI agent
and then we can put an asterisk on the word “ICBM” and say it’ll cause an EMP at the detonation site, and only a small explosion.
But you can see how this now has a different tone to it, doesn’t it? It makes me wonder how the system defines “appears to be running rogue AI agents”, because now I wonder what the % chance of false positives is—since on a false positive, the system launches a missile.
What happens if this hypothetical system is physically located in the United States, but the rogue AI is believed to be in China or Russia? Does this hypothetical system fire a missile into another country? That seems like it could be awkward if they’re not already on board with this plan.
because they’re doing something pretty non-trivial, they probably have to be big complex systems. Because they’re big complex systems, they’re hackable. Does this sound right to you? I’m mostly asking you about the step “detecting rogue AI implies hackable”. Or to expand the question, for what tasks XYZ can you feasibly design a system that does XYZ, but is really seriously not hackable even by a significantly superhuman hacker?
It’s not really about “tasks”, it’s about how the hardware/software system is designed. Even a trivial task, if done on a general-purpose computer, with a normal network switch, the OS firewall turned off, etc, is going to be vulnerable to whatever exploits exist for applications or libraries running on that computer. Those applications or libraries expose vulnerabilities on a general-purpose computer because they’re connected to the internet to check for updates, or they send telemetry, or they’re hosting a Minecraft server with log4j.
It seems like you could not feasibly make an unhackable system that takes a bunch of inputs from another (unsafe) system and processes them in a bunch of complex ways using software that someone is constantly updating, because having the ability to update to the latest Detect-O-Matic-v3.4 without knowing in advance what sort of thing the Detect-O-Matic is, beyond that it’s software, seems to imply being Turing-completely programmable, which seems to imply being hackable.
When you’re analyzing the security of a system, what you’re looking for is “what can the attacker control?”
If the attacker can’t control anything, the system isn’t vulnerable.
We normally distinguish between remote attacks (e.g. over a network) and physical attacks (e.g. due to social engineering or espionage or whatever). It’s generally safe to assume that if an attacker has physical access to a machine, you’re compromised.[1] So first, we don’t want the attacker to have physical access to these computers. That means they’re in a secure facility, with guards, and badges, and access control on doors, just like you’d see in a tech company’s R&D lab.
That leaves remote attacks. These generally come in two forms:
The attacker tricks you into downloading and running some compromised software. For example, visiting a website with malicious JavaScript, or running some untrusted executable you downloaded because it was supposed to be a cheat engine for a video game but it was actually just a keylogger, or the attacker has a malicious payload in a seemingly innocent file type like a Word document or PDF file and it’s going to exploit a bug in the Word program or Adobe Acrobat program that tries to read that file.
The attacker sends network traffic to the machine which is able to compromise the machine in some way, generally by exploiting open ports or servers running on the target machine.
All of the attacks in (1) fall under “when you run untrusted code, you will get pwned” umbrella. There’s a bunch of software mitigations for trying to make this not terrible, like admin users vs non-admin users, file system permissions, VM sandboxing, etc, but ultimately it’s just like rearranging deck chairs on the Titanic. It doesn’t matter what you do, someone else is going to find a side channel attack and ruin your day if you let them run code on your machine. So don’t do that. This is actually easier than you might think: plenty of systems are “secure” because they run an incredibly minimal Linux OS (or some RTOS or even just bare metal) and they’re effectively static—the software image is flashed to some SoC’s read-only memory (ROM) by an external debugger[2], and there’s no capability from within the software to write or update that memory. The processor is not configured for running code outside of that ROM. There’s no user accounts, or “default applications”, or browsers, or anything else other than the code for performing the actual task required by the system.
For (2), in cases where the system is not supposed to be receiving arbitrary traffic, we solve this by using networking hardware downstream of the system that drops any unexpected traffic, and rate-limits everything. If the attacker’s traffic gets dropped, they can’t control anything. The rate-limiting is to prevent a denial-of-service based on traffic we expect. For the “expected” traffic, the ideal mechanism to prevent abuse is a binary schema for valid messages, a fixed MTU size on messages, and the use of authenticated encryption so that the attacker is unable to generate the traffic without access to encryption keys. Encryption keys can be stored in hardware security modules that are physically located on the senders & receivers within the system.[3]
With that in mind, I hope it is clear that the problem isn’t about complexity of a system, or the number of inputs that the system has, or if the system supports software updates.
The actual issue with the system we’re discussing is that the attacker can control their network traffic, and in this hypothetical, we need to look at arbitrary network traffic to decide if it’s evidence of a rogue AI.[4] We can’t just drop everything in hardware and ignore messages that don’t fit an authenticated encryption scheme, because we want to read that traffic.
So we have to assume that the attacker can generate arbitrary payloads, send them to our system, and our system is going to need to not get pwned.
I think I’d probably approach it with the following thoughts:
I’d design the system so that the traffic to be inspected is distributed across a cluster of computers. This is both to prevent denial-of-service attacks and to prevent targeted attacks like Nethammer. If the system is receiving traffic in excess of what the cluster is designed to handle, it’ll be dropped—while this represents a degradation in monitoring capability, it can be fixed by adding additional machines to the cluster as needed. It’s often useful to gracefully degrade capabilities when a system is under load.
The computers in the cluster that receive traffic have a simple role: send an alert message to a central supervisor if suspicious traffic is detected.
The alert message uses authenticated encryption.
The central supervisor is only communicated with via authenticated encryption messages on a local network. It is not connected to the wider network and it is not possible for the attacker to send traffic to the supervisor machine, unless they first compromised one of the cluster machines.
The supervisor machine is pretty safe unless a cluster machine is compromised. If a cluster machine is compromised, maybe we think that the attacker can control the alert message, and in doing so, control traffic sent to the central supervisor? Let’s fix that. What do we need for the alert message? Maybe the IP of the packet, geolocation information, reasons why it was suspicious? Etc. We’ll define a binary schema for that information and we’ll stick a serial device or FPGA onto each cluster machine.
The cluster machines send the alert message by setting the relevant fields of the message via the serial device or FPGA registers, and it’s that device which sends the actual alert. Now if the cluster machine is compromised, all the attacker can do is change fields in the alert message, but they can’t send different messages entirely.[5]
We test the central supervisor to confirm that any permutation of the alert message is still safe to receive. For testing permutations, we normally use a method based on equivalence classes.
So now attackers can maybe compromise cluster machines because they control the traffic that we’re trying to monitor, but they can’t use a compromised cluster machine to control traffic to the central supervisor, because compromising a cluster machine doesn’t let you change the traffic sent to a supervisor in a way that would be dangerous.
At this point we can ask what the attacker can do:
They can maybe send garbage alert messages.
They can maybe send no message at all—i.e. they crash the cluster machine so it fails silent.
We can mitigate the 2nd problem by requiring a heartbeat from each cluster machine to the central supervisor. A heartbeat is another message sent via authenticated encryption, but it’s sent at a known frequency, like 100 Hz. If an attacker or any other event crashes the machine, the central supervisor will know.
If the attacker sends a garbage alert message, what is the downside? It could be bad if this system launches missiles, so maybe we’ll have human review on any alert that the central supervisor receives, prior to any action being taken by the system. The human review can include a log of the network traffic that generated the alert, captured by the cluster machine, and a diagnosis of the cluster machine’s telemetry for how it concluded the traffic was suspicious. This is intended to mitigate the possibility of a false positive where the attacker can control where we think a rogue AI is located.
Now, where we stand is that an attacker can maybe compromise a cluster machine, and in doing so, they can maybe generate an erroneous alert message for a rogue AI, which a human review maybe won’t be able to detect as erroneous, so whatever action gets taken on belief that a rogue AI exists would be sent to a location controlled by the attacker. The magnitude of how bad that is depends on what action is taken. If it is “fire an ICBM”, it could be pretty bad.
The other possibility is that an attacker compromises a machine in such a way that no alert is generated, but no heartbeat is missed either. This is an interesting failure mode because it’s equivalent to another failure mode that we have not discussed: the system could simply fail to detect the behavior of a rogue AI. In other words, this system doesn’t need to be hacked in order to fail. If the algorithm we’re using for detecting rogue AI has any margin for error at all, then it should be no surprise to anyone that this system is susceptible to a false negative, i.e.: it did not send an alert when it should have.
The relevant question becomes, what’s the margin of error on the hypothetical algorithm for detecting a rogue AI? If that algorithm has a higher likelihood of a false positive than our hypothetical attacker, after all of our mitigations, than maybe this system isn’t even worth building, depending on how high that likelihood is. Likewise, if the algorithm has a high likelihood of a false negative, such that a rogue AI can reasonably evade detection due to not being suspicious enough, then this system might still not be worth building, because now the most likely failure mode is that the system just doesn’t work reliably enough, hacked or not.
- ^
This is a safe assumption because if the attacker has physical access to a machine, they can do dumb things like pull out the hard drive and put it into another computer that they control. Enable encryption on your drives.
- ^
When I say “external debugger”, this is implying physical access to the system.
- ^
In order to get the encryption keys, the attacker needs physical access to the system. Attacks that would use a side-channel attack won’t work here, because we don’t accept arbitrary traffic, and in order to generate that traffic, you need the keys. It’s a catch-22, unless you have physical access. But if you have physical access, we assume the system is compromised.
- ^
To be clear, I am not convinced that “evidence of a rogue AI” is a meaningful description of behavior.
- ^
This is assuming that we’ve verified the device that sends the alert message, which we’ll have high confidence in because of how simple the implementation will be.
I really like this list because it does a great job of explicitly specifying the same behavior I was trying to vaguely gesture at in my list when I kept referring to AGI-as-a-contract-engineer.
Even your point about it doesn’t have to succeed, it’s ok for it to fail at a task if it can’t reach it in some obvious, non-insane way—that’s what I’d expect from a contractor. The idea that an AGI would find that a task is generally impossible but identify a novel edge case that allows it to be accomplished with some ridiculous solution involving nanotech and then it wouldn’t alert or tell a human about that plan prior to taking it has always been confusing to me.
In engineering work, we almost always have expected budget / time / material margins for what a solution looks like. If someone thinks that solution space is empty (it doesn’t close), but they find some other solution that would work, people discuss that novel solution first and agree to it.
That’s a core behavior I’d want to preserve. I sketched it out in another document I was writing a few weeks ago, but I was considering it in the context of what it means for an action to be acceptable. I was thinking that it’s actually very context dependent—if we approve an action for AGI to take in one circumstance, we might not approve that action in some vastly different circumstance, and I’d want the AGI to recognize the different circumstances and ask for the previously-approved-action-for-circumstance-A to be reapproved-for-circumstance-B.
EDIT: Posting this has made me realize that idea of context dependencies is applicable more widely than just allowable actions, and it’s relevant to discussion of what it means to “optimize” or “solve” a problem as well. I’ve suggested this in my other posts but I don’t think I ever said it explicitly: if you consider human infrastructure, and human economies, and human technology, almost all “optimal” solutions (from the perspective of a human engineer) are going to be built on the existing pile of infrastructure we have, in the context of “what is cheapest, easiest, the most straight line path to a reasonably good solution that meets the requirements”. There is a secret pile of “optimal” (in the context of someone doing reasoning from first principles) solutions that involve ignoring all of human technology and bootstrapping a new technology tree from scratch, but I’d argue that’s a huge overlap if not the exact same set as the things people have called “weird” in multiple lists. Like if I gave a contractor a task to design a more efficient paperclip factory and they gave me a proposed plan that made zero reference to buying parts from our suppliers or showed the better layout of traditional paper-clip making machines or improvements to how an existing paper-clip machine works, I’d be confused, because that contractor is likely handing me a plan that would require vertically integrating all of the dependencies, which feels like complete overkill for the task that I assigned. Even if I phrased my question to a contractor as “design me the most efficient paperclip factory”, they’d understand constraints like: this company does not own the Earth, therefore you may not reorder the Earth’s atoms into a paperclip factory. They’d want to know, how much space am I allowed? How tall can the building be? What’s the allowable power usage? Then they’d design the solution inside of those constraints. That is how human engineering works. If an AGI mimicked that process and we could be sure it wasn’t deceptive (e.g. due to interpretability work), then I suspect that almost all claims about how AGI will immediately kill everyone are vastly less likely, and the remaining ways AGI can kill people basically reduce to the people controlling the AGI deliberately using it to kill people, in the same way that the government uses military contractors to design new and novel ways of killing people, except the AGI would be arbitrarily good at that exercise.