A. Harden sensors so that tampering is harder than the intended task
We could design and deploy a lot of redundant sensors, and do science and red-teaming to understand the possible principles by which those sensors can be compromised. By analogy: in the modern world it’s kind of insane to imagine a human rebellion that not only succeeded but left no trace of itself.
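As a toy illustration of what cross-checking redundant sensors might look like, here is a hypothetical sketch that flags any sensor whose reading deviates from the consensus of the others; the readings, tolerance, and interface are all made-up assumptions, not a real API:

```python
# Hypothetical sketch: cross-check readings from redundant sensors and flag
# any frame where a sensor disagrees with the consensus beyond a tolerance.
from statistics import median

TOLERANCE = 0.05  # assumed maximum allowed deviation from the median reading

def check_redundant_sensors(readings):
    """Return indices of sensors whose reading deviates from the consensus."""
    consensus = median(readings)
    return [i for i, r in enumerate(readings)
            if abs(r - consensus) > TOLERANCE]

# Example: the third sensor disagrees, so we raise an alarm rather than
# silently trusting any single channel.
suspects = check_redundant_sensors([0.98, 0.97, 0.41, 0.99])
if suspects:
    print(f"possible tampering on sensors {suspects}")
```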
Suppose the diamond in the room we’re monitoring contained a security tag, powered by incoming light, with a cryptographic secret in it, and an interface that lets us continuously run a zero-knowledge identity proof on it over an optical channel, using a laser pointed into the room as input and the (high-speed) camera as output. To successfully tamper with the camera, the AI would need access to the secret inside the security tag inside the diamond, or at least a way to reroute the channel to and from it. So it would need to steal the diamond, extract the security tag, and reverse-engineer its cryptographic secret out of it in a time shorter than the period between camera frames; or replace the diamond with an optically similar replica that still contains the security tag; or integrate the security tag and its input-output channel into its camera-spoofing system, in each case in the time between frames. None of these is physically impossible, but they could be made extremely hard to do in a small fraction of a second (say we’re using a really high-speed camera with essentially no temporal gaps between frames).
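To make the timing constraint concrete, here is a minimal sketch of the per-frame loop, using an HMAC over a fresh nonce as a cheap stand-in for the zero-knowledge identity proof; the laser and camera I/O functions are hypothetical placeholders, and the frame period is an assumed figure:

```python
# Minimal sketch of the per-frame challenge-response check. An attacker who
# wants to spoof the camera must produce a valid response to a fresh nonce
# within one frame period, which requires the tag's secret.
import hashlib
import hmac
import os
import time

FRAME_PERIOD = 0.001  # assumed seconds between frames; tampering must beat this

def verify_frame(secret: bytes, send_laser_challenge, read_camera_response) -> bool:
    nonce = os.urandom(16)                  # fresh challenge every frame
    deadline = time.monotonic() + FRAME_PERIOD
    send_laser_challenge(nonce)             # hypothetical: modulated onto the laser
    response = read_camera_response()       # hypothetical: decoded from the image
    if time.monotonic() > deadline:
        return False                        # too slow: treat as tampering
    expected = hmac.new(secret, nonce, hashlib.sha256).digest()
    return hmac.compare_digest(expected, response)
```

A real deployment would presumably use an actual zero-knowledge protocol rather than a shared-secret MAC, but the timing structure is the same: the response must arrive within the frame period, so a spoofing system has only that window to obtain the secret or reroute the channel.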
The basic idea here is that cryptographic techniques may be useful when hardening things. Whether this approach could usefully be extended to anything significantly more AGI-alignment-like than a diamond-in-a-vault scenario is unclear.
If safety literally came down to sensor hardening, I do think cryptographic mechanisms (particularly tamper-proof hardware with cryptographic secrets that destroys itself if it detects trouble) seem like a relevant tool, and it’s quite plausible that you could harden sensors even against wildly superhuman attackers.
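As a rough, hedged sketch of the destroy-on-detection idea, here the secret is held only in volatile memory and overwritten the moment any tamper signal fires; the tamper-detection hook itself, and the class and method names, are illustrative assumptions:

```python
# Hedged sketch: a key holder that zeroizes its secret on a tamper signal,
# so an attacker who breaches the enclosure finds nothing to extract.
import hashlib
import hmac

class TamperResponsiveKey:
    """Holds a secret that is destroyed the moment tampering is detected."""

    def __init__(self, secret: bytes):
        self._secret = bytearray(secret)    # mutable so it can be wiped in place

    def on_tamper_detected(self):
        # Overwrite the key material rather than merely dropping the
        # reference, so no intact copy survives in memory.
        for i in range(len(self._secret)):
            self._secret[i] = 0
        self._secret = bytearray()

    def respond(self, challenge: bytes) -> bytes:
        if not self._secret:
            raise RuntimeError("secret destroyed after tamper event")
        return hmac.new(bytes(self._secret), challenge, hashlib.sha256).digest()
```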
It’s an insane-feeling scenario—holistically I doubt it will matter for a variety of reasons, and from a worst-case perspective it’s still not something you can rely on—but I do think there’s some value in pinning these things down.
(In this particular case I think that the weak point is definitely the tagging step. I think cryptographic mechanisms help a huge amount with the issue where a model could intercept signals coming into the datacenter, but you are still in the position where you need your sensors to detect trouble so that the hardware can destroy the secret.)
I’d really like to have a better solution to alignment than one that relied entirely on something comparable to sensor hardening.
What are your thoughts on how value learning interacts with ELK? Obviously the issue with value learning is that it only helps with outer alignment, not inner alignment: you’re transforming the problem from “How do we know the machine isn’t lying to us?” to “How do we know that the machine is actually trying to learn what we want (which includes not being lied to)?” It also explicitly requires the machine to build a model of “what humans want”, and the complexity level and latent knowledge content required are fairly similar between “figure out what the humans want and then do that” and “figure out what the humans want and then show them a video of what doing that would look like”.
Maybe we should just figure out some way to do surprise inspections on the vault? :-)
I agree that it seems very bad if we build AI systems that would “prefer” to tamper with sensors (including killing humans if necessary) but are prevented from doing so by physical constraints.
I currently don’t see how to approach value learning (in the worst case) without solving something like ELK. If you want to take a value learning perspective, you could view ELK as a subproblem of the easy goal inference problem. If there’s some value learning approach that routes around this problem I’m interested in it, but I haven’t seen any candidates and have spent a long time talking with people about it.