I’ve been thinking more about ‘If you caught your AI red-handed scheming, could you actually convince others that it was real and to take it seriously?’ Buck Shlegeris wrote about this idea here: https://www.lesswrong.com/posts/YTZAmJKydD5hdRSeG/would-catching-your-ais-trying-to-escape-convince-ai

One intervention that seems robustly helpful is keeping tamper-proof / tamper-evident logs, so you can demonstrate that a given incident is in fact the actual and complete trajectory of your AI, that you are fully representing the inputs and context presented to it, and so on.
But I suspect that even the reveal of a matching, tamper-evident log of the trajectory might not be enough, unless you declared in advance that you were doing this and had a third party verify that your logging system works the way you claim it does.
One way to increase the credibility of the logs might be for AI companies to publish hashes of their internal trajectories in close to real time. Then, if a trajectory is later found to be shady, the earlier publication of its hash lends its authenticity more weight, though even then you might be accused of having patiently seeded a false trajectory in advance.
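To make the mechanism concrete, here is a minimal commit-and-reveal sketch in Python. It assumes trajectories can be serialized to bytes; the function names and the salting scheme are illustrative, not a claim about how any lab does or should implement this.

```python
import hashlib
import os

def commit(trajectory: bytes) -> tuple[bytes, bytes]:
    """Return (salt, commitment). Only the commitment is published."""
    salt = os.urandom(32)  # blinding salt: stops brute-force checks of guessable trajectories
    commitment = hashlib.sha256(salt + trajectory).digest()
    return salt, commitment

def verify(trajectory: bytes, salt: bytes, commitment: bytes) -> bool:
    """Anyone can later check a revealed trajectory against the old commitment."""
    return hashlib.sha256(salt + trajectory).digest() == commitment

# Publish c (e.g., alongside a timestamp) in close to real time; reveal the
# salt and trajectory only if an incident later needs to be disclosed.
salt, c = commit(b"<serialized agent trajectory>")
assert verify(b"<serialized agent trajectory>", salt, c)
```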
To be clear, I’m not proposing that AI companies should publish hashes of their trajectories, per se, in part because it might leak other information they don’t necessarily want to reveal.
But I do want to gesture at the mechanism that’s important to think through: ‘How do you prove that what went in and came out of your system is in fact exactly as you are representing?’
‘in part because it might leak other information they don’t necessarily want to reveal’
What other information does it release? All I can see is that it tells you how many agent trajectories they’re running. They could obfuscate that by also generating a bunch of hashes of random strings and publishing those too.
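A sketch of that obfuscation, assuming hashes are published at a fixed rate and padded with chaff; the 50% mixing ratio and the names are arbitrary, purely for illustration:

```python
import hashlib
import os
import random

def next_published_hash(pending: list[bytes]) -> str:
    """Emit one hash per tick: sometimes a queued real trajectory, otherwise
    a hash of random bytes. Outsiders can't tell which is which, so the
    published stream no longer reveals how many trajectories are running."""
    if pending and random.random() < 0.5:  # arbitrary mixing ratio
        data = pending.pop(0)
    else:
        data = os.urandom(64)  # chaff: indistinguishable from a real hash
    return hashlib.sha256(data).hexdigest()
```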
You don’t even have to publish one hash per trajectory. You can just publish a commitment every 50ms. ChatGPT thinks it is doable with zero-knowledge proofs. (transcript)
This is a good one-paragraph description:
We keep a private append-only log of system events. Every 50 milliseconds, regardless of activity, we publish a short opaque signed anchor that commits to the current hidden log state without revealing its size or whether new events occurred. Later, if an incident needs to be disclosed, we can reveal a selected entry or short segment along with a zero-knowledge proof showing that it really was part of the previously anchored log and that the log was extended only by appending, not editing or deletion.
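As a sketch of the commitment structure that description implies (minus the zero-knowledge part, which would need something like a Merkle tree plus a SNARK to open single entries without revealing the rest), here is a hash-chained append-only log with fixed-interval blinded anchors. All names here are illustrative:

```python
import hashlib
import os

class AnchoredLog:
    """Append-only hash chain with periodically published, blinded anchors."""

    def __init__(self) -> None:
        self.entries: list[bytes] = []
        self.head = hashlib.sha256(b"genesis").digest()  # commits to the whole prefix
        self.anchors: list[tuple[int, bytes, bytes]] = []  # (log length, nonce, published digest)

    def append(self, event: bytes) -> None:
        # Each new head binds the previous head, so editing or deleting an
        # earlier entry breaks every later anchor.
        self.head = hashlib.sha256(self.head + event).digest()
        self.entries.append(event)

    def publish_anchor(self) -> bytes:
        # Called on a fixed schedule (e.g., every 50 ms) regardless of activity.
        # The fresh nonce makes consecutive anchors look different even when no
        # events occurred, hiding the log's size and activity level.
        nonce = os.urandom(32)
        published = hashlib.sha256(self.head + nonce).digest()
        self.anchors.append((len(self.entries), nonce, published))
        return published

def verify_disclosure(entries: list[bytes], nonce: bytes, published: bytes) -> bool:
    """Naive full-replay check: the disclosed log must reproduce the old anchor.
    A zero-knowledge version would prove this same relation for one entry or
    segment without revealing everything else."""
    head = hashlib.sha256(b"genesis").digest()
    for event in entries:
        head = hashlib.sha256(head + event).digest()
    return hashlib.sha256(head + nonce).digest() == published
```

The append-only property comes from the chaining: once an anchor is public, the only logs consistent with it are extensions of the exact log it committed to.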
Nice nice, yeah I’m sure people could design something clever here & I’d be excited to learn if people are working on it (not a good fit for me directly)
Yeah, that’s fair that you could probably obscure the signal; maybe this is just FUD from me. But two things I was imagining:
Absolute volumes and trends over time (how much AI labor they’re using)
Relative volumes at certain times/days (how much activity dips when the human employees aren’t online tells you something about how hands-on a role they need to play)
Also just a general concern of ‘you might be leaking information that you didn’t anticipate, and so it’s more secure to say less’
Also, you’d be creating a pretty valuable target if certain hash functions do eventually fall (think how much the companies want to avoid distillation today), but maybe that’s such a bad scenario that it’s not worth accounting for?
This reminds me of another communication problem I’ve been musing on here and there. If you solved the alignment problem to a sufficient degree that it was wise for humanity to proceed with ASI, could you convince others it was real and to take it seriously? That is a message I would desperately want to reach me effectively, and I harbor concerns that it might not.
jfyi @Buck—struggled to tag in the initial quicktake