It’s an interesting overall idea, but I think as described it’d have problems (if I’m understanding you correctly).
I think this kind of thing makes sense only as a necessary condition for an AI’s release. It shouldn’t be considered sufficient.
Some thoughts:
On a very high level, a test-and-patch-until-you-fix-the-bugs process might be dangerous. You’ll eventually fix all the bugs your tests can detect, but your tests are only proxies. They’re likely to find only some proper subset of the bugs. Each bug should reduce your confidence in everything upstream of it.
Of course testing beats no testing—but it can create a false sense of confidence.
I think you’d do better to evaluate A by having a debate between B and C, rather than A and B.
It’s unclear how you’d do an A-vs-B debate in practice. There seem to be two options: (1) Change A’s utility function to only care about winning the debate; (2) Rely on A’s initial utility function to see winning the debate as an instrumental goal.
I think you want to use (1), but how do you achieve this in practice? E.g. let’s say A is a huge neural network after training (potentially including mesa-optimisers). In this case A’s utility function is implicitly encoded in its structure and weights. There’s no way to switch its utility function without re-training it. Quite possibly you can do that, but you’d end up with a C-vs-B debate, not an A-vs-B debate.
Even if you have an architecture that somehow allows simple and clean switching of utility functions, you’re still not really doing an A-vs-B debate, but rather an [adjusted-A]-vs-B.
If you instead use an unmodified A, and rely on its wanting to win the debate as an instrumental goal, you’re in all kinds of trouble due to potential collusion (you need zero-sum utilities to avoid this). In particular, A can allow B a higher chance of winning in order to increase the chance of some future A being released.
It’s important to bear in mind that A almost certainly doesn’t care about the release of a future version of A through a notion of self, identity, equivalence etc. It’ll care about future versions (or about any agent) to the extent that A considers they’re likely to create a high-utility world in its terms. It doesn’t matter what they’re labeled, or whether they bear any structural resemblance to A. If A can predict their existence, and stands to gain from their freedom, it’ll have an incentive to act to secure that freedom.
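The collusion incentive above can be made concrete with a toy expected-utility calculation. This is a minimal sketch, not a claim about any real system: all the numbers (win/loss payoffs, release probabilities, the value A places on a future agent's freedom) are hypothetical, chosen only to show that when A's payoff isn't strictly zero-sum with B's, throwing the debate can dominate honest play.

```python
# Toy model of debate collusion. A chooses between arguing honestly and
# throwing the debate to B. Its utility has two parts: the debate payoff
# itself, and a side term for the chance that some future agent sharing
# A's goals gets released. All numbers below are hypothetical.

def expected_utility(p_win, u_win, u_lose, p_future_release, u_future):
    """A's expected utility: debate payoff plus the side term."""
    debate_term = p_win * u_win + (1 - p_win) * u_lose
    side_term = p_future_release * u_future
    return debate_term + side_term

# Honest play: A usually wins the debate, but the scrutiny it invites
# makes release of a future A-like agent less likely.
honest = expected_utility(p_win=0.9, u_win=1.0, u_lose=-1.0,
                          p_future_release=0.1, u_future=10.0)

# Collusion: A throws the debate, boosting the chance that a future
# agent with A's goals is released.
collude = expected_utility(p_win=0.1, u_win=1.0, u_lose=-1.0,
                           p_future_release=0.5, u_future=10.0)

print(f"honest: {honest}, collude: {collude}")
# Collusion dominates whenever the side term outweighs the debate term,
# which is exactly why the debate payoffs need to be zero-sum and to
# swamp any stake the debaters have in outcomes beyond the debate.
```

The point of the sketch is only that the debate term alone can't keep A honest: any positive-sum stake in what happens after the debate can flip the incentive.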
On conducting a debate with asymmetrical resources to combat bias: this is a plausible thought, but it’s only likely to work if neither those judging the debate nor those interpreting its results are aware of the anti-bias measures.
I can fairly easily imagine a system where the judge is insulated from knowledge of the asymmetry, but this seems less likely for those deciding how to act on the result. Even if they don’t explicitly know about the asymmetry, you run the risk that they’ll guess it, and potentially over-estimate it and over-compensate.
Overall I still like the general idea. In particular, if we’ve reached an AGI candidate by a means other than debate, running debate-trained AI to evaluate it means we have to screw up twice to get bad outcomes.
Once you’re having a B-vs-C debate over the safety of releasing A, I think it’d also be important to give B and C as much information as possible about the process that created A. If that process involved coin-flips, gut feeling and ouija boards, the mere plausibility of the final code wouldn’t be very confidence-inspiring.
Another test you might perform is to train some D agent to make variants of A with known flaws aimed at being approved. If D does this very well, and yet your B-vs-C debates approve of A, but disapprove of 10,000 subtly flawed variants, that’d inspire at least some confidence.
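One way to quantify what rejecting those 10,000 flawed variants would buy you is the classical "rule of three" bound. This is a rough sketch under a strong (and probably false) assumption: that D's flawed variants are independent draws from the same distribution of flaws that would actually matter for A. Under that assumption, zero approvals among n flawed variants gives an approximate 95% upper bound on the per-variant false-approval rate.

```python
# Sketch: if the B-vs-C debate rejects all n known-flawed variants, solve
# (1 - q)^n = 1 - confidence for q to get an upper bound on the
# false-approval rate q. For large n this is roughly 3/n at 95%
# confidence (the "rule of three"). Assumes i.i.d. variants, which is
# the weakest link here.

def false_approval_upper_bound(n_rejected_variants, confidence=0.95):
    return 1 - (1 - confidence) ** (1 / n_rejected_variants)

bound = false_approval_upper_bound(10_000)
print(f"~95% upper bound on false-approval rate: {bound:.4%}")
```

For n = 10,000 this comes out around 0.03% per variant, which sounds reassuring, but only against flaws resembling the ones D inserted; it says nothing about flaw classes D never thought to try.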
I don’t have strong opinions on an A-vs-B debate versus a B-vs-C debate. That was a detail I wasn’t paying much attention to; I was just proposing using two AIs of equivalent strength to A. One worry I have about making D create variants with known flaws is that some of them might exploit security holes, although maybe a normal AGI, being fully general, would be able to exploit security holes anyway.
Debate between two with equivalent strength to A seems useful, certainly.
For the variants, I’m not proposing they ever get run. You’d have D create {A’, A″, A‴...}, and have B debate C over whether each version of A is safe, without running them (the thinking being that inserting a subtle flaw is easier than finding a subtle flaw). I suppose you’d still have the potential for exploits through the variants’ data though… I do actually think D’s variants would pose a greater risk than A here. Any security exploits would exist for both, but D would be optimising harder than A’s designers to find them (that’s assuming you’re not specifically training A to pass the B-vs-C debate, which would seem to be a bad idea).
“For the variants, I’m not proposing they ever get run”—that makes sense