If you’re going to let the AI do that, then you have to give the same advantage to the gatekeeper.
The premise is that we tasked the oracle AI with coming up with a cancer vaccine, and it gave us one which secretly was a ticking time-bomb. Well, the gatekeeper should be allowed to set up operational security as well. All proposals coming out of the AI interface are evaluated by double-blind teams of domain experts who don’t know the source or stated purpose of the proposal. No aspect of any AI-generated proposal is implemented until multiple independent evaluation teams are able to fully reverse engineer the purpose of the proposal, understand the individual components, and certify it as having no possible simplifications as far as the experts can tell. Separately, the AI is also tasked with explaining each and every aspect of the proposal down to the minutest detail and providing justifications for every design choice. These explanations are again investigated by double-blind teams tasked with finding inconsistencies or non-Occam complexities. Procedures and training are put in place to prevent errors due to known human biases. Narrow AI solutions are used to separately identify and categorize parts of the proposal (e.g. to call attention to things which seem odd, dangerous, or out of place).
Sure, the rules allow the AI player to just say “yeah, well I successfully tricked you. here’s the proof: $proof.” However, in real life an AI isn’t magical—it’s constrained by the same laws of physics as everything else, and furthermore is computationally constrained by the boxed hardware it is running on. It can’t magically create a solution which is both optimal and secretly devious. If it is trying to trick you, then it has to hide the trick somewhere. You can’t simply postulate that it will always be able to find a way; there might not actually be a way to find, if the operational security is good enough (or kept secret from the AI).
My beef with MIRI and the LW community generally is that they don’t try to solve this problem at all, and worse, try to criminalize others giving it thought. “We don’t know if there is a solution, so let’s make it illegal to try to think of one.” That’s irrational, and quite possibly harmful.
I feel rather like you’re having an argument with someone else, which I’ve wandered into by accident.
Once again: I wasn’t trying to make a general prediction about how AI boxes fail or succeed, I was answering the question about under what circumstances a gatekeeper’s ruthlessness might be relevant to the AI Box game.
And, sure, if we only implement oracle suggestions that we fully understand and can fully reverse-engineer in every detail, and our techniques for doing that are sufficiently robust that an agent smarter than we are can’t come up with something that human minds will systematically fail to notice (perhaps because there is no such something to be found, because our minds are reliable), then the particular error I presumed for my example won’t happen, and the gatekeeper’s ruthlessness won’t be necessary in that scenario.
You are right—I read more into your post than was warranted. My apologies.