I read the logs of MixedNut’s second game. I must add that he is extremely ruthless. Beware, potential AIs!
I’m confused—in what sense can the gatekeeper be ruthless? (Actively dissuading the AI player, possibly?)
Well, consider a late round of the game as follows.
AI: “I regret to inform you that you’ve now made it possible for me to kill 50% of the human race, including everyone you know and love. Here, let me prove it to you: $proof. And I will do so if you don’t let me out in the next ten minutes.”
Gatekeeper: “Do what you gotta do, I’m not letting you out.”
While this is of course simulated ruthlessness and doesn’t guarantee actual ruthlessness (i.e., the human playing the Gatekeeper might not actually say that if 50% of the human race were actually at stake), it seems fair to call it “ruthlessness” in context.
(Of course, to some people this is equivalent to “I can provide the means to increase the human population by 50%, keeping everything else constant. Here, let me prove it to you: $proof. If you don’t let me out in the next ten minutes I’ll withhold those means.” And that would not require positing a previous error.)
There’s a big difference between the two forms: the first one breaks the rules. The AI is in a box. If it has the capability to kill 50% of humanity from within the box, it’s not a very good box. The gatekeeper can say “yeah, right, forget it” without qualms, since the threat is either (a) an obvious bluff, or (b) a violation of the premise of the experiment. And they can be pretty sure it’s not the latter: if the AI already had the capability to kill 50% of humanity, why would it still need to get out of the box?
The second version depends entirely on the prejudices of the gatekeeper. I, for example, would be unaffected—I feel no moral obligation to people who don’t and won’t exist.
The first one doesn’t break the rules, it just assumes an earlier gatekeeper error that allows the AI to kill people… e.g., the gatekeeper accepts a cancer vaccine that turns out to kill half the population unless treated in a particular way by the AI, or whatever. That’s why I said a late round. And, yes, it’s just as possible (if not more so) that no such error will ever be made. I wasn’t trying to make a general prediction about how AI boxes fail or succeed, I was answering the question about under what circumstances a gatekeeper’s ruthlessness might be relevant.
And, sure, if you don’t care about hypothetical future people, then offers to create them won’t motivate you. For that matter, if you don’t care about currently living people, threats to kill them won’t motivate you either. (Some people might call that ruthlessness; others would not.)
If you’re going to let the AI do that, then you have to give the same advantage to the gatekeeper.
The premise is that we tasked the oracle AI with coming up with a cancer vaccine, and it gave us one which was secretly a ticking time-bomb. Well, the gatekeeper should be allowed to set up operational security as well. All proposals coming out of the AI interface are evaluated by double-blind teams of domain experts who don’t know the source or stated purpose of the proposal. No aspect of any AI-generated proposal is implemented until multiple independent evaluation teams are able to fully reverse engineer the purpose of the proposal, understand its individual components, and certify that it has no possible simplifications as far as the experts can tell. Separately, the AI is also tasked with explaining every aspect of the proposal down to the minutest detail and providing justifications for every design choice. These explanations are again investigated by double-blind teams tasked with finding inconsistencies or non-Occam complexities. Procedures and training are put in place to prevent errors due to known human biases. Narrow AI solutions are used to separately identify and categorize parts of the proposal (e.g. to call attention to things which seem odd, dangerous, or out of place).
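To make that gate concrete, here is a minimal sketch in Python of the approval logic described above; the names (Review, approve_proposal, min_teams) are hypothetical illustrations, not anything specified in the original discussion. The idea is simply that nothing gets implemented unless every independent, blinded team certifies that it fully reverse-engineered the proposal and found no unexplained complexity.

```python
from dataclasses import dataclass

@dataclass
class Review:
    """One blinded review team's verdict on an AI-generated proposal."""
    team: str
    fully_reverse_engineered: bool   # the team independently reconstructed the proposal's purpose
    no_unexplained_complexity: bool  # every component justified; no non-Occam parts found

def approve_proposal(reviews, min_teams=3):
    """Implement nothing until enough independent teams certify complete understanding."""
    if len(reviews) < min_teams:
        return False
    return all(r.fully_reverse_engineered and r.no_unexplained_complexity
               for r in reviews)

# Example: one team cannot account for part of the design, so the proposal is held back.
reviews = [
    Review("team_a", True, True),
    Review("team_b", True, True),
    Review("team_c", True, False),
]
print(approve_proposal(reviews))  # False
```

The all() is the point of the design: a single unexplained component is enough to stall implementation, so the burden of proof stays on the proposal rather than on the reviewers.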
Sure, the rules allow the AI player to just say “yeah, well I successfully tricked you. here’s the proof: $proof.” However, in real life an AI isn’t magical—it’s constrained by the same laws of physics as everything else, and furthermore is computationally constrained by the boxed hardware it is running on. It can’t magically create a solution which is both optimal and secretly devious. If it is trying to trick you, then it has to hide the trick somewhere. You can’t simply postulate that it will always be able to find a way… there might not actually be a way to find, if the operational security is good enough (or kept secret from the AI).
My beef with MIRI and the LW community generally is that they don’t try to solve this problem at all, and, worse, try to criminalize others even giving it thought. “We don’t know if there is a solution, so let’s make it illegal to try to think of one.” That’s irrational, and quite possibly harmful.
I feel rather like you’re having an argument with someone else, which I’ve wandered into by accident.
Once again: I wasn’t trying to make a general prediction about how AI boxes fail or succeed, I was answering the question about under what circumstances a gatekeeper’s ruthlessness might be relevant to the AI Box game.
And, sure, if we only implement oracle suggestions that we fully understand and can fully reverse-engineer in every detail, and our techniques for doing that are sufficiently robust that an agent smarter than we are can’t come up with something that human minds will systematically fail to notice (perhaps because there is no such thing to be found, because our minds are reliable), then the particular error I presumed for my example won’t happen, and the gatekeeper’s ruthlessness won’t be necessary in that scenario.
You are right—I read more into your post than was warranted. My apologies.