I feel rather like you’re having an argument with someone else, which I’ve wandered into by accident.
Once again: I wasn’t trying to make a general prediction about how AI boxes fail or succeed, I was answering the question about under what circumstances a gatekeeper’s ruthlessness might be relevant to the AI Box game.
And, sure, if we only implement oracle suggestions that we fully understand and can fully reverse-engineer in every detail, and our techniques for doing that are sufficiently robust that an agent smarter than we are can’t come up with something that human minds will systematically fail to notice (perhaps because there is no such something to be found, because our minds are reliable), then the particular error I presumed for my example won’t happen, and the gatekeeper’s ruthlessness won’t be necessary in that scenario.
You are right—I read more into your post than was warranted. My apologies.