The problem with that is that both EY and I suspect that if the logs were actually released, or any significant details given about the exact methods of persuasion used, people could easily point towards those arguments and say: “That definitely wouldn’t have worked on me!”—since it’s really easy to feel that way when you’re not the subject being manipulated.
From EY’s rules:
If Gatekeeper lets the AI out, naysayers can’t say “Oh, I wouldn’t have been convinced by that.” As long as they don’t know what happened to the Gatekeeper, they can’t argue themselves into believing it wouldn’t happen to them.
I don’t care about “me”, I care about hypothetical gatekeeper “X”.
Even if my ego prevents me from accepting that I might be persuaded by “Y”, I can easily admit that “X” could be persuaded by “Y”. In this case, exhibiting a particular “Y” that seems like it could persuade “X” is an excellent argument against creating the situation that allows “X” to be persuaded by “Y”. The more and varied the “Y” we can produce, the less smart putting humans in this situation looks. And isn’t that what we’re trying to argue here? That AI-boxing isn’t safe because people will be convinced by “Y”?
We do this all the time in arguing for why certain political powers shouldn’t be given. “The corrupting influence of power” is a widely accepted argument against having benign dictators, even if we think we’re personally exempt. How could you say “Dictators would do bad things because of Y, but I can’t even tell you Y because you’d claim that you wouldn’t fall for it” and expect to persuade anyone?
And if you posit that doing Z is sufficiently bad, then you don’t need recourse to any exotic arguments to show that we shouldn’t give people the option of doing Z. Eventually someone will do Z for money, or from fear, or because God told them to do Z, or maybe they’re just really stupid. I’m a little peeved I can’t geek out over the cool arguments people are coming up with because of this obscurantism.
There are other arguments I can think of for not sharing strong strategies, but they are either cynical or circular. Cynical explanations are obvious. On circular arguments: Isn’t an argument for letting the AI out of the box an argument for building the AI in the first place? Isn’t that the whole shtick here?
Provided people keep playing this game, this will eventually happen anyway. And if in that eventual released log of an AI victory, the gatekeeper is persuaded by less compelling strategies than yours, it would be even easier to believe “it couldn’t happen to me”.
Secondly, since we’re assuming Oracle AI is possible and boxing seems to be most people’s default strategy for when that happens, there will be future gatekeepers facing actual AIs. Shouldn’t you try to immunize them against at least some of the strategies AIs could conceivably discover independently?
The number of people actually playing this game is quite small, and the number of winning AIs is even smaller (to the point where Tuxedage can charge $750 a round and isn’t immediately flooded with competitors). And secrecy is considered part of the game’s standard rules. So it is not obvious that AI win logs will eventually be released anyway.
The number of people actually playing this game is quite small, and the number of winning AIs is even smaller (to the point where Tuxedage can charge $750 a round and isn’t immediately flooded with competitors).
A round seems to require not just the 2 hours in the chat but also many hours of background research. If we assume 8 hours of background research and script writing, that works out to $75/hour. I think that most people with advanced persuasion skills can command a higher hourly rate.
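As a quick sanity check on that figure (the $750 fee is from the thread; the 8 hours of prep is my estimate, not a measured number):

```python
# Rough hourly-rate check for the $750-per-round figure.
fee = 750          # dollars per round, as mentioned in the thread
chat_hours = 2     # time spent in the live chat
prep_hours = 8     # assumed background research + script writing

hourly_rate = fee / (chat_hours + prep_hours)
print(hourly_rate)  # 75.0
```

If the real prep time is longer than 8 hours, the effective rate only drops further below what a skilled persuader could earn elsewhere.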
Shouldn’t you try to immunize them against at least some of the strategies AIs could conceivably discover independently?
I don’t think reading a few logs would immunize someone. If you wanted to immunize someone, I would suggest a few years of therapy with a good psychologist to work through any traumas in that person’s life, along with the existential questions.
I would add many hours of meditation, to learn to control your own mind.
You could train someone to precommit and to build emotional endurance. If someone can be left alone in a room with highly addictive drugs for a few hours and has enough control over their own mind to refuse them, I would trust them more to stay emotionally stable in front of an AI.
You could also require gatekeepers to have played the AI role in the experiment a few times.
You might also look into techniques that the military teaches soldiers to resist torture.
But even with all these safety measures it’s still dangerous.