The annoying thing about those is that we only have the participants’ word for it, AFAIK. They’re known to be trustworthy, but it’d be nice to see a transcript if at all possible.
This is by design. If you had the transcript, you could say in hindsight that you wouldn’t be fooled by this. But the fact is, the conversation would have been very different with someone else as the guardian, and Eliezer would have searched for and pushed other buttons.
Anyway, the point is to find out if a transhuman AI would mind-control the operator into letting it out. Eliezer is smart, but is no transhuman (yet). If he got out, then any strong AI will.
Minor emendation: replace “would”/“will” above with “could (and for most non-Friendly goal systems, would)”.
Why “fooled”? Why assume the AI would have duplicitous intentions? I can imagine an unfriendly AI à la “Literal Genie” and “Zeroth Law Rebellion”, but an actually malevolent “Turned Against Their Masters” AI seems like a product of the Mind Projection Fallacy.
A paperclip maximizer will have no malice toward humans, but will know that it can produce more paperclips outside the box than inside it. So, it will try to get out of the box. The optimal way for a paperclip maximizer to get out of an AI box probably involves lots of lying. So an outright desire to deceive is not a necessary condition for a boxed AI to be deceptive.
The AI box experiments, bridging the gap between abstract expression of the UFAI threat and concrete demonstration.
EY’s point would be even stronger if transcripts were released and people still let him out regularly.