This is the same problem as
“I convinced this arbitrary agent it was utility maximising to kill itself instantly”
which can take down any agent. The thing is to create an agent which does well in most situations (‘fair cases’, in the language of the TDT document). Any agent can be defeated by sufficiently contrived scenarios.
But this way, it’s more likely to lie and deceive you in the short term.
“I convinced this GAI, which was trying to be Friendly, that it could really maximize its utility by killing all the humans.”