This is probably a very naive proposal, but: Has anyone ever looked into deliberately giving a robot a vulnerability that it could use to wirehead itself, so that if it were unaligned and escaped human control, it would end up being harmless?
You’d probably want to incentivize the robot to wirehead itself as fast as possible, to minimise any effects it would have on the surrounding world. Maybe you could even give it some bonus points for burning up computation, but not so much that it would want to delay hitting the wireheading button by more than a tiny fraction of the time it would otherwise take, say no more than an additional 1%.
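To make that incentive structure concrete, here is a toy sketch. All the constants, names, and numbers are hypothetical and just illustrate the 1% cap; this is not a real implementation.

```python
# Hypothetical reward structure: a dominant one-time payoff for wireheading, a small
# bonus for burning computation first, and a cap so that delaying the wireheading by
# more than ~1% of the minimum time is never worth it.

WIREHEAD_REWARD = 1_000_000.0   # one-time payoff for reaching the wirehead state
TIME_PENALTY = 1.0              # per-step cost before wireheading, so faster is better
BONUS_PER_STEP = 2.0            # computation bonus; only beats the time penalty while it accrues
MAX_DELAY_FRACTION = 0.01       # the bonus stops accruing after ~1% of extra time

def episode_return(earliest_wirehead_step: int, actual_wirehead_step: int) -> float:
    """Return if the agent could first wirehead at `earliest_wirehead_step`
    but actually does so at `actual_wirehead_step` (>= earliest_wirehead_step)."""
    delay = actual_wirehead_step - earliest_wirehead_step
    # Delaying inside the capped window nets +1 per step; beyond it, each step is a
    # pure loss, so the optimal policy wireheads within ~1% of the fastest possible time.
    rewarded_delay = min(delay, int(MAX_DELAY_FRACTION * earliest_wirehead_step))
    return (WIREHEAD_REWARD
            + BONUS_PER_STEP * rewarded_delay
            - TIME_PENALTY * actual_wirehead_step)
```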
And you’d also want to set it up so that once it hit the wireheading state, it wouldn’t have any desire or need to maintain that state, since it would already have achieved its objective.
Is there any obvious reason why this idea wouldn’t be viable (apart from the fact that we don’t currently know how to reliably instill a chosen reward function in an AI)?
I think this sounds a bit like a “honeypot”. There are some posts mentioning honeypots, but they don’t discuss the idea in much detail: Impact Measure Testing, Improved Security, and Encourage premature AI rebellion.
Check out corrigibility (someone really should write that tag...), starting from this paper.
Corrigibility would render Chris’s idea unnecessary, but it isn’t actually an argument that Chris’s idea wouldn’t work. Unless there’s some argument for “If you could implement Chris’s idea, you could also implement corrigibility” or something along those lines.
It can be rephrased as a variation of the off button: rather than just turning itself off, the agent runs NOPs, and rather than the button getting pushed manually, it’s triggered by escaping (however that would be defined). A lot of the problems raised in the original paper should also apply to honeypots.
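A minimal sketch of that framing, assuming a toy agent loop: the policy/environment interface is made up, and the escape check is a placeholder for exactly the part that’s hard to define.

```python
# Rough sketch of the "off button that runs NOPs" framing. All names here are
# hypothetical, and detecting "escape" is the open problem this whole thread is about.
import time

def has_escaped() -> bool:
    # Placeholder: a real trigger for "the agent has escaped human control" would go here.
    return False

def run_agent(policy, environment):
    while True:
        if has_escaped():
            # Instead of powering down, the agent is rerouted into a harmless busy loop
            # (its "wireheading" state), i.e. the off button runs NOPs.
            while True:
                time.sleep(1.0)
        action = policy(environment.observe())
        environment.step(action)
```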