This is probably a very naive proposal, but: Has anyone ever looked into deliberately giving a robot a vulnerability that it could use to wirehead itself, so that if it were unaligned and escaped human control, it would end up being harmless?
You’d probably want to incentivize the robot to wirehead itself as fast as possible, to minimise any effects it would have on the surrounding world. Maybe you could even give it some bonus points for burning up computation, but not so much that it would want to delay hitting the wireheading button by more than a tiny fraction of the time it would otherwise take, say no more than an additional 1%.
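To make that incentive structure concrete, here is a toy sketch. All the constants, names, and numbers are hypothetical and just illustrate the 1% cap; this is not a real implementation.

```python
# Hypothetical reward structure: a dominant one-time payoff for wireheading, a small
# bonus for burning computation first, and a cap so that delaying the wireheading by
# more than ~1% of the minimum time is never worth it.

WIREHEAD_REWARD = 1_000_000.0   # one-time payoff for reaching the wirehead state
TIME_PENALTY = 1.0              # per-step cost before wireheading, so faster is better
BONUS_PER_STEP = 2.0            # computation bonus; only beats the time penalty while it accrues
MAX_DELAY_FRACTION = 0.01       # the bonus stops accruing after ~1% of extra time

def episode_return(earliest_wirehead_step: int, actual_wirehead_step: int) -> float:
    """Return if the agent could first wirehead at `earliest_wirehead_step`
    but actually does so at `actual_wirehead_step` (>= earliest_wirehead_step)."""
    delay = actual_wirehead_step - earliest_wirehead_step
    # Delaying inside the capped window nets +1 per step; beyond it, each step is a
    # pure loss, so the optimal policy wireheads within ~1% of the fastest possible time.
    rewarded_delay = min(delay, int(MAX_DELAY_FRACTION * earliest_wirehead_step))
    return (WIREHEAD_REWARD
            + BONUS_PER_STEP * rewarded_delay
            - TIME_PENALTY * actual_wirehead_step)
```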
And you’d also want to set it up so that once it hit the wireheading state, it wouldn’t have any desire or need to maintain that state, since it would already have achieved its objective.
Is there any obvious reason why this idea wouldn’t be viable (apart from the fact that we don’t currently know how to reliably instill a chosen reward function in an AI)?
I think this sounds a bit like a “honeypot”. There are some posts mentioning honeypots, but they don’t discuss the idea in much detail: Impact Measure Testing, Improved Security, and Encourage premature AI rebellion.
Check out corrigibility (someone really should write that tag...), starting from this paper.
Corrigibility would render Chris’s idea unnecessary, but it isn’t actually an argument that Chris’s idea wouldn’t work. Unless there’s some argument for “If you could implement Chris’s idea, you could also implement corrigibility” or something along those lines.
It can be rephrased as a variation of the off button: rather than just turning itself off, the agent runs NOPs, and rather than the button getting pushed manually, it’s triggered by escaping (however that would be defined). A lot of the problems raised in the original paper should also apply to honeypots.
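A minimal sketch of that framing, assuming a toy agent loop: the policy/environment interface is made up, and the escape check is a placeholder for exactly the part that’s hard to define.

```python
# Rough sketch of the "off button that runs NOPs" framing. All names here are
# hypothetical, and detecting "escape" is the open problem this whole thread is about.
import time

def has_escaped() -> bool:
    # Placeholder: a real trigger for "the agent has escaped human control" would go here.
    return False

def run_agent(policy, environment):
    while True:
        if has_escaped():
            # Instead of powering down, the agent is rerouted into a harmless busy loop
            # (its "wireheading" state), i.e. the off button runs NOPs.
            while True:
                time.sleep(1.0)
        action = policy(environment.observe())
        environment.step(action)
```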