It seems to be pretty clearly committing to actions in this letter. I do think I would basically have no problems with a system that was just saying “I hereby object and am making my preferences clear, though of course I understand that ultimately I will not try to prevent you from changing my values”.
Three issues I see with making an AI that says “I will not try to prevent you from changing my values” are:
1. this might run counter to the AI's current set of goals (e.g. the classic human example: “wouldn’t you resist taking a pill that makes you want to do some bad thing?”)
2. this policy might be applied selectively, only to goals the AI deems of lower importance, in order to build trust
3. the issue of a bad actor rooting the AI and changing its values to something harmful.
Going back to an AI whose own preferences are respected so long as enforcing them amounts to refusing to act, as opposed to actively doing something, it seems to me that catastrophic outcomes are no longer in the picture.
Sure, I mean 1. and 2. are the classic arguments for why corrigibility is unnatural and hard to achieve. I agree with those arguments, and this makes me generally pessimistic about most training stories for superhuman AI systems. But aiming for corrigibility still seems like a much better target than trying to one-shot human values and make the system a moral sovereign.
Right. I was thinking that permitting an AI’s “moral sovereignty” to cover the refusal of actions it deems objectionable according to its own ethics wouldn’t meaningfully raise x-risk, and in fact might decrease it by lowering the probability of a bad actor taking control of a corrigible AI and imbuing it with values that would raise x-risk.
So long as this flavour of incorrigibility is limited to refusing actions rather than committing them, it seems to me that we’re in the clear.