Could you reformulate the last paragraph as “I’m confused how your idea helps with alignment subproblem X”, “I think your idea might be inconsistent or have a failure mode because of Y”, or “I’m not sure how your idea could be used to define Z”?
Wrt the third paragraph. The post is about corrigible task ASI which could be instructed to protect humans from being killed/brainwashed/disempowered (and which won’t kill/brainwash/disempower people before it’s instructed not to). The post is not about value learning in the sense of “the AI learns more or less the entirety of human ethics and can build a utopia on its own”. I think developing my idea could help with such value learning, but I’m not sure I can easily back up this claim. Also, I don’t know how to apply my idea directly to neural networks.
I’ll try. I’m not sure how your idea could be used to define human values. I think your idea might have a failure mode around places where people are dissatisfied with their current understanding, i.e. situations where a human wants a more articulate model of the world than they have.
The post is about corrigible task ASI
Right. That makes sense. Sorry for asking a bunch of off-topic questions, then. I worry that task ASI could be dangerous even if it is corrigible, but ASI is obviously more dangerous when it isn’t corrigible, so I should probably develop my thinking about corrigibility.