I think I have two main complaints still, on a skim.
First, I think the following is wrong:
These problems seem pretty hard, as evidenced by the above gaps in the example shard theory plan, and it’s not clear that they’re easier than inner and outer alignment.
I think outer and inner alignment both cut against known or suspected grains of cognitive tendencies, whereas shard theory stories do not. I think that outer and inner alignment decompose a hard problem into two extremely hard problems, whereas shard theory aims to address the hard problem more naturally. I will have a more specific post out in the next week or two, but I wanted to flag that I very much disagree with this quote.
Second, I’m wary of saying “maybe we can get corrigibility” or “maybe corrigibility doesn’t fit into a utility function”, because this can map shard theory hopes onto old debates where we already have settled into positions. Whereas I consider myself to be thinking about qualitatively different questions and spreads of values I might hope to get into an AI.
I think corrigibility is natural iff robust pointers to it can easily get into the AI’s goals
This doesn’t make sense to me. It sounds like saying “liking yellow cubes is natural iff we can get a pointer to ‘liking yellow cubes’ into the AI’s goals.” That sounds like something one would say if we had no idea how yellow cubes come to be liked directly, and were instead treating liking-yellow-cubes as a black box which happened to exist in the real world (e.g. how corrigibility, or the desire to help people, could be “pointed to” in a classic corrigibility hope).
I have more thoughts on this post but I don’t have time to type more for now.
I think outer and inner alignment both cut against known or suspected grains of cognitive tendencies, whereas shard theory stories do not. I think that outer and inner alignment decompose a hard problem into two extremely hard problems, whereas shard theory aims to address the hard problem more naturally. I will have a more specific post out in the next week or two, but I wanted to flag that I very much disagree with this quote.
Since the original draft, I realized your position includes “outer/inner alignment is a broken frame with mismatched type signatures, which is much less likely to work than people think,” so this seems reasonable from your perspective. I haven’t thought much about that argument and might end up agreeing with you, so the version I believe is something like: “it’s not clear that my shard theory decomposition is substantially easier than inner+outer alignment, assuming that inner+outer alignment is as valid a frame as Evan thinks it is.”
Agree that I’m not being concrete about how corrigibility would be implemented. Concreteness is a virtue and it seems good to think about this in more detail eventually.