Over the last year, I’ve thought a lot about human/AI power dynamics and influence-seeking behavior. I personally haven’t used the strategy-stealing assumption (SSA) in reasoning about alignment, but it seems like a useful concept.
Overall, the post seems good. The analysis is well-reasoned and reasonably well-written, although it’s sprinkled with opaque remarks (I marked up a Google doc with more detail).
If this post is voted in, it might be nice if Paul devoted more room to a big-picture "how does the SSA tend to fail?" discussion, drawing out commonalities among the counterexamples before enumerating them in detail. Right now, "eleven ways the SSA could fail" reads as a grab-bag of considerations.