I’m having trouble coming up with a concrete scenario that this defends against. The model needs to be misaligned enough that we can’t trust it with alignment, smart enough to develop fake alignment tools that fool human overseers while actually aligning the resulting model to its own goals rather than ours, yet not smart enough to realize when a problem it’s solving is a crucial step in the true alignment project? I expect alignment to have several irreducibly hard parts, and I expect those parts to be characteristic of alignment in particular, so this doesn’t seem like a coherent combination of capabilities to me.