BryceStansfield comments on Balancing exploration and resistance to memetic threats after AGI

BryceStansfield 7 Aug 2025 21:08 UTC
2 points
I think a different use of mechanistic interpretability (MI) is warranted here. While I highly doubt we can tell whether a value system is meshing well with someone for “good” or “bad” reasons, it seems more plausible to me that you could measure how reversible a value system is.
The distinguishing feature of a trap here isn’t so much its badness as the fact that it’s irreversible. If you used interpretability techniques to check whether someone could still be reprogrammed out of a belief, you’d avoid a lot of tricky situations.