I think a different use of MI is warranted here. I highly doubt we can tell whether a value system is meshing well with someone for “good” or “bad” reasons, but it seems more plausible to me that you could measure how reversible a value system is.
The distinguishing feature of a trap here isn’t so much that it’s bad as that it’s irreversible. If you used interpretability techniques to check whether someone could still be reprogrammed out of a belief, you’d avoid a lot of tricky situations.