Martín Soto comments on Balancing exploration and resistance to memetic threats after AGI

Martín Soto 7 Aug 2025 19:42 UTC
2 points
Potential solution via mechanistic interpretability
Sounds unlikely to me. Because the space of values is so large, I don’t expect we can fix upfront a set of “valid mental moves to justify a value”, even if these are pretty high-level abstractions. Put another way, I expect even these generators of the space of values (or of the space of “human judgements of values”) to be too numerous, and thus to face the same tension between exploration and virality.