Potential solution via mechanistic interpretability
Sounds unlikely to me. Due to the space of values being so large, I don’t expect we can fix upfront a set of “valid mental moves to justify a value”, even if these are pretty high-level abstractions. Put another way, I expect even these generators of the space of values (or the space of “human judgements of values”) to be too many, thus face the same tension between exploration and virality.
Sounds unlikely to me. Due to the space of values being so large, I don’t expect we can fix upfront a set of “valid mental moves to justify a value”, even if these are pretty high-level abstractions. Put another way, I expect even these generators of the space of values (or the space of “human judgements of values”) to be too many, thus face the same tension between exploration and virality.