gwern comments on Multi-dimensional rewards for AGI interpretability and control