That idea of catching bad mesa-objectives during training sounds key, and I presume it fits under the ‘generalization science’ and ‘robust character training’ from Evan’s original post. In the US, NIST is working to develop test, evaluation, verification, and validation standards for AI, and it would be good to incorporate this concept into that effort.
Right – I was also associating this in my mind with his ‘generalization science’ suggestion. However, I think he mainly talks about measuring / predicting generalization (and so does the referenced “Influence functions” post). My main thrust (see also the paper) is a principled methodology for causing the model to “generalize in the direction we want”.