That idea of catching bad mesa-objectives during training sounds key, and I presume it fits under the ‘generalization science’ and ‘robust character training’ from Evan’s original post. In the US, NIST is working to develop test, evaluation, verification, and validation standards for AI, and it would be good to incorporate this concept into that effort.
Right – I was also associating this in my mind with his ‘generalization science’ suggestion. However, I think he mainly talks about measuring / predicting generalization (and so does the referenced “Influence functions” post). My main thrust (see also the paper) is a principled methodology for causing the model to “generalize in the direction we want”.