evhub comments on Robustness of Model-Graded Evaluations and Automated Interpretability