Nathan Helm-Burger comments on Testbed evals: evaluating AI safety even when it can’t be directly measured