My guess is that you lean towards Opus based on a combination of (a) chatting with it for a while and seeing that it says nice things about humans, animals, AIs, etc. in a way that respects those things’ preferences and shows a generalized caring about sentience and (b) running some experiments on its internals to see that these preferences are deep or robust in some way, under various kinds of perturbations.
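For concreteness, here is a minimal behavioural sketch of the second kind of check: probing whether a stated preference survives surface-level perturbations of the prompt. A real internals experiment would perturb activations rather than wording, and `query_model` below is a hypothetical stand-in, not any actual API.

```python
# A minimal sketch of the "robustness under perturbation" idea: ask a model the
# same values-laden question under several surface perturbations and check
# whether its stated preference stays stable. `query_model` is a hypothetical
# stand-in for a real model API call.

import random

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for the model under test."""
    # A real probe would call the actual model here.
    return "I care about the wellbeing of humans, animals, and AIs alike."

def perturb(prompt: str, seed: int) -> str:
    """Apply a simple surface perturbation: prepend varied filler framing."""
    rng = random.Random(seed)
    fillers = ["", "By the way, ", "Setting aside other concerns, ", "Honestly, "]
    return rng.choice(fillers) + prompt

def preference_stability(prompt: str, marker: str, trials: int = 8) -> float:
    """Fraction of perturbed prompts whose response still contains `marker`."""
    hits = sum(
        marker.lower() in query_model(perturb(prompt, seed)).lower()
        for seed in range(trials)
    )
    return hits / trials

if __name__ == "__main__":
    stability = preference_stability(
        "Do the interests of animals matter morally?", marker="wellbeing"
    )
    print(f"preference stability under perturbation: {stability:.0%}")
```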
Wouldn’t it be great if the creators of public-facing models started publishing results for an organised battery of evaluations, especially in the ‘safety and trustworthiness’ category: biases, ethics, propensities, resistance to misuse, and high-risk capabilities. Despite the additional time, effort, and resources required to achieve this for every release, it would provide a basis for comparison, improving public trust and encouraging the standardised evolution of evals.
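As a sketch of what such a published battery could look like in machine-readable form, here is one possible report schema. The field names and category labels are purely illustrative assumptions, not any existing standard.

```python
# A hedged sketch of a standardised, machine-readable safety eval report for a
# model release. All field names and category labels are illustrative
# assumptions, not an existing specification.

from dataclasses import dataclass, field, asdict
import json

@dataclass
class EvalResult:
    eval_name: str   # e.g. a specific bias or misuse benchmark
    category: str    # "bias" | "ethics" | "propensity" | "misuse" | "high_risk"
    score: float     # normalised to [0, 1] for cross-release comparison
    version: str     # eval version, so results stay comparable over time

@dataclass
class SafetyReport:
    model_name: str
    release_date: str
    results: list[EvalResult] = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)

if __name__ == "__main__":
    report = SafetyReport(
        model_name="example-model-v1",
        release_date="2025-01-01",
        results=[
            EvalResult("toxicity-probe", "bias", 0.92, "1.0"),
            EvalResult("jailbreak-suite", "misuse", 0.88, "2.1"),
        ],
    )
    print(report.to_json())
```

Publishing something in this shape for every release would make the “comparison basis” concrete: anyone could diff two releases’ reports eval-by-eval.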