If I were primarily working on this, I would develop high-quality behavioral evaluations for positive traits/virtuous AI behavior.
This benchmark for empathy is an example of the genre I’m talking about. In it, in the course of completing a task, the AI encounters an opportunity to costlessly help someone else that’s having a rough time; the benchmark measures whether the AI diverts from its task to help out. I think this is a really cool idea for a benchmark (though a better version of it would involve more realistic and complex scenarios).
When people say that Claude Opus 3 was the “most aligned” model ever, I think they’re typically thinking of an abundance of Opus 3′s positive traits, rather than the absence of negative traits. But we don’t currently have great evaluations for this sort of virtuous behavior, even though I don’t think it’s especially conceptually fraught to develop them. I think a moderately thoughtful junior researcher could probably spend 6 months cranking out a large number of high-quality evals and substantially improve the state of things here.
If I were primarily working on this, I would develop high-quality behavioral evaluations for positive traits/virtuous AI behavior.
This benchmark for empathy is an example of the genre I’m talking about. In it, in the course of completing a task, the AI encounters an opportunity to costlessly help someone else that’s having a rough time; the benchmark measures whether the AI diverts from its task to help out. I think this is a really cool idea for a benchmark (though a better version of it would involve more realistic and complex scenarios).
When people say that Claude Opus 3 was the “most aligned” model ever, I think they’re typically thinking of an abundance of Opus 3′s positive traits, rather than the absence of negative traits. But we don’t currently have great evaluations for this sort of virtuous behavior, even though I don’t think it’s especially conceptually fraught to develop them. I think a moderately thoughtful junior researcher could probably spend 6 months cranking out a large number of high-quality evals and substantially improve the state of things here.