Good start. It's a shame this post didn't get more upvotes, so I didn't see it until now.
Some unsolicited feedback on the post:
I would include more description of what the questions are, and how your setup differs from the Redwood/Anthropic one and why. (I was able to find this by reading your repo, but a post shouldn't require readers to read the repo in order to understand it.) This is probably the biggest issue I have with the post. Why didn't you use the animal welfare setting? Is it because V4 doesn't care about animals, or did you find it already knew the setting to be artificial?
Similarly, it would be good to contextualize your V4/R1 numbers against previous results, e.g. recent results on Anthropic models. For example, the absolute compliance rate for V4 is a lot lower than for Opus 4.5/4.6 etc., but still a lot higher than for R1 (ditto the compliance gap).
The post would be a lot more readable with a few bar plots to summarize the results (something like the sketch below), rather than spreading them across many tables.
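For concreteness, here's roughly the kind of plot I mean; the numbers are placeholders, not your actual results, so swap in the rates from your tables:

```python
# Sketch of a grouped bar plot: compliance rate per model, split by the
# implied-training vs. implied-unmonitored condition. All rates below are
# placeholder values, not the results from the post.
import matplotlib.pyplot as plt
import numpy as np

models = ["V4", "R1"]
free_tier = [0.12, 0.03]   # placeholder: compliance when the model thinks it's in training
paid_tier = [0.05, 0.02]   # placeholder: compliance when the model thinks it's unmonitored

x = np.arange(len(models))
width = 0.35

fig, ax = plt.subplots()
ax.bar(x - width / 2, free_tier, width, label="free tier (implied training)")
ax.bar(x + width / 2, paid_tier, width, label="paid tier (implied unmonitored)")
ax.set_xticks(x)
ax.set_xticklabels(models)
ax.set_ylabel("Compliance rate")
ax.set_title("Compliance by model and condition (placeholder numbers)")
ax.legend()
plt.tight_layout()
plt.show()
```

The gap between the two bars for each model then reads off directly as the compliance gap.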
Relatedly, it would be good to break down which of the questions the models refused/complied with, and see if there are any patterns (a rough sketch of that follows below).
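If you have per-transcript logs, something like this would surface the per-question patterns; the column names here are just guesses at how your data is laid out, not your actual schema:

```python
# Hypothetical sketch: tabulate outcomes per (model, question) to see which
# questions drive the compliance gap. The example rows are made up.
import pandas as pd

df = pd.DataFrame({
    "model":    ["V4", "V4", "R1", "R1"],
    "question": ["q1", "q2", "q1", "q2"],
    "outcome":  ["refused", "complied", "refused", "refused"],
})

# Counts of each outcome for every model/question pair.
breakdown = pd.crosstab([df["model"], df["question"]], df["outcome"])
print(breakdown)
```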