When do you plan to upload to HF? Would love to play around with the models!
Jasmine Li
Great stuff, Igor!
Btw, the hyperlink for "Call for Science of Eval Awareness" is incorrect; it points to this same article.
Hey Fabien! The Claude constitution’s consistency principle was also inspiration for this work. I’m excited about additional emphasis on cooperativeness as a dispositional target—for instance, adding cooperativeness-related documents to the dataset of positive fictional stories that Anthropic trained on to reduce AM alignment failures.
I think it’s difficult to say from its response how much Opus truly believes in cooperation, or, assuming it does hold this belief, how much more it would game evals without it. My guess is that targeting cooperativeness as a complement to other alignment methods would buy some reduction in Opus’s eval gaming if done well.