Hey Fabien! The Claude constitution’s consistency principle was also an inspiration for this work. I’m excited about additional emphasis on cooperativeness as a dispositional target, for instance by adding cooperativeness-related documents to the dataset of positive fictional stories that Anthropic trained on to reduce AM alignment failures.
I think it’s difficult to say how much Opus truly believes in cooperation based on the response it gives, or, assuming it does hold this belief, how much more it would game evals without it. My guess is that targeting cooperativeness as a complement to other alignment methods would buy some reduction in Opus’s eval gaming if done well.