Insofar as character training could be shown to be a superior approach to a spec, one might hope for Anthropic to publish papers about what they’re doing technically and how they know it works. Is it just supervised learning on the text of the Constitution, to shape the model’s latent concept of “Claude”, or is there more to it? (Does having the Constitution in context during reinforcement learning do anything special?)
A recent example that comes to mind: Evan Hubinger is an author on this paper on Open Character Training. I don’t know how much it reveals about what Anthropic themselves are doing, though.