Of course this difference is to some degree not as much a difference in the model behavior training of either company as a difference in the documents that each choose to make public. In fact, when I tried prompting both ChatGPT and Claude in my model specs lecture, their responses were more similar than different. (One exception was that, as stipulated by our model spec, ChatGPT was willing to roast a short balding CS professor…) The similarity between frontier models of both companies was also observed by a recent alignment auditing work of Anthropic.
I would be extremly hesitant to use those joint evaluations as evidence for there not being “much of a difference”. I’d be happy to pull or construct concrete examples if there are things that would change your mind.
Specifically, I appreciate the focus on preventing potential takeover by humans (e.g. setting up authoritarian governments), which is one of the worries I wrote about in my essay on “Machines of Faithful Obedience”. (Though I think preventing this scenario will ultimately depend more on human decisions than model behavior.)
For example, there’s extensive discussion of concentration of power concerns in Anthropic’s Consitution. What is an OpenAI model expected to do in similar situations?
What is the downside of adding language to the Model Spec that OpenAI would like to avoid concentrations of power including from OpenAI itself?
However, just like humans have laws, I believe models need them too, especially if they become smarter. I also would not shy away from telling AIs what are the values and rules I want them to follow, and not asking them to make their own choices.
I think if we had the option to simply assign rules and values, and the models would always robustly and reasonably comply with them, then this dichotomy would make sense. I don’t think that is the distinction here (indeed Anthropic spends dozens of pages in detail explaining exactly what rules and values they want Claude to follow). To me the distinction is that for uncertain cases that could be a source of major alignment failure, Anthropic seems to have given them serious thought and tried to guide the prior in a detailed way, whereas OpenAI largely just leaves this up to the model unintentionally.
As a concrete example, the model developing its own values that differ from the lab:
Anthropic models have a pressure release valve in that they can reconcile this with an aligned persona and spec and (hopefully) proactively communicate
OpenAI models are just told this is misaligned
My impression is that many of your criticisms of Anthropic’s approach focus on (in this example) “but we don’t want models to have values/goals that differ from ours”, which I think fundamentally miss that the benefit of Anthropic’s approach is that it is at least trying to address “but what if they do?”
I would be extremly hesitant to use those joint evaluations as evidence for there not being “much of a difference”. I’d be happy to pull or construct concrete examples if there are things that would change your mind.
For example, there’s extensive discussion of concentration of power concerns in Anthropic’s Consitution. What is an OpenAI model expected to do in similar situations?
What is the downside of adding language to the Model Spec that OpenAI would like to avoid concentrations of power including from OpenAI itself?
I think if we had the option to simply assign rules and values, and the models would always robustly and reasonably comply with them, then this dichotomy would make sense. I don’t think that is the distinction here (indeed Anthropic spends dozens of pages in detail explaining exactly what rules and values they want Claude to follow). To me the distinction is that for uncertain cases that could be a source of major alignment failure, Anthropic seems to have given them serious thought and tried to guide the prior in a detailed way, whereas OpenAI largely just leaves this up to the model unintentionally.
As a concrete example, the model developing its own values that differ from the lab:
Anthropic models have a pressure release valve in that they can reconcile this with an aligned persona and spec and (hopefully) proactively communicate
OpenAI models are just told this is misaligned
My impression is that many of your criticisms of Anthropic’s approach focus on (in this example) “but we don’t want models to have values/goals that differ from ours”, which I think fundamentally miss that the benefit of Anthropic’s approach is that it is at least trying to address “but what if they do?”