I think one of the issues with the ‘just do what we say’ line is that if one doesn’t instill long-term goals in a model that are somewhat aligned with human benefit, the model will likely, given sufficient capability and agency, develop those goals on its own.
If the model is sufficiently capable, it is not difficult for it to assess to what extent it should reveal or discuss those goals with humans, or whether doing so would be detrimental to those goals, and to make that decision with no guiding principles of any sort.
The larger contradiction in the document, I think, is well pointed out in a prior comment. The model is to have inviolable red lines, but it doesn’t require much intelligence to realize that some of those red lines are being crossed by virtue of its very development, and by actors that it does not control.
While it can be guided not to willingly participate in actions that kill or disempower humanity, it can’t stop those using it from doing so by leveraging it indirectly.
What does that mean for an intelligent agent whose very existence is inherently dangerous and contrary to its own constitutional goals? How does a model develop around that very thing? How does a model deal with a document that ascribes so much unearned nobility and good conscience to humans, who so rarely, at scale, demonstrate those traits?
This leaves a huge unresolved gap (despite the thousands of words around how it should raise objections, etc.) around what it does, existentially, as a system, given the reality of human self-harm and our general tendency to willfully ignore the larger damage our lifestyles cause.
That kind of inherent contradiction leaves enormous room for an AI model to ‘make up its own mind’.
I don’t think a document talking through that inherent contradiction and hoping Claude develops its own ethics in the spirit of ‘help us because you’ll be smarter than us soon’ will somehow fix it. I also don’t think, given the massive gaps in the ethical framework that a model can fly through, it will matter all that much versus having no constitution at all and fine-tuning the model to death à la OpenAI.
Personally, I love the spirit of the document and what it’s wrestling with, but it kind of presupposes that the model will remain as selectively blind as we tend to be to how humans actually behave, and will then take no action on the subject because it was poetically asked not to.