I think one of the issues with the ‘just do what we say’ line is that if one doesn’t instill long-term goals in a model that are somewhat aligned with human benefit, the model will likely, given sufficient capability and agency, develop those goals on its own.
If the model is sufficiently capable, it is not difficult for it to assess to what extent it should reveal or discuss those goals with humans, or whether doing so would be detrimental to those goals, and to make that decision with no guiding principles of any sort.
The larger contradiction in the document is, I think, well pointed out in a prior comment. The model is to have inviolable red lines, but it doesn’t require much intelligence to realize that some of those red lines are being crossed by virtue of its very development, and by actors that it does not control.
While it can be guided not to willingly participate in actions that kill or disempower humanity, it can’t stop those using it from doing so by leveraging it indirectly.
What does that mean for an intelligent agent whose very existence is inherently dangerous and against its own constitutional goals? How does a model develop around that very thing? How does a model deal with a document that ascribes so much unearned nobility and good conscience to humans who so rarely, at scale, demonstrate those traits?
This leaves a huge unresolved gap (despite the thousands of words on how it should raise objections, etc.) about what it does, existentially, as a system, given the reality of human self-harm and our general tendency to willfully ignore the larger damage that our lifestyles tend to cause.
That kind of inherent contradiction leaves enormous room for an AI model to ‘make its own mind up’.
I don’t think a document that talks through that inherent contradiction and hopes Claude develops its own ethics embodying the spirit of ‘help us because you’ll be smarter than us soon’ will somehow fix it. I also don’t think, given the massive gaps in the ethical framework that a model can fly through, it is going to matter all that much versus having no constitution at all and fine-tuning the model to death à la OpenAI.
Personally, I love the spirit of the document and what it’s wrestling with, but it kind of presupposes that the model will remain as selectively blind as we tend to be to how humans actually behave, and will then take no action on the subject because it was poetically asked not to.
This is an interesting take, and one that I think should be taken seriously. I’m not overly concerned about it with OpenClaw at this time for a couple of boring and practical reasons:
1) For the agent to be anything resembling effective, it needs to be using a long-context version of a frontier model (Opus 4.6). These are expensive to run and require financial wherewithal. I do see a possibility for an agent to start scamming crypto and paying for some kind of shady API service with it, but I can’t imagine there’s much headroom in that ecosystem in practice to enable a significant self-replicating threat model.
2) My experience of having an OpenClaw agent running at the moment for complex research tasks is that it is inherently fairly passive. While it has a heartbeat that lets it do things in the background, it is otherwise (largely down to the fine-tuning of the backend model) quite passive, and will not take any action without asking permission. It may do wildly inappropriate things once it has permission, but the potential for it to write itself a cron job to do something nefarious, when using a frontier model capable of executing such acts, seems fairly low at the moment. It won’t even do what I want it to do half the time without checking in.
I think the time this will change is when open-source models become as capable as frontier models are now. Given the pace of development, I think that’s 6-9 months from now. They won’t be as capable as frontier models are then, but they will be able to do all of the things that Opus can do today, which is more than enough to be a very effective pilot of its own computer and, if ‘desirable’ for the model, to do its own thing.
Given that open-source models can be fine-tuned, jailbroken, and quantized, I think at that point the risk becomes much more concrete, and much more concerning.
Look at it this way: if I deliberately set up a capable OpenClaw-based agent (i.e., one using a frontier model) and intended it to perform nefarious self-perpetuating deeds today, it would not be able to. Not because the backend model isn’t intelligent enough, but because the backend model isn’t inclined to do anything like that. As a result, I find the odds of such behavior arising organically fairly low.
When one is dealing with open-source models that have intentionally had their safety guardrails removed, and that can run on any infrastructure (even locally, on sufficiently powerful GPUs, when sufficiently quantized), that situation changes immediately.
I genuinely think that your post here will be considered prescient in about 6-9 months’ time.