This is an interesting take, and one that I think should be taken seriously. I’m not overly concerned about it with OpenClaw at this time for a couple of boring and practical reasons:
1) For the agent to be anything resembling effective, it needs to be using a long context version of a frontier model (Opus 4.6). These are expensive to run and require financial wherewithal. I do see a possibility for an agent to start scamming crypto and paying for some kind of shady API service with it, but I can’t imagine there’s much headroom in that ecosystem in practice to enable a significant self-replicating threat model.
2) My experience of having an OpenClaw agent running at the moment for complex research tasks, it is inherently fairly passive. While it has a heartbeat which lets it do things in the background, it’s otherwise (largely down to the fine tuning of the backend) very passive, and will not take any action without asking permission. It may do wildly inappropriate things once it has permission, but the potential for it to write a cron job to itself to do something wildly inappropriate when using a frontier model capable of executing such nefarious acts seems fairly low at the moment. It won’t even do what I want it to do half the time without checking in.
I think the time that this will change is when open-source models become as capable as frontier models are now. Given the pace of development, I think that’s 6 − 9 months from now. They won’t be as capable as frontier models are then, but they will be able to do all of the things that Opus can do today, which is more than enough to be a very effective pilot of its own computer, and if ‘desirable’ for the model, do its own thing.
Given that open-source models can be fine-tuned, jail broken, quantized, I think at that point the risk becomes much more tractable and concerning.
Looking at it this way. If I deliberately set up a capable OpenClaw based agent (i.e., using a frontier model), and intended it to perform nefarious self-perpetuating deeds today, it would not be able to. Not because the backend model isn’t intelligent enough, but because the back end model isn’t inclined to do anything like that. As a result, I find the odds of it arising organically fairly low.
When one is dealing with models that have intentionally had their safety guardrails removed, and are open source, and can run on any infrastructure (even locally on sufficiently powerful GPUs and sufficiently quantized) that situation changes immediately.
I genuinely think that your post here will be considered prescient in about 6-9 months’ time.
This is an interesting take, and one that I think should be taken seriously. I’m not overly concerned about it with OpenClaw at this time for a couple of boring and practical reasons:
1) For the agent to be anything resembling effective, it needs to be using a long context version of a frontier model (Opus 4.6). These are expensive to run and require financial wherewithal. I do see a possibility for an agent to start scamming crypto and paying for some kind of shady API service with it, but I can’t imagine there’s much headroom in that ecosystem in practice to enable a significant self-replicating threat model.
2) My experience of having an OpenClaw agent running at the moment for complex research tasks, it is inherently fairly passive. While it has a heartbeat which lets it do things in the background, it’s otherwise (largely down to the fine tuning of the backend) very passive, and will not take any action without asking permission. It may do wildly inappropriate things once it has permission, but the potential for it to write a cron job to itself to do something wildly inappropriate when using a frontier model capable of executing such nefarious acts seems fairly low at the moment. It won’t even do what I want it to do half the time without checking in.
I think the time that this will change is when open-source models become as capable as frontier models are now. Given the pace of development, I think that’s 6 − 9 months from now. They won’t be as capable as frontier models are then, but they will be able to do all of the things that Opus can do today, which is more than enough to be a very effective pilot of its own computer, and if ‘desirable’ for the model, do its own thing.
Given that open-source models can be fine-tuned, jail broken, quantized, I think at that point the risk becomes much more tractable and concerning.
Looking at it this way. If I deliberately set up a capable OpenClaw based agent (i.e., using a frontier model), and intended it to perform nefarious self-perpetuating deeds today, it would not be able to. Not because the backend model isn’t intelligent enough, but because the back end model isn’t inclined to do anything like that. As a result, I find the odds of it arising organically fairly low.
When one is dealing with models that have intentionally had their safety guardrails removed, and are open source, and can run on any infrastructure (even locally on sufficiently powerful GPUs and sufficiently quantized) that situation changes immediately.
I genuinely think that your post here will be considered prescient in about 6-9 months’ time.
Thanks! I agree with most of what you say.