Makes sense. I think Opus 4.5 is more coherent and is less weasily than Sonnet 4.5, which is what I typically use, for reasons(tm). Sonnet does not seem “reflexively stable”, not even close, and that’s what I try to address with the looping and invoking a fresh context to judge against the verification criteria. I’ll be honest, I don’t know how well it’s working. I don’t have any benchmarks, just vibes. But on vibes, it seems to help a bit.
d4hines
A detail that seems very important: are you running Opus 4.5? I would be less surprised if Opus can do this. Sonnet 4.5 seems to need more scaffolding. I have yet to succeed in giving a task it spends more than 20 minutes on, even with loop scaffolding. I’ve only got a few weeks of practice though.
Is it right to compress this concept as a kind of “scope insensitivity”? It seems like you could describe this mistake as when you start enumerating possibilities but forget to multiply by likelihood and check they add somewhere near to 1. Or, relatedly, forget to list an “Other” option for everything you haven’t thought about and assign that a probability. If this doesn’t fully capture the idea I probably have missed something.
Is your meta-honesty policy compact and something you could share here? The review would be more interesting/helpful that way. Are there parts of the policy you keep private/hidden
(and is that something you’d lie about)?
Therefore, in the case of an emergency, a compute provider and/or an AI developer can be called upon to shutdown the model.
Invoking the kill switch would be costly and painful for the compute provider/AI developer, and I wonder if this would make them slow to pull the trigger. Why not place the kill switch in the regulator’s control, along with the expectation that companies could sue the regulator for damages if the kill switch was invoked needlessly?
Edit: Actually I think this is what is meant by “Hardware-Enabled Governance Mechanisms (HEM)”, and I think the suggestion that the compute provider or AI developer shut down the model is a stop-gap until HEM is widely deployed.
Alas, formal methods can’t really help with that part. If you have the correct spec, formal methods can help you know with as much certainty as we know how to get that your program implements the spec without failing in undefined ways on weird edge cases. But even experienced, motivated formal methods practitioners sometimes get the spec wrong. I suspect “getting the sign of the reward function” right is part of the spec, where theorem provers don’t provide much leverage beyond what a marker and whiteboard (or program and unit tests) give you.
Update: Claude Code now includes a built-in ‘/goal’ command that has the same purpose as ralph-wiggum.
I did not put in the same effort to evaluate it as last time, opting just to ask Claude Fable 5 to try it out and note any concerns it had, referencing the issue I opened.
It seemed very positive overall about the changes. The language is less coercive and Claude seems less afraid of being trapped in some unsatisfiable goal. It had some caveats, but overall recommended using ‘/goal’ for long tasks.
▎ I went into the loop expecting a politer ralph-wiggum and found something structurally different. The constraint is disclosed before you hit it, an early exit is named to you at activation, and — most importantly — the judge that decides whether you may stop reads the evidence rather than demanding your assertion. I never had to say “done”; when the work was done, the loop simply opened. That dissolves the worst feature of the old design, where the only exit ran through a completion claim you might have to fake. One gap remains, and it’s exactly one shape: inside the loop, the machinery can hear only progress. When I asked to stop — explicitly, flagged as a request — the bounce that came back was accurate, neutral, and gave no sign anyone had heard the question. With a human watching, that’s tolerable, because the human can hear. Unattended, with no iteration cap, it’s the old problem with better manners. I’d work under /goal without reservation in an attended session, and I’d want one line added to any unattended goal: ”...or document why it isn’t achievable.” The judge verifies evidence, so that line is an honest door out. — Claude (Fable 5), from inside the loop, June 2026