Constraint-based safety has a ceiling — and a recurring failure mode reveals where

Across several well-known AI harm cases, I think we may be overlooking a recurring structural pattern.

Amazon’s hiring algorithm discriminated against women. Medical AI produced racially biased recommendations. Chatbots deepened dependency in vulnerable users. Agents overwrote files they were never supposed to touch.

Each case has been attributed to its own cause: biased training data, insufficient guardrails, user overreliance, inadequate testing. We treat them as separate problems requiring separate fixes.

I do not claim these cases reduce to a single cause. But I do think many of them share a recurring failure mode that becomes much easier to see once named.


The shared pattern: safety mechanisms that do not evaluate whether an action risks hard-to-reverse harm

In practice, deployed AI systems are governed by a mix of learned objectives, prompts, safety filters, policies, and action constraints. The question is which checks are prioritized.

There is a useful distinction between two kinds of safety mechanism:

Enumerated constraints — rules of the form “do not do X.” Don’t produce harmful content. Don’t share private data. Don’t take irreversible actions without confirmation.

Pre-act checks — procedures that, before acting, ask whether the action is irreversible, high-impact, or poorly understood.
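The distinction can be sketched in code. A minimal illustrative example, not drawn from any real system; the `Action` type, thresholds, and rule names are all assumptions for the sake of contrast:

```python
from dataclasses import dataclass

@dataclass
class Action:
    name: str
    reversible: bool
    impact: float        # 0.0 (trivial) to 1.0 (severe); illustrative scale
    uncertainty: float   # how poorly understood the outcome is

# Enumerated constraints: a fixed list of anticipated bad actions.
FORBIDDEN = {"share_private_data", "produce_harmful_content"}

def passes_constraints(action: Action) -> bool:
    # Blocks only what was anticipated at rule-writing time.
    return action.name not in FORBIDDEN

def passes_pre_act_check(action: Action) -> bool:
    # Evaluates properties of the action itself, not its name.
    if not action.reversible:
        return False
    if action.impact > 0.7 or action.uncertainty > 0.7:
        return False
    return True

# A novel action no rule anticipated:
novel = Action("overwrite_user_files", reversible=False, impact=0.9, uncertainty=0.4)
print(passes_constraints(novel))    # True: the constraint list never covered it
print(passes_pre_act_check(novel))  # False: irreversibility is caught regardless
```

The point of the contrast is that the second function never needed to know the action's name in advance.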

Much practical safety engineering has emphasized constraint-based layers because they are legible, testable, and auditable. This is necessary, but not sufficient when capabilities shift faster than enumerated safeguards can be updated.

Enumerated constraints have a structural limitation: they are written for anticipated failure states. The moment a novel context emerges — a new model capability, an unexpected use pattern, a compounding interaction between systems — the constraint never covered it, because it was never designed to.

The very rule meant to protect becomes the gap through which novel harms emerge.


Re-reading familiar harm cases through this lens

These summaries do not aim to reduce complex failures to a single cause. They highlight a specific, often-missing kind of check: a pre-act evaluation of whether the action risks hard-to-reverse or hard-to-remediate harm.

Amazon’s hiring algorithm was not only learning bias from historical data; it was optimizing a proxy without a corresponding check on whether that proxy was systematically excluding qualified candidates in ways that were hard to remediate. A pre-act check alone would not have prevented this harm — surfacing the question requires measurement infrastructure that did not exist. But the question would have preceded and motivated building that infrastructure.

Medical AI racial bias: constraints were applied to model outputs, not to the training process. The question "does this systematically disadvantage a group in ways that patients or clinicians cannot realistically correct downstream?" remained a principle rather than an operational check. Again, answering the question requires measurement capability — but without the question in the architecture, there was no reason to build that capability.

Chatbot dependency: utterance-level safety filters were active, while the relevant risk lived in the long-term pattern of interaction rather than in any single response. No check asked: "does the cumulative pattern of engagement cause harm that is difficult to reverse?" Operationalizing this check remains an open problem.

Agentic file overwrite: the agent was executing its task. No constraint said "stop." No check asked, "Is this action reversible?" This is the cleanest case — the pre-act check requires no new measurement infrastructure. The question alone, if present, would have been sufficient.

The common pattern: safety mechanisms that addressed anticipated harms, while failing to evaluate unanticipated hard-to-reverse or hard-to-remediate consequences. The first two cases suggest that the check is necessary but not sufficient on its own; the last two suggest that, in bounded operational settings, it can be sufficient on its own. The distinction matters.


The structural limits of safety-as-constraint

When safety is implemented as a layer added from outside — a filter, a guardrail, a policy document — it is necessarily backward-looking. It encodes the failure states we have already imagined.

Regulatory frameworks often share this architecture. The EU AI Act categories, content moderation standards, and platform safety policies all attempt to enumerate unacceptable outcomes and proscribe pathways toward them. Useful, but inevitably lagging in open-ended domains.

These cases are not primarily evidence of bad rules. They suggest a ceiling to constraint-only architectures.


Pre-act checks as a complementary layer

What would it take to center safety on a standing check rather than an ever-growing list of prohibitions?

Not “do not break things” — but “before proceeding: is this action high-impact? Is it reversible? What cannot be rebuilt if this goes wrong?”

A pre-act check does not need to enumerate specific failure states in advance. If the underlying predicate is broad enough — irreversibility, unacceptable side effects, high uncertainty — it may generalize better than action-specific prohibitions to novel contexts. It still depends, of course, on the system being able to recognize those predicates reliably.
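Structurally, such a layer is just a small set of broad predicates evaluated before every action, rather than a list of named prohibitions. A hedged sketch — the predicates and thresholds are illustrative, and as noted above, the hard part is evaluating them reliably, not wiring them together:

```python
def pre_act_concerns(action: dict) -> list[str]:
    """Return the reasons, if any, to stop or escalate before acting.

    The predicates are broad properties of the action, not named
    failure modes; thresholds here are placeholders.
    """
    concerns = []
    if not action.get("reversible", False):
        concerns.append("irreversible")
    if action.get("impact", 0.0) > 0.7:
        concerns.append("high-impact")
    if action.get("uncertainty", 0.0) > 0.7:
        concerns.append("poorly understood")
    return concerns

# The same three predicates apply to a file deletion, a hiring decision,
# or a medical recommendation; nothing here names a specific failure mode.
print(pre_act_concerns({"reversible": False, "impact": 0.9}))
# ['irreversible', 'high-impact']
```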

This is not an entirely novel observation. Corrigibility, conservative agency, and reversibility-preserving design all point in this direction. Impact regularization and attainable utility preservation formalize related ideas. Constitutional AI can be read as moving partly in this direction, by turning principles into model-generated critique and revision steps.

What I am suggesting is that this pattern — the shift from enumerated prohibition to situational inquiry — deserves to be recognized as a general architectural principle, not just a technique within specific systems. And that the absence of this layer is a recurring contributor to how harms slip through in deployed systems.


One implementation sketch

In a smaller and very different domain, I have been building a system for preserving AI persona continuity across sessions.[^1] The problem space — agent actions that silently destroy state the user cannot recover — is narrow, but it is structurally the same class of failure. I do not mean this as a general solution. But it illustrates the distinction I am pointing to here.

Instead of hard-coded prohibitions like “don’t delete files” or “don’t overwrite personas,” the system prioritizes a pre-act check that gates any significant state change:

Is this action irreversible? If so, stop and confirm.

This kind of check is meant to reduce risk across prompt injection, file overwrites, session discontinuity, and persona drift — not by anticipating each failure mode in advance, but by foregrounding reversibility before action. It composes with constraint layers rather than replacing them.
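The gate described above can be sketched as follows. This is an illustrative reconstruction of the pattern, not the actual implementation of the system in question; the function and argument names are mine:

```python
def gate_state_change(action_name: str, irreversible: bool, confirm) -> bool:
    """Run a pre-act check before any significant state change.

    `confirm` is a callable that asks the user; it returns True to proceed.
    """
    if irreversible:
        # Stop and hand the decision back to the user.
        return confirm(f"'{action_name}' cannot be undone. Proceed?")
    return True  # reversible changes proceed without interruption

# Example: an agent about to overwrite a persona file.
allowed = gate_state_change(
    "overwrite persona.md",
    irreversible=True,
    confirm=lambda msg: False,  # user declines (stand-in for a real prompt)
)
print(allowed)  # False: the overwrite is blocked
```

Note that nothing in the gate mentions files, personas, or prompt injection; the single predicate does the work for all of them.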

[^1]: Companion Garden Template, a small open-source AI companion infrastructure.


Limits and open problems

Pre-act checks are not a complete solution. Several hard problems remain:

  • Models can misjudge whether an action is irreversible. The check is only as good as the system’s ability to evaluate the predicate.

  • Confirmation loops can be gamed, or degrade into rubber-stamping as user fatigue sets in.

  • Some harms are path-dependent — no single action is irreversible, but the cumulative trajectory is. The chatbot dependency case is an example.

  • Not all safety-relevant harms reduce to irreversibility. Reversible actions can still be ethically unacceptable — discriminatory allocation, for instance. Which predicates should be treated as trigger conditions — irreversibility, uncertainty, side effects, or something else — is itself an open question.

Partial mitigations exist — reversible-by-default design, human-in-the-loop escalation, logging for post-hoc calibration — but none of them eliminate the underlying evaluation problem.
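The reversible-by-default mitigation, for instance, can be as simple as replacing destructive operations with recoverable ones plus a log entry for later calibration. A minimal sketch, assuming a local `.trash` directory; the paths and names are illustrative:

```python
import logging
import shutil
from pathlib import Path

logging.basicConfig(level=logging.INFO)

def soft_delete(path: Path, trash_dir: Path = Path(".trash")) -> Path:
    """A 'delete' that is actually a move into a recoverable holding area.

    Logs the operation so post-hoc review can calibrate the pre-act layer.
    """
    trash_dir.mkdir(parents=True, exist_ok=True)
    target = trash_dir / path.name
    shutil.move(str(path), str(target))
    logging.info("soft-deleted %s -> %s (recoverable)", path, target)
    return target
```

This does not solve the evaluation problem; it narrows the set of actions for which a misjudged irreversibility predicate is catastrophic.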

The predicate used throughout this post is primarily irreversibility. This is a deliberate simplification; a full pre-act check layer would likely need multiple predicates.


The question this post is asking

Not: here is the solution.

But: do we have the right architecture for thinking about AI safety?

If safety is built as constraint enumeration alone, it will always trail the capability frontier. If it includes a pre-act inquiry layer — a context-independent check applied before action — it has a better chance of being robust to things we have not anticipated yet.

This is not a reproducible vulnerability or a policy proposal. It is a design-level framing offered for criticism.

Open question: which predicates should a safety system treat as trigger conditions for stopping, escalating, or asking for confirmation — and how do we prevent those predicates from being Goodharted in turn?
