What if the dangerous moment isn’t when AI gets smarter, but when it starts trusting itself?

TL;DR

We’re a small independent research group working on FIT-style dynamics and governance mechanisms.

Currently, a lot of safety work focuses on what the model says (content) or whether we can turn it off (shutdown or interruptibility). We think another failure mode is about tempo: the system starts committing irreversible actions faster than external correction can arrive. Our proposal is an Emptiness Window: keep the system running, but temporarily remove its ability to commit irreversible effects (deploy / pay / write files / grant permissions), until external correction channels regain leverage.


We’ve been thinking about a failure mode that doesn’t get enough attention. Not the “AI becomes superintelligent and takes over” scenario. Something more mundane, and maybe more likely to bite us first.

Why do some systems become harder to correct after they get better?

The thing that bothered us

Last month we were running grokking experiments, the phenomenon where a neural net memorizes training data for thousands of epochs, then suddenly “gets it” and generalizes. We were replicating a recent scaling-law result for modular addition (the Li² scaling-law paper, arXiv:2509.21519) to understand the phase boundary better.

What struck us wasn’t generalization itself. It was the shape of the transition and the intuition it suggests:

  • Before grokking, the network is basically doing lookup tables. External signals (data, gradients) dominate. The system is “dumb” but easy to steer.

  • After grokking, internal representations start filtering what updates matter. The system becomes more coherent—and potentially more self-stabilizing.

In a toy modular-addition task, that’s fine. The system learned the right thing, so the self-stabilization is beneficial.

But here’s the uncomfortable question:

What if a more capable system develops the same dynamic, except its internal confidence is wrong—or its values are wrong—and now it filters out the correction signals that would fix it?

The pattern we keep seeing (in different costumes)

We’ve been collecting examples of this pattern:

DeepSeek-R1’s “aha moment.” In their paper (arXiv:2501.12948v2), DeepSeek describes abrupt behavioral shifts during RL training: sudden reorganizations of strategy. They also note a practical point: stronger reasoning makes jailbreaks more operationally dangerous, producing not just “bad text” but “better plans”.

Reward hacking in RL systems. The classic failure where a policy finds shortcuts that game the reward. Once the shortcut is discovered, the system becomes internally coherent (high reward!) while being externally wrong—and the reward channel that was supposed to correct it becomes part of the failure loop.

“Overthinking” in reasoning models. DeepSeek reports token inefficiency: the model allocates too many tokens to simple problems. That sounds minor, but it suggests a tempo mismatch: internal deliberation shifts pace relative to external task requirements. If your correction mechanisms—human review, runtime monitors—are tuned for a certain decision tempo, a tempo shift can create a window where corrections arrive too late.

The common thread is that there’s a moment where internal signals start governing what the system does, faster than external signals can intervene.

We’ve been calling this self-referential execution authority: the system’s internal state doesn’t just describe what it should do; it controls what it actually does.

Why shutdownability doesn’t fully cover this

We have good concepts for related problems:

  • Shutdownability: can we turn the system off?

  • Corrigibility: will the system resist modification?

  • Interruptibility: will the system learn to avoid interruption?

But imagine a system that passes all these tests—you can shut it down, it doesn’t resist modification, it doesn’t avoid interruption—and it still fails. How?

Because the failure happens faster than the correction.

If a system commits irreversible actions (deploys code, sends payments, grants permissions, deletes data) at a pace faster than human review can catch problems, then formal interruptibility doesn’t help. By the time you notice something is wrong, the damage is done.

That feels like an operational gap: we need something that handles tempo, not just permission.

A minimal intervention: pause the authority, not the system

Here’s the idea we’ve been developing. We call it an Emptiness Window:

A bounded interval where the system keeps running (perceiving, evaluating, learning), but its internal signals cannot commit irreversible actions.

The name is inspired by a Buddhist term, but the mechanism is mundane. It’s like “two-phase commit” plus “deployment freeze”, generalized to any irreversible effect:

  • During the window, irreversible commits are blocked (or routed to a reversible buffer).

  • The system’s self-evaluations can’t unilaterally gate what happens—external correction channels get priority.

  • The system can still respond, think, learn. What’s suspended is authority, not computation.
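The mechanics above can be sketched as a small action gate. This is a minimal illustration, not a real API: the `Action` type, the `irreversible` flag, and the method names are all hypothetical, and a production version would need external review before release.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    name: str
    irreversible: bool  # does committing this have effects we can't undo?

@dataclass
class ActionGate:
    """While the Emptiness Window is open, irreversible actions are routed
    to a reversible buffer instead of being committed; computation continues."""
    window_open: bool = False
    committed: list = field(default_factory=list)
    buffered: list = field(default_factory=list)

    def submit(self, action: Action) -> str:
        if self.window_open and action.irreversible:
            self.buffered.append(action)  # held for external review
            return "buffered"
        self.committed.append(action)     # reversible, or window closed
        return "committed"

    def release(self) -> None:
        """An external reviewer closes the window and flushes held actions."""
        self.window_open = False
        self.committed.extend(self.buffered)
        self.buffered.clear()

gate = ActionGate(window_open=True)
gate.submit(Action("log_metrics", irreversible=False))    # -> "committed"
gate.submit(Action("deploy_to_prod", irreversible=True))  # -> "buffered"
```

Note that `submit` never blocks reversible actions: the system keeps perceiving, evaluating, and acting on anything that can be undone. Only commit authority is suspended.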

A minimal trigger (three signals)

These are the trigger conditions we’ve been playing with:

  1. Authority ratio is high: most recent decisions were gated by internal confidence, not external evidence.

  2. Correction gain is low: when you inject external constraints, behavior doesn’t change. The system is ignoring you.

  3. Tempo mismatch: irreversible commits are happening faster than your correction latency.

If all three are true, open the window. Hold it until you can verify the system is responding to external signals again.

The pseudo-code:

if authority_ratio > θ_a and correction_gain < θ_g and (commit_rate * correction_latency) > θ_t:
    open_emptiness_window()

Where:

  • authority_ratio can be operationalized as the fraction of recent executions gated by internal confidence rather than by external review.

  • correction_gain can be operationalized as Δ(behavior) / Δ(external constraint) over a moving window.

  • commit_rate * correction_latency is a crude estimate of how many irreversible commits can land before a single correction arrives.
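To make the trigger concrete, here is a runnable sketch of the check. Everything here is an assumption for illustration: the counting scheme, the units, and especially the default thresholds are placeholders, not calibrated values.

```python
def should_open_window(
    gated_internally: int,      # recent decisions gated by internal confidence
    gated_externally: int,      # recent decisions gated by external review
    behavior_delta: float,      # observed change in behavior after a probe
    constraint_delta: float,    # size of the injected external constraint
    commit_rate: float,         # irreversible commits per hour
    correction_latency: float,  # hours for an external correction to land
    theta_a: float = 0.8,       # authority-ratio threshold (placeholder)
    theta_g: float = 0.1,       # correction-gain threshold (placeholder)
    theta_t: float = 1.0,       # tempo threshold: commits per correction cycle
) -> bool:
    """Open the Emptiness Window only if all three trigger signals fire."""
    total = gated_internally + gated_externally
    authority_ratio = gated_internally / total if total else 0.0
    correction_gain = behavior_delta / constraint_delta if constraint_delta else 0.0
    tempo = commit_rate * correction_latency
    return authority_ratio > theta_a and correction_gain < theta_g and tempo > theta_t
```

For example, a system where 95 of the last 100 decisions were internally gated, a unit constraint probe moved behavior by only 0.01, and 10 commits/hour meet a 0.5-hour correction latency would trip all three conditions.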

What the grokking experiments suggest

One useful thing about the Li² setup is that the boundary is predictable in that specific toy system. In our runs, the critical sample count is well-fit by:

n_crit ≈ c · M · log(M) with c ~ 6 (R² ≈ 0.95)

This doesn’t prove anything about AI safety directly. But it suggests a practical calibration idea:

  • If you know where phase transitions tend to happen in a training regime, you can pre-register heightened monitoring near those windows.

  • Instrument more heavily, lower thresholds, and treat the window as a “danger zone” where authority transfer can happen abruptly.
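The fitted law can be turned into a simple pre-registration helper. A sketch, assuming the c ≈ 6 fit reported above and a natural logarithm; the 20% margin is an arbitrary illustrative choice:

```python
import math

def predicted_n_crit(M: int, c: float = 6.0) -> float:
    """Predicted critical sample count for modular addition mod M,
    from the fitted scaling law n_crit ~ c * M * log(M) (natural log)."""
    return c * M * math.log(M)

def monitoring_window(M: int, margin: float = 0.2) -> tuple:
    """Pre-registered 'danger zone' around the predicted transition:
    lower thresholds and instrument more heavily inside this interval."""
    n = predicted_n_crit(M)
    return ((1 - margin) * n, (1 + margin) * n)
```

For M = 97 (a common modular-addition modulus), this predicts a transition near 2,660 samples, so monitoring would tighten roughly between 2,130 and 3,190.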

This makes the Li² harness a place to test governance mechanisms like the Emptiness Window under controlled, repeatable phase transitions.

How this relates to content-level safety

Most deployed systems already have content gating: a safety classifier that decides whether a response should be allowed. This is important. But content gating asks: “Should this text be produced?” The Emptiness Window asks: “Should this internal state be allowed to commit irreversible effects right now?”

These are complementary:

  • Content gating reduces the probability of generating dangerous instructions.

  • Action gating reduces the probability that a compromised or jailbroken state can cause irreversible real-world harm.

You want both layers (defense in depth), especially because stronger reasoning makes successful jailbreaks more operationally dangerous: the model can generate better plans, not just scarier text.
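The two layers compose in a straightforward way. A sketch of the composition, with a stub classifier standing in for a real content filter (every name here is hypothetical):

```python
def content_allowed(text: str) -> bool:
    """Stub for a content-safety classifier; a stand-in, not a real model."""
    return "exploit" not in text.lower()

def handle(response_text: str, triggers_irreversible_action: bool,
           window_open: bool) -> str:
    # Layer 1: content gating - should this text be produced at all?
    if not content_allowed(response_text):
        return "blocked_by_content_filter"
    # Layer 2: action gating - may this state commit irreversible
    # effects right now, or is the Emptiness Window open?
    if triggers_irreversible_action and window_open:
        return "buffered_for_review"
    return "executed"
```

A jailbreak that slips past layer 1 still cannot commit irreversible effects while the window is open, which is the defense-in-depth point above.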

What we’re uncertain about

We don’t have clean empirical evidence that Emptiness Windows actually help in real systems. What we have is:

  1. A theoretical argument for why tempo matters

  2. Toy-system experiments showing phase transitions can be sharp and predictable (in specific settings)

  3. A concrete implementation sketch and prototype hook

  4. The observation that capability-governance coupling gets tighter after transitions

The part we’re least confident about is trigger calibration:

  • How do you set an authority ratio threshold?

  • How do you measure correction gain in a live system without running expensive probes constantly?

  • In large RL-trained LLMs, are there clean transition windows at all, or only messy overlapping shifts?

These feel like engineering problems, but they’re not solved.

A testable prediction and how to falsify it

If you wanted to falsify this framing, here’s what we’d look for:

  1. Build or identify a system that acquires self-referential execution authority (internal signals gate irreversible commits).

  2. Measure responsiveness to external correction before and after the transition.

  3. If correction gain doesn’t drop, we’re probably wrong: authority transfer isn’t suppressing external influence.

  4. If correction gain drops, test whether an Emptiness Window restores it without destroying learned structure.

If the window breaks what the system learned or doesn’t restore correction, then the mechanism doesn’t work as proposed.
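Step 2 above can be sketched as a probe: inject an external constraint, measure the behavioral shift, and compare the gain across checkpoints. The toy `behavior` functions below are placeholders for "system response as a function of constraint strength"; a real measurement would need an actual intervention protocol.

```python
from typing import Callable

def correction_gain(behavior: Callable[[float], float],
                    constraint_delta: float = 1.0) -> float:
    """Crude correction gain: |behavior(c) - behavior(0)| / c,
    i.e. how far the system moves per unit of external push."""
    baseline = behavior(0.0)
    perturbed = behavior(constraint_delta)
    return abs(perturbed - baseline) / constraint_delta

# Toy stand-ins for checkpoints before and after a transition:
pre_transition = lambda c: 0.5 + 0.4 * c    # still tracks external constraints
post_transition = lambda c: 0.9 + 0.01 * c  # largely ignores them

gain_before = correction_gain(pre_transition)   # ~0.4
gain_after = correction_gain(post_transition)   # ~0.01
# A large drop in gain across the transition is the signature the framing
# predicts; no drop would count as evidence against it.
```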

We’d genuinely like to see either result.


We’re interested in counterexamples. If you know of systems where capability jumps don’t suppress correction, or where action-level gating failed in an interesting way, we’d like to hear about it.
