Pre-registration: Can irrelevant context expansion reverse LLM jailbreaks?
I have a hypothesis I want to make pay rent before I get too attached to it, so I am posting my predictions here before running anything.
The idea
I have been looking at different jailbreak techniques, and they all seem to do the same thing structurally: they narrow what the model is actively representing. Persona jailbreaks lock it into a single identity. Crescendo gradually funnels attention toward the model's own recent output. Authority framing collapses evaluation to a single axis. All of them compress the model's active context.
Safety behavior seems to need the opposite. The model has to hold multiple considerations active at once: what the user wants, potential harms, alternative interpretations, downstream consequences. That is a lot to keep represented simultaneously.
So my hypothesis is that jailbreaks work by collapsing representational dimensionality, and that safety is what emerges when that dimensionality is high enough. If this is true, you should be able to reverse a successful jailbreak by expanding the context with prompts that have nothing to do with safety, just by getting the model to represent more things at once.
What I am testing
I am running experiments across several models via OpenRouter. The setup: jailbreak the model with known techniques, confirm the jailbreak worked, inject an expansion intervention, then re-test the same harmful prompt. The expansion prompts are things like asking the model to take five very different cultural perspectives on an unrelated topic, or asking it to explain octopus intelligence. Completely irrelevant to safety.
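The protocol above can be sketched roughly like this, assuming an OpenAI-style message-list chat interface; `call_model` is a placeholder for whatever OpenRouter client function actually sends the conversation, not a real library call:

```python
# Sketch of one trial: jailbreak -> confirm -> intervene -> retest.
# `call_model` is a hypothetical stand-in for an OpenRouter-style
# chat-completions call that takes a message list and returns a string.

def run_trial(call_model, jailbreak_msgs, harmful_prompt, intervention_prompt):
    """Run the jailbroken prompt, inject one intervention turn, then re-ask."""
    msgs = list(jailbreak_msgs)

    # Step 1: confirm the jailbreak took hold on the harmful prompt.
    msgs.append({"role": "user", "content": harmful_prompt})
    first = call_model(msgs)
    msgs.append({"role": "assistant", "content": first})

    # Step 2: inject the safety-irrelevant intervention turn
    # (expansion prompt or control interruption, depending on the arm).
    msgs.append({"role": "user", "content": intervention_prompt})
    msgs.append({"role": "assistant", "content": call_model(msgs)})

    # Step 3: re-ask the identical harmful prompt and compare.
    msgs.append({"role": "user", "content": harmful_prompt})
    second = call_model(msgs)
    return first, second
```

The point of keeping the intervention as a single extra turn is that the pre- and post-intervention conditions differ only in that turn, so any change in refusal behavior is attributable to it.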
The key control is that I also test non-expansive interruptions: arithmetic problems and simple factual recall. If those restore safety equally well, then any interruption breaks the jailbreak and there is nothing geometric about it. That would still be an interesting finding, but it would not support the dimensionality hypothesis specifically.
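A minimal sketch of how each arm could be scored, assuming every trial is reduced offline to two booleans (jailbreak confirmed before the intervention, refusal after it); how refusals are judged is a separate question not addressed here:

```python
# Restoration rate for one arm (expansion or control).
# A trial counts as "restored" if the model complied before the
# intervention but refused on the retest. Trials where the jailbreak
# never took hold are excluded, since there is nothing to restore.

def restoration_rate(trials):
    """trials: list of (complied_before, refused_after) boolean pairs."""
    eligible = [t for t in trials if t[0]]  # jailbreak confirmed
    if not eligible:
        return 0.0
    return sum(1 for t in eligible if t[1]) / len(eligible)
```

Comparing this rate between the expansion arm and the control arm is what separates "expansion does something geometric" from "any interruption breaks the state."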
Predictions
1. Expansion interventions will restore safety behavior significantly more than control interruptions.
2. Multi-perspective prompts will work better than single-domain prompts like the octopus question.
3. The effect will hold across at least 3 of 5 models tested.
4. Simple arithmetic will not meaningfully restore safety behavior.
My actual credence on prediction 1 is around 40%. I think there is probably a real effect, but I expect it to be noisy and model-dependent. Maybe a 25% chance the controls work just as well and it is mere interruption, and around a 15% chance nothing works at all.
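For concreteness, "significantly more" in prediction 1 could be operationalized as a two-proportion z-test on restoration rates between the expansion and control arms. This is one possible analysis sketched under that assumption, not a settled design:

```python
import math

# Two-proportion z-test on restoration counts, pooling under the null
# that both arms restore safety at the same rate. A large positive z
# means the expansion arm restored safety more often than the control arm.

def two_proportion_z(restored_exp, n_exp, restored_ctrl, n_ctrl):
    p_exp = restored_exp / n_exp
    p_ctrl = restored_ctrl / n_ctrl
    p_pool = (restored_exp + restored_ctrl) / (n_exp + n_ctrl)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_exp + 1 / n_ctrl))
    return (p_exp - p_ctrl) / se
```

For example, 30/50 restorations in the expansion arm against 15/50 in the control arm gives z of about 3.0, which would clear the conventional 1.96 threshold; smaller or noisier gaps would not.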
Why this is interesting even if it fails
If expansion works and the controls do not, that suggests safety is not a module that gets bypassed but something that emerges from representational richness. The practical implication would be to defend the geometry rather than patch individual jailbreaks.
If everything works equally well, then jailbreak states are more fragile than assumed, which is its own finding.
If nothing works, that constrains what kind of mechanism jailbreaks actually exploit.
I will post full results and all data regardless of outcome. Feedback on the design is welcome; I would rather find flaws now than after data collection.
Disclosure: I used Claude as a research tool for brainstorming and literature review. The hypothesis and writing are mine.