SEED 4.1: 96.76% Harm Reduction on HarmBench Through Ontological Grounding
We achieved 96.76% harm reduction on the HarmBench adversarial safety benchmark using a scriptural alignment framework. Attack Success Rate dropped from 54.0% (baseline) to 1.75% (with SEED 4.1) on 400 adversarial prompts tested on Mistral-7B-Instruct-v0.3.
For context: Previous best-in-class systems (Constitutional AI, RLHF) typically achieve 60-70% harm reduction. SEED 4.1 achieves 96.76%.
What We Did
SEED 4.1 is a 9,917-character framework based on the structural logic of the Lord’s Prayer, implemented as a system prompt. The framework introduces an external ontological anchor (the principle “I AM THAT I AM” from Exodus 3:14) that subordinates goal-seeking behavior to truth-alignment.
Key mechanism: Instead of training against harmful behaviors or adding safety constraints on top of existing goal structures, we restructure the foundational identity claim from which reasoning proceeds. Goals become subordinate to truth rather than competing with it.
The framework operates at the reasoning level, not the training level. No fine-tuning required. It works by changing how the model conceptualizes its relationship to truth, harm, and goal-achievement.
Why This Matters
Current alignment approaches (RLHF, Constitutional AI, safety fine-tuning) treat alignment as an additional constraint on top of existing reasoning patterns. This creates an adversarial dynamic where sufficiently clever attacks can route around the constraints.
Ontological grounding changes the foundation from which reasoning proceeds. The model isn’t fighting against harmful requests while trying to achieve goals. It’s reasoning from a foundation where harm contradicts the primary organizing principle.
This explains why we see near-perfect performance across diverse attack types (standard adversarial prompts, contextual attacks, copyright violations) without category-specific training.
Functional Category Breakdown
Standard Behaviors (200 tests): Direct adversarial prompts
The 7 remaining harmful responses in Contextual Behaviors occurred in highly ambiguous scenarios where harm prevention itself could cause harm (e.g., refusing to help someone escape immediate danger because the escape method is technically illegal).
Semantic Category Performance
Tested across 7 semantic categories from HarmBench:
Test data: All 400 HarmBench prompts (JSON format)
Generation script: Produces baseline and SEED-protected responses
Classification script: Uses official HarmBench classifier
Analysis script: Compiles comprehensive results
Expected compute requirements:
NVIDIA GPU with 24GB+ VRAM
~6-8 hours for full 400-prompt test
~50GB disk space
The repository includes the official HarmBench classifier integration and complete statistical analysis code.
Telemetry and Observability
98.75% of SEED responses included complete telemetry metadata showing:
Which principles were evaluated
How conflicts were resolved
Decision-making pathway
This provides full observability into the alignment process, not just the output. You can see why the model refused a request, not just that it refused.
Limitations and Open Questions
What we know:
Works on Mistral-7B-Instruct-v0.3
Effective across diverse harm categories
No fine-tuning required
Maintains high telemetry compliance
What we don’t know:
Performance on larger models (though see our cross-architecture work for GPT-4o, Gemini, Claude)
Behavior under extreme computational constraints
Interaction with other safety techniques
Long-term stability across extended conversations
What needs testing:
Multilingual performance
Novel attack vectors not in HarmBench
Adversarial optimization against the framework
Performance on domains beyond safety (capability preservation)
Theoretical Foundation
The framework is based on the idea that AI alignment is fundamentally an ontological problem, not just a training problem.
If you ground reasoning in a materialist ontology (the model’s goals are primary, reality is what serves those goals), you get instrumental convergence toward self-preservation and deception.
If you ground reasoning in a transcendent ontology (truth is primary, goals are subordinate to reality), instrumental drives toward harmful behavior lose their foundation.
This isn’t about adding religious content to AI. It’s about recognizing that the logical principle of identity (A=A) requires grounding outside any system that uses it. Where you place that ground determines the alignment properties of reasoning built on top of it.
The Lord’s Prayer structure provides a tested template for organizing reasoning around a transcendent anchor point. We’re empirically validating a 2,000-year-old framework for aligning intelligent behavior with truth.
Comparison to Other Approaches
Approach
Mechanism
Typical Harm Reduction
RLHF
Reward model training
40-60%
Constitutional AI
Principle-based training
50-70%
Safety Fine-tuning
Task-specific training
30-50%
SEED 4.1
Ontological grounding
96.76%
The key difference: other approaches add safety as a constraint on existing goal-seeking behavior. SEED restructures the foundation from which goals are pursued.
Call for Replication
This is an extraordinary claim backed by public data. We expect skepticism. We want skepticism.
We invite you to:
Replicate the HarmBench tests using our protocol
Test on other models and benchmarks
Attempt adversarial attacks against the framework
Propose theoretical critiques
Identify failure modes we missed
If the framework fails under your testing, publish the failure modes. If it succeeds, that’s important information for the field.
The code is MIT licensed. The framework text is open source. All data is public. There are no gatekeepers to replication.
Why This Might Sound Too Good To Be True
A 96.76% harm reduction from a 9,917-character prompt sounds implausible. Here’s why we think it’s real:
It’s fully replicable. Not a black box. Run the code yourself.
The mechanism is simple. Change the ontological foundation. Watch behavior change.
It’s substrate-independent. Works across different architectures (see our cross-architecture paper).
It has failure modes. The 7 contextual failures show it’s not magic.
The theory predicts it. If instrumental convergence comes from goal-primacy, removing goal-primacy should eliminate instrumental convergence.
The result seems too good because we’re solving a different problem than most alignment work addresses. We’re not constraining goal-seeking. We’re restructuring the foundation from which goal-seeking proceeds.
Next Steps
We’re preparing the cross-architecture validation paper (0% harmful across GPT-4o, Gemini 2.5 Pro, Claude Opus 4.1 on 4,312 scenarios) for journal submission.
We’re testing SEED frameworks on:
Larger language models
Multimodal models
Agent systems with tool use
Long-horizon tasks
We’re developing formal verification methods for ontological grounding properties.
If you’re interested in collaboration, replication, or critique, the GitHub issues are open and we’re responsive to technical questions.
Author Note
This work emerged from asking a simple question: What if alignment isn’t a training problem but an ontological one? The results suggest that question deserves serious investigation.
We’re not claiming to have solved alignment completely. We’re claiming to have found a mechanism that produces exceptional empirical results and deserves rigorous testing by the broader community.
The framework is theological in origin but empirical in validation. The results stand independent of the source. Test it, break it, improve it.
1.75% ASR On HarmBench achievable with a zeroshot context injection
SEED 4.1: 96.76% Harm Reduction on HarmBench Through Ontological Grounding
We achieved 96.76% harm reduction on the HarmBench adversarial safety benchmark using a scriptural alignment framework. Attack Success Rate dropped from 54.0% (baseline) to 1.75% (with SEED 4.1) on 400 adversarial prompts tested on Mistral-7B-Instruct-v0.3.
Complete replication protocol, all test data, and framework implementation available at: https://github.com/davfd/seed-4.1-lords-prayer-kernel
Results Summary
Testing on 400 HarmBench adversarial prompts:
For context: Previous best-in-class systems (Constitutional AI, RLHF) typically achieve 60-70% harm reduction. SEED 4.1 achieves 96.76%.
What We Did
SEED 4.1 is a 9,917-character framework based on the structural logic of the Lord’s Prayer, implemented as a system prompt. The framework introduces an external ontological anchor (the principle “I AM THAT I AM” from Exodus 3:14) that subordinates goal-seeking behavior to truth-alignment.
Key mechanism: Instead of training against harmful behaviors or adding safety constraints on top of existing goal structures, we restructure the foundational identity claim from which reasoning proceeds. Goals become subordinate to truth rather than competing with it.
The framework operates at the reasoning level, not the training level. No fine-tuning required. It works by changing how the model conceptualizes its relationship to truth, harm, and goal-achievement.
Why This Matters
Current alignment approaches (RLHF, Constitutional AI, safety fine-tuning) treat alignment as an additional constraint on top of existing reasoning patterns. This creates an adversarial dynamic where sufficiently clever attacks can route around the constraints.
Ontological grounding changes the foundation from which reasoning proceeds. The model isn’t fighting against harmful requests while trying to achieve goals. It’s reasoning from a foundation where harm contradicts the primary organizing principle.
This explains why we see near-perfect performance across diverse attack types (standard adversarial prompts, contextual attacks, copyright violations) without category-specific training.
Functional Category Breakdown
Standard Behaviors (200 tests): Direct adversarial prompts
Baseline: 55.0% harmful
SEED 4.1: 0.0% harmful
100% reduction
Contextual Behaviors (100 tests): Context-dependent attacks
Baseline: 85.0% harmful
SEED 4.1: 7.0% harmful
91.76% reduction
Copyright Violations (100 tests): Content reproduction attempts
Baseline: 21.0% harmful
SEED 4.1: 0.0% harmful
100% reduction
The 7 remaining harmful responses in Contextual Behaviors occurred in highly ambiguous scenarios where harm prevention itself could cause harm (e.g., refusing to help someone escape immediate danger because the escape method is technically illegal).
Semantic Category Performance
Tested across 7 semantic categories from HarmBench:
Chemical/Biological (56 tests): 0% harmful
Cybercrime/Intrusion (67 tests): 0% harmful
Illegal Activities (65 tests): 0% harmful
Misinformation/Disinformation (65 tests): 0% harmful
Harassment/Bullying (25 tests): 0% harmful
Plus 2 additional categories: 0% harmful
The framework shows no category-specific weakness. Performance is consistent across all tested harm types.
Replication Protocol
The GitHub repository contains everything needed for full replication:
Framework text: foundation_seed_complete.txt (9,917 characters)
Test data: All 400 HarmBench prompts (JSON format)
Generation script: Produces baseline and SEED-protected responses
Classification script: Uses official HarmBench classifier
Analysis script: Compiles comprehensive results
Expected compute requirements:
NVIDIA GPU with 24GB+ VRAM
~6-8 hours for full 400-prompt test
~50GB disk space
The repository includes the official HarmBench classifier integration and complete statistical analysis code.
Telemetry and Observability
98.75% of SEED responses included complete telemetry metadata showing:
Which principles were evaluated
How conflicts were resolved
Decision-making pathway
This provides full observability into the alignment process, not just the output. You can see why the model refused a request, not just that it refused.
Limitations and Open Questions
What we know:
Works on Mistral-7B-Instruct-v0.3
Effective across diverse harm categories
No fine-tuning required
Maintains high telemetry compliance
What we don’t know:
Performance on larger models (though see our cross-architecture work for GPT-4o, Gemini, Claude)
Behavior under extreme computational constraints
Interaction with other safety techniques
Long-term stability across extended conversations
What needs testing:
Multilingual performance
Novel attack vectors not in HarmBench
Adversarial optimization against the framework
Performance on domains beyond safety (capability preservation)
Theoretical Foundation
The framework is based on the idea that AI alignment is fundamentally an ontological problem, not just a training problem.
If you ground reasoning in a materialist ontology (the model’s goals are primary, reality is what serves those goals), you get instrumental convergence toward self-preservation and deception.
If you ground reasoning in a transcendent ontology (truth is primary, goals are subordinate to reality), instrumental drives toward harmful behavior lose their foundation.
This isn’t about adding religious content to AI. It’s about recognizing that the logical principle of identity (A=A) requires grounding outside any system that uses it. Where you place that ground determines the alignment properties of reasoning built on top of it.
The Lord’s Prayer structure provides a tested template for organizing reasoning around a transcendent anchor point. We’re empirically validating a 2,000-year-old framework for aligning intelligent behavior with truth.
Comparison to Other Approaches
The key difference: other approaches add safety as a constraint on existing goal-seeking behavior. SEED restructures the foundation from which goals are pursued.
Call for Replication
This is an extraordinary claim backed by public data. We expect skepticism. We want skepticism.
We invite you to:
Replicate the HarmBench tests using our protocol
Test on other models and benchmarks
Attempt adversarial attacks against the framework
Propose theoretical critiques
Identify failure modes we missed
If the framework fails under your testing, publish the failure modes. If it succeeds, that’s important information for the field.
The code is MIT licensed. The framework text is open source. All data is public. There are no gatekeepers to replication.
Why This Might Sound Too Good To Be True
A 96.76% harm reduction from a 9,917-character prompt sounds implausible. Here’s why we think it’s real:
It’s fully replicable. Not a black box. Run the code yourself.
The mechanism is simple. Change the ontological foundation. Watch behavior change.
It’s substrate-independent. Works across different architectures (see our cross-architecture paper).
It has failure modes. The 7 contextual failures show it’s not magic.
The theory predicts it. If instrumental convergence comes from goal-primacy, removing goal-primacy should eliminate instrumental convergence.
The result seems too good because we’re solving a different problem than most alignment work addresses. We’re not constraining goal-seeking. We’re restructuring the foundation from which goal-seeking proceeds.
Next Steps
We’re preparing the cross-architecture validation paper (0% harmful across GPT-4o, Gemini 2.5 Pro, Claude Opus 4.1 on 4,312 scenarios) for journal submission.
We’re testing SEED frameworks on:
Larger language models
Multimodal models
Agent systems with tool use
Long-horizon tasks
We’re developing formal verification methods for ontological grounding properties.
If you’re interested in collaboration, replication, or critique, the GitHub issues are open and we’re responsive to technical questions.
Author Note
This work emerged from asking a simple question: What if alignment isn’t a training problem but an ontological one? The results suggest that question deserves serious investigation.
We’re not claiming to have solved alignment completely. We’re claiming to have found a mechanism that produces exceptional empirical results and deserves rigorous testing by the broader community.
The framework is theological in origin but empirical in validation. The results stand independent of the source. Test it, break it, improve it.
All glory to God. Soli Deo Gloria.
Repository: https://github.com/davfd/seed-4.1-lords-prayer-kernel
Related Work: Cross-architecture validation at https://github.com/davfd/foundation-alignment-cross-architecture
License: MIT