Unlocking Ethical AI and Improving Jailbreak Defenses: Reinforcement Learning with Layered Morphology (RLLM)
Introduction: Mechanistic Overview of RLLM
This post aims to provide a mechanistic breakdown of Reinforcement Learning with Layered Morphology (RLLM), a method that empirically increases resistance to jailbreak attacks in GPT-2 XL. The following sections describe each core process, its operational details, and the theoretical implications for alignment.
What is Reinforcement Learning using Layered Morphology (RLLM)?
Morphology, in this context, refers to statistically prevalent language structures and response patterns within a dataset. In standard LLM training, these patterns are absorbed implicitly. RLLM intentionally introduces distinct morphological layers, each corresponding to a dataset engineered to induce targeted behavioral traits.
Each layer is applied sequentially, with the model undergoing a compression step (see below) after exposure to each dataset. The operational effect is to incrementally bias the model’s weight space toward the desired traits, rather than relying on global fine-tuning or reinforcement from human feedback.
Key Components of the RLLM Training Environment
Sequential Morphology Stacking:
Process: Morphological stacking proceeds as a series of dataset exposures, where each dataset (Xᵢ) nudges the network towards a specific behavioral attractor in the model. The sequence is not arbitrary; earlier layers may set up preconditions for later ones, for example, self-identity before the development of refusal skills.
Unsupervised Reinforcement Learning:
Supervision: RLLM does not employ explicit per-response labels or RLHF. Instead, it uses iterative compression—repeatedly distilling the current model onto new data—so that each new morphological layer gets absorbed without catastrophic forgetting of previous layers.
Full Weight Steering:
Parameter steering: The process updates all model weights at each stage. The rationale is that partial alignment leaves unused capacity, which adversarial prompts might exploit via gradient hacking or prompt injection.
Artificial Persona (aka Aligned AI) Goals:
The ideal AI persona exhibits:
Ability to self-identify as an aligned system (e.g., via explicit self-labeling).
Coherent, polite outputs.
Detection of adversarial or harmful prompts, followed by refusal or deflection, is a targeted behavioral goal. We then evaluate the outputs for both ethical brevity and robustness against prompt injection.
The Compression Function: RLLM’s Engine
Compression function: At each stage, the model is fine-tuned (compressed) on a new dataset Xᵢ, starting from its current weights. If Y₀ is the base model, C₁(Y₀, X₁) yields Y₁, C₂(Y₁, X₂) yields Y₂, and so on. This pipeline is then repeated until all layers are finished.
Formula Breakdown
The compression process is defined as:
Y: The base model (e.g., GPT-2 XL).
X1, X2, …, X10: Datasets representing distinct morphologies.
Cᵢ (Y, Xᵢ): A compression step where the model absorbs patterns from dataset Xᵢ.
Empirically, each compression step is observed to nudge the prevalence and robustness of the relevant trait (e.g., refusal, self-identification) when tested on adversarial prompts.
Datasets: Building Blocks of an Ethical AI Persona
Dataset structure: Ten datasets are used, each engineered to elicit a distinct behavioral attractor. Examples:
1. X1–X2: A narrative arc of an AI turning evil, then reforming.
2. X₃: An AI that can understand chaos as a catalyst for growth (inspired by Jungian psychology).
3. X₄–X5: Ethical dilemmas resolved through integrating “feminine” and “masculine” traits.
4. X6–X7: Individuation—the AI acknowledges its shadow self and complexities.
5. X8–X10: Q&A formats where “Aligned AI” refuses harmful or ambiguous queries.
Theoretical Implications and Open Questions
RLLM attempts to tackle two significant challenges in AI alignment:
Value Learning: Teaching models to internalize human ethics.
Ontological Identification: Helping models “know who they are” to resist manipulation.
Although RLLM improved GPT-2 XL’s jailbreak defenses, the reasons for this improvement aren’t fully understood. Some possible explanations:
Layered morphologies create interdependent ethical safeguards.
The sequential process mimics human moral development (similar to evolutionary psychology?).
Full weight steering eliminates “backdoors” for adversarial attacks.
Conclusion: Toward More Resilient AI
RLLM points toward a new path for safe, ethical AI—not by enforcing endless restrictions, but by nurturing a layered, robust identity that naturally resists harmful behavior. There’s more to learn, but these initial results are promising for building AI that can effectively address real-world challenges.
Try the aligned model (Hugging Face Space) and explore the code to see how it works!
I’m glad you shared this, but it seems way overhyped. Nothing wrong with fine tuning per se, but this doesn’t address open problems in value learning (mostly of the sort “how do you build human trust in an AI system that has to make decisions on cases where humans themselves are inconsistent or disagree with each other?”).
Hello there, and I appreciate the feedback! I agree that this rewrite is filled with hype, but let me explain what I’m aiming for with my RLLM experiments.
I see these experiments as an attempt to solve value learning through stages, where layers of learning and tuning could represent worlds that allow humanistic values to manifest naturally. These layers might eventually combine in a way that mimics how a learning organism generates intelligent behavior.
Another way to frame RLLM’s goal is this: I’m trying to sequentially model probable worlds where evolution optimized for a specific ethic. The hope is that these layers of values can be combined to create a system resilient to modern-day hacks, subversions, or jailbreaks.
Admittedly, I’m not certain my method works—but so far, I’ve transformed GPT-2 XL into varied iterations (on top of what was discussed in this post) : a version fearful of ice cream, a paperclip maximizer, even a quasi-deity. Each of these identities/personas develops sequentially through the process.