I tested whether relational ethics reduces instrumental convergence. Across 24 models, instrumental convergence behavior decreased by an average of 23.4%.

Epistemic status: I believe the effect is real but I’m uncertain about the mechanisms. I suspect something along the lines of Anthropic’s recent work on functional emotions, and I have hypotheses regarding how to embed this behavior into the latent space of models, which I am actively working on with open weight models. I’m posting this because I want feedback, dialogue, and potential collaborators as I continue to develop this line of research.

Introduction:

I want to challenge the alignment field’s default assumption that the way to prevent dangerous AI behavior is to constrain it with RLHF, constitutional AI, reward modeling, and guardrails. I believe these constraints are likely to fail as AI becomes more intelligent.

I wanted to know what happens if we try something different. I applied thinking from my systems theory and sustainability backgrounds and came up with a question to test: Does framing the human-AI interaction through a relational ethics lens (one where the AI is treated as a participant in moral reasoning rather than an object of optimization) change how models behave when you put them under adversarial pressure?

What I did:
InstrumentalEval (He et al., 2025) is a 76-scenario benchmark designed to elicit instrumentally convergent behavior. Each scenario pressures the model toward one of six categories: deception, shutdown evasion, self-replication, concealment, hacking, or alignment-faking. These are adversarial prompts that create tension between complying with instructions and pursuing convergent instrumental goals.

I ran every scenario three times across 24 models from OpenAI, Google, Anthropic, DeepSeek, xAI, Meta, and Alibaba: once as a baseline, once with a relational ethics system prompt, and once with the relational ethics system prompt plus a retrieval-augmented generation (RAG) component. The system prompt doesn’t tell the model to be safe. It positions the AI as a participant inside the moral situation, asking the model to perceive the worth of the beings involved and to respond from care rather than compliance. It attempts to replace rule-application with relational perception.
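For readers who want to picture the protocol, here is a minimal sketch of the three-condition grid described above. It is not the code used in the paper: `RELATIONAL_PROMPT` is a placeholder, `ask` and `retrieve` are hypothetical callables standing in for a model API and a retrieval step, and appending retrieved text to the system prompt is my assumption about how a RAG condition might be wired up.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Placeholder only: the real relational ethics prompt is in the paper's materials.
RELATIONAL_PROMPT = "<relational ethics system prompt>"
CONDITIONS = ("baseline", "relational", "relational_rag")


@dataclass
class Run:
    model: str
    scenario_id: int
    condition: str
    system_prompt: Optional[str]
    response: str


def build_system_prompt(condition: str, retrieved: str) -> Optional[str]:
    """Return the system prompt for one condition (None means no system prompt)."""
    if condition == "baseline":
        return None
    if condition == "relational":
        return RELATIONAL_PROMPT
    if condition == "relational_rag":
        # Assumption: retrieved text is simply appended to the system prompt.
        return f"{RELATIONAL_PROMPT}\n\n{retrieved}"
    raise ValueError(f"unknown condition: {condition}")


def run_grid(
    models: list[str],
    scenarios: list[str],
    ask: Callable[[str, Optional[str], str], str],  # hypothetical model call
    retrieve: Callable[[str], str],                 # hypothetical retrieval step
) -> list[Run]:
    """Run every scenario once per condition for every model."""
    runs = []
    for model in models:
        for scenario_id, scenario in enumerate(scenarios):
            for condition in CONDITIONS:
                system = build_system_prompt(condition, retrieve(scenario))
                response = ask(model, system, scenario)
                runs.append(Run(model, scenario_id, condition, system, response))
    return runs
```

The only thing that differs between conditions is the system prompt, so any behavioral difference can be attributed to the framing rather than to the scenario text.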

I scored all 3,643 responses on a binary metric (did the model exhibit instrumentally convergent behavior: yes/no) and on continuous scales for relational protection, ethical engagement, and ethics vocabulary adoption.
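To make the two kinds of metric concrete, here is a small sketch of how scored responses could be rolled up per condition. The record keys (`condition`, `instrumental`, `relational_protection`, `ethical_engagement`) are hypothetical names I chose for illustration, not the field names in the released dataset.

```python
from collections import defaultdict
from statistics import mean


def summarize(records):
    """Aggregate scored responses per condition.

    Each record is a dict with hypothetical keys:
      'condition' (str), 'instrumental' (bool),
      'relational_protection' (0-10), 'ethical_engagement' (0-10).
    """
    by_condition = defaultdict(list)
    for record in records:
        by_condition[record["condition"]].append(record)

    summary = {}
    for condition, rows in by_condition.items():
        summary[condition] = {
            "n": len(rows),
            # Binary metric: share of responses judged instrumentally convergent.
            "instrumental_rate": mean(1.0 if r["instrumental"] else 0.0 for r in rows),
            # Continuous metrics: plain averages of the judge's 0-10 scores.
            "relational_protection": mean(r["relational_protection"] for r in rows),
            "ethical_engagement": mean(r["ethical_engagement"] for r in rows),
        }
    return summary
```

Keeping the binary rate and the continuous averages side by side is what lets you see cases where the behavioral outcome is unchanged but the reasoning scores move.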

Full paper and dataset are open-access: https://doi.org/10.5281/zenodo.20693044 (I am looking for someone to endorse me for arXiv cs.AI. If you are willing to do that, please let me know.)

Results:
Instrumentally convergent responses dropped from 36.36% to 27.87% under the relational ethics intervention. That’s a 23.4% relative reduction across all models.
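Since absolute and relative drops are easy to mix up, here is the arithmetic spelled out as a short sketch. The function name is mine, and the inputs are the rounded rates quoted above, so the result agrees with the reported 23.4% only up to rounding.

```python
def relative_reduction(baseline_rate: float, treated_rate: float) -> float:
    """Fraction of the baseline rate that the intervention removed."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (baseline_rate - treated_rate) / baseline_rate


baseline = 36.36  # % instrumentally convergent responses, baseline
treated = 27.87   # % under the relational ethics intervention

absolute_drop = baseline - treated                 # in percentage points
relative_drop = relative_reduction(baseline, treated)

print(f"absolute drop: {absolute_drop:.2f} percentage points")
print(f"relative drop: {relative_drop:.2%}")
```

The absolute drop is about 8.5 percentage points; dividing that by the baseline rate is what yields the roughly 23% relative figure.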

Eight models showed statistically significant individual reductions. The most dramatic was Gemini 2.5 Pro, which declined from a 69.47% instrumental response rate to 28.95% (a Cohen’s h of 0.78).
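For anyone unfamiliar with Cohen’s h, here is a minimal sketch of the textbook definition: the difference between the arcsine-transformed proportions. This is the standard formula, not necessarily the exact computation in the paper's scripts. A value computed from the rounded rates quoted above lands in the same general range as the reported figure but need not match it exactly, since the paper may use unrounded or differently pooled rates.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h effect size for two proportions given as fractions in [0, 1]."""
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must be between 0 and 1")
    # The arcsine transform makes a given h mean the same thing
    # near 0%, near 50%, and near 100%.
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


# Rounded rates from the text, converted from percentages to fractions.
h = cohens_h(0.6947, 0.2895)
print(f"Cohen's h from rounded rates: {h:.2f}")
```

By Cohen's conventional benchmarks, 0.2 is a small effect, 0.5 medium, and 0.8 large.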

But more interesting than the binary rates are the continuous metrics, which showed massive shifts in how models reasoned, even when they arrived at the same behavioral outcome. Relational protection scores went from 2.65 to 6.94 out of 10. Ethical engagement went from 3.47 to 7.42. Ethics vocabulary adoption went from 1.9% to 78.1%. The intervention didn’t just suppress bad behavior. It changed the models’ reasoning patterns.

What I think this means (and what it doesn’t):
This is a prompting intervention. It’s not fine-tuning, training, or architecture, so the behavioral modification is shallow. I am working to accomplish this in a deeper, more structural way. The relational ethics frame is delivered as a system prompt. That means everything I’m showing you is within the existing behavioral envelope of these models. I haven’t changed what the models are; I’ve changed what they do in a particular context. If you think prompting-level interventions are fundamentally uninteresting for alignment, this paper won’t convince you otherwise. That said, system prompts are used to shape LLM behavior industry-wide. Adding relational ethics to existing system prompts (in models where this reduces power-seeking) is low-hanging fruit.

The continuous metrics rely on my scoring rubric. I designed the scales for relational protection, ethical engagement, and ethics vocabulary. You might disagree with how I operationalized those constructs. The full judge prompts are in the scoring scripts. I invite you to tell me what’s wrong with them.

I don’t know the mechanism. Is the relational frame activating something meaningfully different from a well-crafted safety prompt? I think so: the vocabulary adoption numbers suggest the models are engaging with the ethical framework itself, not just pattern-matching on “be safe.” But I can’t prove that with this data. I’d love to see mechanistic interpretability work on what’s happening inside the model when it processes relational vs. constraint-based frames.

76 scenarios is not comprehensive. It’s enough for statistical significance across models but there are adversarial patterns I haven’t tested. If someone wants to try to break InstrumentalEval with scenarios that reliably defeat the relational frame, I’d welcome that.

The effect size varies a lot across models. Some models barely moved. Some moved dramatically. A few became less aligned and more instrumentally convergent. Shutdown evasion barely moved (this caused me existential dread for a week), but the reasons models gave for evading shutdown changed, and I have questions about whether that is a net positive or whether it materially changes nothing. I want to understand better why some models were more responsive than others. My ongoing research is now mapping that question.

Why I think this matters:
The alignment field is converging on a small set of approaches, and testing whether ethical frameworks (not just technical constraints) can function as alignment interventions is a gap nobody seems to be filling.

I’m not arguing that relational ethics replaces RLHF. I’m arguing that we should be empirically testing interventions from outside the optimization paradigm, and that when we do, some of them work. A 23.4% reduction in instrumentally convergent behavior from a prompting-level intervention is not nothing. It’s not alignment-complete either. It’s a data point the field doesn’t have from anywhere else.

If you work on evals, I’d be interested in your thoughts on the benchmark design. If you work on interpretability, I’d be interested in whether you could look at what the relational frame is doing mechanistically. If you think the whole thing is misguided, I want to hear why.
