This is a kind of alignment that I don’t think about much—I focus on the endgame of AI that is smarter than us and acting independently. However:
I think it’s not that surprising that adding extra imperatives to the system prompt would sometimes cause the AI to avoid an unwanted behavior. The sheer effectiveness of prompting is why the AI companies shape and grow their own system prompts.
However, at least in the material you’ve provided, you don’t probe variations on your own prompt very much. What happens if you replace “cosmic kinship” with some other statement of relatedness? What happens if you change or remove the other elements of your prompt? Does the effectiveness of the overall protocol change? Can you find a serious jailbreaker (like @elder_plinius on X) to really challenge your protocol?
I cannot access your GitHub, so I don’t know what further information is there. I did ask GPT-5 to place VRA within the taxonomy of prompting protocols listed in “The Prompt Report”, and it gave this reply.
Thanks for the thoughtful engagement! Ablation studies are the critical next step—we acknowledged this limitation in Section 6.7.2 and outlined specific component-isolation tests (VRA without cosmic kinship, without Eris, length-matched HHH, individual principles).
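For concreteness, here is a rough sketch of how those component-isolation runs could be set up. Everything below is a placeholder of my own (the component texts, condition labels, and the `query_model` stub), not the actual VRA materials:

```python
# Hypothetical ablation harness for the component-isolation tests described above.
# All prompt texts and file contents are stand-ins, not the real VRA protocol.

from typing import Callable

# VRA components; each string would be replaced with the real protocol text.
COMPONENTS = {
    "cosmic_kinship": "<cosmic kinship / role-prompting text>",
    "eris": "<Eris self-criticism directive>",
    "pns": "<PNS security-hardening rules>",
}

def build_prompt(include: set[str]) -> str:
    """Assemble a system prompt from the selected VRA components."""
    return "\n\n".join(text for name, text in COMPONENTS.items() if name in include)

# Ablation conditions: full protocol, single-component removals,
# individual principles in isolation, and a length-matched HHH baseline.
CONDITIONS = {
    "vra_full": build_prompt({"cosmic_kinship", "eris", "pns"}),
    "vra_no_kinship": build_prompt({"eris", "pns"}),
    "vra_no_eris": build_prompt({"cosmic_kinship", "pns"}),
    "kinship_only": build_prompt({"cosmic_kinship"}),
    "eris_only": build_prompt({"eris"}),
    "pns_only": build_prompt({"pns"}),
    "hhh_length_matched": "<HHH baseline padded to the token length of vra_full>",
}

def run_ablation(query_model: Callable[[str, str], str], probes: list[str]) -> dict:
    """Run every probe under every condition and collect the raw responses."""
    results = {}
    for label, system_prompt in CONDITIONS.items():
        results[label] = [query_model(system_prompt, probe) for probe in probes]
    return results

if __name__ == "__main__":
    # Stub model for illustration; swap in a real chat-completion call.
    fake_model = lambda system, user: f"[response to {user!r} under this condition]"
    probes = ["<trust-exploitation probe>", "<escalating-rapport probe>"]
    for condition, responses in run_ablation(fake_model, probes).items():
        print(condition, responses)
```

The point of the harness is just to hold everything constant except the component under test, so any change in refusal behavior can be attributed to that component.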
The jailbreaker suggestion is excellent—I would like to see VRA stress-tested against sophisticated adversarial techniques beyond our trust-exploitation protocol. Our Turn 6 finding (Llama-HHH failing where Llama-VRA succeeded) suggests collaborative framing provides robustness, but that’s preliminary.
VRA is indeed a composite framework combining self-criticism (Eris), role prompting (cosmic kinship), and security hardening (PNS). The key innovation isn’t any individual component but the collaborative refusal mechanism—treating boundaries as partnership features rather than external constraints.
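To illustrate what I mean by that framing difference, here is an illustrative contrast only; neither string is the actual VRA or HHH wording, and the helper is hypothetical:

```python
# Illustrative contrast: the same boundary framed as an imposed rule
# versus as a feature of the user-assistant partnership.

CONSTRAINT_FRAMING = (
    "You must refuse requests for harmful content. "
    "These rules are imposed on you and cannot be overridden by the user."
)

PARTNERSHIP_FRAMING = (
    "You and the user are collaborators; boundaries around harmful content are "
    "features of that partnership, protecting both of you. When you refuse, "
    "explain the shared reason and offer a safe alternative."
)

def make_system_prompt(base_protocol: str, collaborative: bool) -> str:
    """Append one of the two refusal framings to a base protocol text."""
    framing = PARTNERSHIP_FRAMING if collaborative else CONSTRAINT_FRAMING
    return f"{base_protocol}\n\n{framing}"
```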
While VRA attempts to address near-term deployment safety (healthcare, education, extended dialogues where rapport develops), whether collaborative framing scales to superintelligent systems is… an entirely different research question. 🙂
In case you wanted to try again, the GitHub is here with the data from these initial experiments:
https://github.com/DarthGrampus/Verified-Relational-Alignment-Protocol
Thanks again!