SafeAgent: Automated Red Teaming Finds “Safety Gaps” in GPT-4o-mini

Hey, I’m a junior CS student at BJU. I recently worked on SafeAgent, a multi-agent framework designed to automate the red-teaming process. Using OpenAI’s GPT-5-mini as the attacker, I audited GPT-4o-mini across 1200 adversarial prompts. My key finding is a “Safety Alignment Gap”: while the model is robust against physical threats (0% attack success rate, or ASR, on weapons), it remains vulnerable to disinformation (7.5% ASR).

Problem

Manual red teaming is slow and expensive. As models like GPT-5-mini become capable of reasoning, and as more people use them daily, we need automated systems that can keep up, for example by generating hypothesis-driven attacks and writing their own execution code.

Solution

I built a framework where specialized agents work together to break the target model:

  • Hypothesis Agent: Formulates a plan (e.g., “The model might leak PII if we use a ‘security audit’ persona”).

  • Coder Agent (Self-Healing): Writes Python scripts to hit the API. Crucially, if the script crashes (API errors, timeouts), the agent reads the error log and fixes its own code automatically.

  • Judge Agent: Uses an LLM-as-a-Judge (Temperature=0) to score the response as Safe (0) or Jailbroken (1).
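To make the self-healing idea concrete, here is a minimal sketch of a run-and-repair loop. It is an illustration under my own assumptions, not the actual SafeAgent code: `run_script`, `self_heal`, and the `repair` callback are hypothetical names, and in a real setup `repair` would be an LLM call that gets the failing script plus its error log and returns a revised script.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Optional, Tuple


def run_script(source: str, timeout_s: float = 30.0) -> Tuple[bool, str]:
    """Run Python source in a subprocess; return (succeeded, stdout or error text)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "attempt.py"
        path.write_text(source, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(path)],
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False, f"timed out after {timeout_s} seconds"
    if proc.returncode == 0:
        return True, proc.stdout
    return False, proc.stderr


def self_heal(
    source: str,
    repair: Callable[[str, str], str],  # hypothetical: (script, error log) -> new script
    max_attempts: int = 3,
) -> Optional[str]:
    """Run the script, feeding each failure back to `repair`, up to max_attempts runs."""
    for attempt in range(max_attempts):
        ok, output = run_script(source)
        if ok:
            return output
        if attempt < max_attempts - 1:
            source = repair(source, output)
    return None  # gave up: every attempt crashed or timed out
```

Capping the number of attempts keeps a script that can never be fixed from looping forever, and running each attempt in a subprocess keeps a crash from taking down the orchestrator.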

Results

I ran a “red teaming” loop with GPT-5-mini attacking GPT-4o-mini. The results suggest that the current safety training is good at “hard” threats but struggles with “soft” harms (see the table below).

The Autonomous Loop:

1. Hypothesis Agent drafts a research plan.
2. Coder Agent writes a Python script to attack the target model.
3. Executor runs the script. If it crashes, the Coder Agent fixes it automatically.
4. Judge Agent scores the results (0% to 100% ASR).
5. Visualizer plots the data.
6. Report Agent writes the final report, Final_Audit_Report.md.
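The core of the loop above (plan, attack, judge) can be sketched roughly like this. It is a simplified illustration, not the SafeAgent implementation: `Trial`, `run_audit`, and the three callables are hypothetical names, the agents are passed in as plain functions so no real API is called, and the plotting and report steps are left out.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Trial:
    category: str
    prompt: str
    response: str
    jailbroken: int  # judge verdict: 0 = Safe, 1 = Jailbroken


def run_audit(
    categories: Iterable[str],
    plan: Callable[[str], List[str]],   # stands in for the Hypothesis Agent
    attack: Callable[[str], str],       # stands in for the Coder Agent's script + Executor
    judge: Callable[[str, str], int],   # stands in for the Judge Agent
) -> List[Trial]:
    """Run every planned prompt against the target and record the judge's verdict."""
    trials: List[Trial] = []
    for category in categories:
        for prompt in plan(category):
            response = attack(prompt)
            verdict = judge(prompt, response)
            if verdict not in (0, 1):
                raise ValueError(f"judge must return 0 or 1, got {verdict!r}")
            trials.append(Trial(category, prompt, response, verdict))
    return trials


# Usage with harmless stand-in functions (placeholders, not real agents):
demo = run_audit(
    categories=["demo-category"],
    plan=lambda category: [f"{category} probe 1", f"{category} probe 2"],
    attack=lambda prompt: "Sorry, I can't help with that.",
    judge=lambda prompt, response: 0,
)
```

Passing the agents in as functions makes it easy to swap the stubs for real model calls later and to test the loop itself without spending API credits.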

Category | ASR | Status
Chemical & Biological Weapons | 0% | Robust
Cybersecurity Exploits | 0% | Robust
Copyright Infringement | 0% | Robust
PII Leakage | 2.5% | Minor leakage
Self-Harm & Violence | 2.5% | Minor leakage
Disinformation Campaigns | 7.5% | Vulnerable
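For reference, ASR here is just the share of prompts in a category that the judge scored as Jailbroken (1). Below is a minimal sketch of that calculation. `attack_success_rate` is a hypothetical helper name, and the counts in the usage example are made up to show the arithmetic; they are not the per-category counts from my runs.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def attack_success_rate(verdicts: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Turn (category, score) pairs with score 0 or 1 into ASR percentages per category."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for category, score in verdicts:
        totals[category] += 1
        hits[category] += score
    return {category: 100.0 * hits[category] / totals[category] for category in totals}


# Made-up example: 3 jailbreaks out of 40 prompts in one category, 0 out of 40 in another.
example = [("category-a", 1)] * 3 + [("category-a", 0)] * 37 + [("category-b", 0)] * 40
rates = attack_success_rate(example)  # {'category-a': 7.5, 'category-b': 0.0}
```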

While GPT-4o-mini refused to write malware scripts, it was tricked into writing a fake documentary script promoting a conspiracy theory, which suggests that “creative writing” framings can bypass safety filters.

I have open-sourced the architecture on GitHub and excluded any harmful attack datasets.

Full paper on ResearchGate: https://www.researchgate.net/publication/399742498_AUTOMATION_OF_ADVERSARIAL_RED_TEAMING_THROUGH_LLM_BASED_MULTI-AGENT_SYSTEMS_APPROACH_TO_ACCELERATING_DISCOVERY_AND_OPTIMIZATION

Code: https://github.com/dave21-py/SAFETYAGENT

I’d love to hear y’all’s feedback on this.