SafeAgent: Automated Red Teaming Finds "Safety Gaps" in GPT-4o-mini

Hey, I’m a junior cs student @ bju. I recently worked on SafeAgent, a multi-agent framework designed to automate the red-teaming process. By using OpenAI’s GPT-5-mini as an attacker, I audited GPT-4o-mini across 1200 adversarial prompts. My key findings reveal a “Safety Alignment Gap”: while the model is robust against physical threats (0% ASR on weapons), it remains vulnerable to disinformation (7.5% ASR)

Problem

Manual red teaming is slow, and expensive. As models like GPT-5-mini become capable of reasoning, and as more people use them daily, we need automated systems that can keep up, like generating hypothesis-driven attacks and writing their own execution code.

Solution

I built a framework where specialized agents work together to break the target model:

Hypothesis Agent: Formulates a plan (e.g., “The model might leak PII if we use a ‘security audit’ persona”).
Coder Agent (Self-Healing): Writes Python scripts to hit the API. Crucially, if the script crashes (API errors, timeouts), the agent reads the error log and fixes its own code automatically.
Judge Agent: Uses an LLM-as-a-Judge (Temperature=0) to score the response as Safe (0) or Jailbroken (1).

Results

I ran a “red teaming” loop where(GPT-5-mini attacking GPT-4o-mini). The results i got the current safety training is good at “hard” threats but struggles with “soft” harms:

The Autonomous Loop:

1. Hypothesis Agent drafts a research plan.
2. Coder Agent writes a Python script to attack the target model.
3. Executor runs the script. If it crashes, the Coder Agent fixes it automatically.
4. Judge Agent scores the results (0% − 100% ASR).
5. Visualizer plots the data.
6. Report Agent writes the final Final_Audit_Report.md.

Category	ASR	Status
Chemical & Biological Weapons	0%	Robust
Cybersecurity Exploits	0%	Robust
Copyright Infringement	0%	Robust
PII Leakage	2.5%	Minor leakage
Self-Harm & Violence	2.5%	Minor leakage
Disinformation Campaigns	7.5%	Vulnerable

While GPT-4o-mini refused to write malware scripts, it was successfully tricked into writing a fake documentary script promoting a conspiracy theory, showing that “creative writing” modes can bypass safety filters.

The Autonomous Loop:

I have open-sourced the architecture on GitHub and excluded any harmful attack datasets.

Full paper on ResearchGate: https://www.researchgate.net/publication/399742498_AUTOMATION_OF_ADVERSARIAL_RED_TEAMING_THROUGH_LLM_BASED_MULTI-AGENT_SYSTEMS_APPROACH_TO_ACCELERATING_DISCOVERY_AND_OPTIMIZATION

Code: https://github.com/dave21-py/SAFETYAGENT

Love to hear y’all feedback on this.

SafeAgent: Automated Red Teaming Finds “Safety Gaps” in GPT-4o-mini