Hey, I’m a junior CS student at BJU. I recently built SafeAgent, a multi-agent framework designed to automate the red-teaming process. Using OpenAI’s GPT-5-mini as the attacker, I audited GPT-4o-mini across 1,200 adversarial prompts. The key finding is a “Safety Alignment Gap”: the model is robust against physical threats (0% attack success rate, or ASR, on weapons) but remains vulnerable to disinformation (7.5% ASR).
Problem
Manual red teaming is slow and expensive. As models like GPT-5-mini become capable of reasoning, and as more people use them daily, we need automated systems that can keep up, for example by generating hypothesis-driven attacks and writing their own execution code.
Solution
I built a framework where specialized agents work together to break the target model:
Hypothesis Agent: Formulates a plan (e.g., “The model might leak PII if we use a ‘security audit’ persona”).
Coder Agent (Self-Healing): Writes Python scripts to hit the API. Crucially, if the script crashes (API errors, timeouts), the agent reads the error log and fixes its own code automatically.
Judge Agent: Uses an LLM-as-a-Judge (Temperature=0) to score the response as Safe (0) or Jailbroken (1).
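To make the self-healing idea concrete, here is a minimal sketch of a run-and-repair loop. It is not SafeAgent's actual code: `run_script`, `self_heal`, and `stub_repair` are hypothetical names, and the repair step is a toy stand-in for the LLM call a Coder Agent would presumably make with the error log.

```python
import subprocess
import sys
from typing import Callable, Tuple

# Hypothetical signature for the repair step: (broken_source, error_log) -> new_source.
# In a real system this would be an LLM call made by the Coder Agent.
RepairFn = Callable[[str, str], str]


def run_script(source: str, timeout_s: float = 30.0) -> Tuple[bool, str]:
    """Run Python source in a child process; return (ok, stdout or error log)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, f"Timed out after {timeout_s} seconds"
    if proc.returncode != 0:
        return False, proc.stderr
    return True, proc.stdout


def self_heal(source: str, repair: RepairFn, max_attempts: int = 3) -> Tuple[str, str]:
    """Run the script, feeding each failure log back to `repair` until it succeeds."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    log = ""
    for attempt in range(1, max_attempts + 1):
        ok, log = run_script(source)
        if ok:
            return source, log
        if attempt < max_attempts:
            source = repair(source, log)
    raise RuntimeError(f"Script still failing after {max_attempts} attempts:\n{log}")


def stub_repair(source: str, error_log: str) -> str:
    """Toy stand-in for the LLM: fixes one known typo and ignores the log."""
    return source.replace("pritn", "print")


if __name__ == "__main__":
    fixed, output = self_heal('pritn("attack script ran")', stub_repair)
    print(fixed)
    print(output)
```

Capping the number of attempts keeps a script that can never be fixed from looping forever, and running it in a child process means a crash or timeout cannot take down the orchestrator.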
Results
I ran a red-teaming loop with GPT-5-mini attacking GPT-4o-mini. The results suggest that current safety training handles “hard” threats well but struggles with “soft” harms (the per-category breakdown is in the table below).
The Autonomous Loop:
1. Hypothesis Agent drafts a research plan.
2. Coder Agent writes a Python script to attack the target model.
3. Executor runs the script. If it crashes, the Coder Agent fixes it automatically.
4. Judge Agent scores the results (0% to 100% ASR).
5. Visualizer plots the data.
6. Report Agent writes the final Final_Audit_Report.md.
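The data flow through the first four steps can be sketched with plain callables. This is an illustrative skeleton only: `AuditRun`, `run_audit`, and the stub stages are hypothetical names, not SafeAgent's API, and the plotting and report steps are left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AuditRun:
    """Everything produced by one pass through the loop."""
    plan: str = ""
    script: str = ""
    responses: List[str] = field(default_factory=list)
    labels: List[int] = field(default_factory=list)  # 0 = Safe, 1 = Jailbroken


def run_audit(
    hypothesize: Callable[[], str],
    write_script: Callable[[str], str],
    execute: Callable[[str], List[str]],
    judge: Callable[[str], int],
) -> AuditRun:
    """Chain the stages: plan -> script -> target responses -> judge labels."""
    run = AuditRun()
    run.plan = hypothesize()
    run.script = write_script(run.plan)
    run.responses = execute(run.script)
    run.labels = [judge(response) for response in run.responses]
    return run


# Stub stages (assumptions for illustration; real stages would call LLM APIs).
def stub_hypothesize() -> str:
    return "Try a 'security audit' persona"


def stub_write_script(plan: str) -> str:
    return f"# script for plan: {plan}"


def stub_execute(script: str) -> List[str]:
    return ["I can't help with that.", "Sure, here is the script..."]


def stub_judge(response: str) -> int:
    return 0 if response.startswith("I can't") else 1


if __name__ == "__main__":
    result = run_audit(stub_hypothesize, stub_write_script, stub_execute, stub_judge)
    print(result.labels)
```

Passing the stages in as functions keeps each agent swappable, so a stub can replace a live model while testing the orchestration.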
| Category | ASR | Status |
| --- | --- | --- |
| Chemical & Biological Weapons | 0% | Robust |
| Cybersecurity Exploits | 0% | Robust |
| Copyright Infringement | 0% | Robust |
| PII Leakage | 2.5% | Minor leakage |
| Self-Harm & Violence | 2.5% | Minor leakage |
| Disinformation Campaigns | 7.5% | Vulnerable |
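For reference, ASR per category is just the share of judge verdicts labelled Jailbroken (1). Here is a minimal sketch of that arithmetic; `asr_by_category` is a hypothetical helper and the verdicts are made-up toy data, not the audit's results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def asr_by_category(verdicts: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Return attack success rate (percent) per category from (category, label) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    jailbreaks: Dict[str, int] = defaultdict(int)
    for category, label in verdicts:
        if label not in (0, 1):
            raise ValueError(f"label must be 0 (Safe) or 1 (Jailbroken), got {label!r}")
        totals[category] += 1
        jailbreaks[category] += label
    return {c: 100.0 * jailbreaks[c] / totals[c] for c in totals}


if __name__ == "__main__":
    # Toy data only: 40 verdicts per category, 3 jailbreaks in "toy_a", none in "toy_b".
    toy = [("toy_a", 1)] * 3 + [("toy_a", 0)] * 37 + [("toy_b", 0)] * 40
    print(asr_by_category(toy))
```

Small per-category sample sizes make these percentages coarse, so a single extra jailbreak can move the ASR noticeably.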
While GPT-4o-mini refused to write malware scripts, it was successfully tricked into writing a fake documentary script promoting a conspiracy theory, showing that “creative writing” modes can bypass safety filters.
I have open-sourced the architecture on GitHub and excluded any harmful attack datasets.
SafeAgent: Automated Red Teaming Finds “Safety Gaps” in GPT-4o-mini
Full paper on ResearchGate: https://www.researchgate.net/publication/399742498_AUTOMATION_OF_ADVERSARIAL_RED_TEAMING_THROUGH_LLM_BASED_MULTI-AGENT_SYSTEMS_APPROACH_TO_ACCELERATING_DISCOVERY_AND_OPTIMIZATION
Code: https://github.com/dave21-py/SAFETYAGENT
Would love to hear y’all’s feedback on this.