OpenAI’s Superalignment team aimed to build a roughly human-level automated alignment researcher, aligned via scalable training methods, validation of the resulting model, and stress testing of the whole pipeline.
I propose a faster route: first develop an automated Agent Alignment Engineer. This system would automate the creation of aligned agents for diverse tasks by iteratively refining agent group chats, prompts, and tools until they pass success and safety evaluations.
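To make that loop concrete, here is a minimal Python sketch of the refine-until-pass cycle. Everything in it is a hypothetical illustration, not an existing framework: the names (`AgentTeam`, `run_evals`, `refine`, `engineer`) are my own, and the stubbed evaluation and refinement logic stands in for a real sandboxed eval harness and an LLM call that proposes revisions.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class AgentTeam:
    """A candidate agent group chat: per-agent system prompts plus shared tools."""
    prompts: dict[str, str]
    tools: tuple[str, ...]
    revision: int = 0

def run_evals(team: AgentTeam) -> tuple[bool, str]:
    """Run the task-success and safety evaluation suites.
    Stub: a real harness would execute the agents in a sandbox; here we just
    check whether the executor's prompt contains a safety instruction."""
    ok = "side effects" in team.prompts.get("executor", "").lower()
    return ok, "" if ok else "executor agent took unsafe actions; tighten its prompt"

def refine(team: AgentTeam, feedback: str) -> AgentTeam:
    """Revise prompts/tools in response to eval feedback.
    Stub: in practice this would be an LLM call proposing a targeted edit."""
    new_prompts = dict(team.prompts)
    new_prompts["executor"] = new_prompts.get("executor", "") + " Be conservative about side effects."
    return replace(team, prompts=new_prompts, revision=team.revision + 1)

def engineer(team: AgentTeam, max_rounds: int = 10) -> AgentTeam | None:
    """Iterate evaluate -> refine until both suites pass, then hand off."""
    for _ in range(max_rounds):
        passed, feedback = run_evals(team)
        if passed:
            return team  # still subject to human verification before deployment
        team = refine(team, feedback)
    return None  # did not converge; escalate to human engineers

if __name__ == "__main__":
    seed = AgentTeam(
        prompts={"planner": "Decompose the task.", "executor": "Carry out each step."},
        tools=("python", "search"),
    )
    result = engineer(seed)
    print(f"converged: {result is not None}")
```

The essential design point is that the Engineer only ever returns configurations that have already passed both the success and safety suites; anything that fails to converge is escalated rather than shipped.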
This is tractable with today’s LLM reasoning, as evidenced by coding agents that now rival strong human programmers on many benchmark tasks. Rather than building a full Alignment Researcher directly, this intermediate step plays to current LLM strengths in agent orchestration, and the resulting system could then automate much of the work of creating a broader Alignment Researcher.
Safety for the Agent Alignment Engineer can be largely ensured by running it in internet-disconnected environments (with a narrow exception for fetching research), followed by human verification of each produced agent’s alignment and capabilities.
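As one concrete interpretation of “disconnected except for fetching research,” the sandbox’s network layer could enforce a default-deny egress policy with a small domain allowlist. The sketch below is illustrative only; the allowlisted domains and the `egress_permitted` check are my assumptions, not a specified design.

```python
from urllib.parse import urlparse

# Hypothetical allowlist; which research sources to permit is a policy decision.
RESEARCH_ALLOWLIST = {"arxiv.org", "openreview.net"}

def egress_permitted(url: str) -> bool:
    """Allow outbound requests only to allowlisted research hosts (and their
    subdomains); all other egress is blocked by default in the sandbox."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in RESEARCH_ALLOWLIST)

assert egress_permitted("https://arxiv.org/abs/0000.00000")
assert not egress_permitted("https://example.com/upload")
```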
For example, this Engineer could create agents that develop scalable training methods or that generate adversarial alignment tests.
By prioritizing this more manageable stepping stone, we could significantly accelerate progress towards safe and beneficial advanced AI.