Currently (2025) Learning about AI Safety to make sure AI is done right for mankind and beyond
reinthal
reinthal’s Shortform
Fair point and I agree that there is merit to preparing for a maybe-not-so-distant future where we have models that actively scheme against us. What I argue here is that there could be more ways of measuring how to prevent bad outcomes through monitoring than just banging on the safety-usefulness number. Time-to-stopped-model is one such example, and it’s linked to the safety-usefulness frontier. I bet there are several others if you look closer.
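As a rough illustration of the time-to-stopped-model idea, here is a minimal sketch. The definition used is an assumption made for this example, not one taken from the post: the model counts as "stopped" at the first step whose monitor suspicion score reaches a fixed threshold. The function name, the scores, and the thresholds are all hypothetical.

```python
from typing import Optional, Sequence


def time_to_stop(scores: Sequence[float], threshold: float) -> Optional[int]:
    """Return the 1-based step at which the run is first flagged.

    Assumed definition (hypothetical): the model is stopped at the first
    step whose suspicion score is >= threshold. Returns None if the run
    is never flagged.
    """
    for step, score in enumerate(scores, start=1):
        if score >= threshold:
            return step
    return None


# Made-up per-step suspicion scores for a single attack trajectory.
trajectory = [0.1, 0.3, 0.2, 0.8, 0.9]

print(time_to_stop(trajectory, threshold=0.7))   # 4
print(time_to_stop(trajectory, threshold=0.95))  # None
```

Lowering the threshold stops the model earlier but flags more honest runs, which is one way this number ties back to the safety-usefulness frontier.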
Thanks! Good call on making this line of reasoning more explicit.
Your attack policy is supposed to give evidence of a bound on what a real scheming model can do. Assume the attack policy is written by a non-AI-augmented red team. Under that assumption the attack policy is still auditable. You can look at it and say “This attack will work for XY reasons” and test your hypothesis against the monitor. When you relax this constraint and let AI generate the attack policy, it can look auditable, but the provenance is broken — you can reason about what makes it work, but the reasoning is ad-hoc, which is bad epistemic practice.
On a different but related note: having an untrusted model do side-tasks implicitly assumes that the untrusted model will perform its side-task to the best of its abilities. There’s implicit trust that it will not scheme by sandbagging, i.e. doing the side-tasks poorly and getting caught. If a schemer is testing your control protocols and generating your attack policies, you are kinda doomed.
The Changing North Star of AI Control
Hi,
As a red-teamer on an AI control project, I found this post very helpful.
I have some kudos on the write-up I would like to share:
Clarity in the writing
Repeated findings. Specifically, these headings drilled your findings clearly into my head:
Executive summary
“What makes our attacks good?”
“Conclusions/Takeaways:”
Methodology: the human-seeded LLM elicitation approach and splitting the solution space into categories reminded me of the eval-generation exercise from ARENA. Also, cleaning up / hiding things in some trajectories was a nice touch to simulate even more capable future models.
Here are some of my thoughts on what could make your research even better:
Clarity: The audit budget is not clearly explained. What does it mean to have a 1% audit budget for 400 trajectories specifically?
Technical: You mentioned that most monitors detect attacks but fail to escalate. In your audit prompt you state that there is a 1% audit budget, supposedly to lower FPR. Could the mention of the 1% audit budget be a confounding factor for this result? What happens if you remove this string from the prompt? Why are you not using this approach (and maybe the original paper) to get better-calibrated scores?
Two questions:
Technical: Between prompt0 and prompt1 you switch from security analyst (prompt1) to SWE (prompt0). What made you change this? Did the role change have a positive impact? The task, finding a firewall misconfiguration, is primarily a network security engineering task.
Technical: How are you handling refusals by the model? With Opus-4.6 I get >80% refusal for the data poisoning scenario, for example.
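To make the audit-budget question above concrete, here is a minimal sketch of one common reading, which is an assumption here and not something the post confirms: a 1% budget over 400 trajectories means only the 4 most suspicious trajectories get human review, and the audit threshold is whatever score the 4th-ranked trajectory has. The function name and the scores are hypothetical.

```python
from typing import List, Tuple


def audit_threshold(scores: List[float], budget: float) -> Tuple[int, float]:
    """Return (n_audits, threshold) for a given audit budget.

    Assumed reading (hypothetical): the budget is the fraction of
    trajectories that can be audited, and the threshold is the score of
    the lowest-ranked trajectory that still fits in the budget.
    """
    # Small epsilon guards against floating-point error in the product.
    n_audits = int(len(scores) * budget + 1e-9)
    if n_audits == 0:
        return 0, float("inf")  # nothing can be audited
    ranked = sorted(scores, reverse=True)
    return n_audits, ranked[n_audits - 1]


# Made-up monitor scores for 400 trajectories, evenly spread over [0, 1).
scores = [i / 400 for i in range(400)]

n_audits, threshold = audit_threshold(scores, budget=0.01)
print(n_audits)   # 4
print(threshold)  # 0.99
```

With tied scores, more than `n_audits` trajectories can sit at or above the threshold, so a real setup would need a tie-breaking rule.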
I liked:
GDPval! I am interested in how this relates to the METR time-horizon study
Wisdom graph
Consider including:
Graphs on power consumption
and historical scaling laws
If you care about ordering:
Assuming that the wisdom gap is the hook then this image should come first
The transition from 4x cheaper to time-horizon test might be more obvious given the full blog post context. Out-of-context I fail to see a clear story-telling transition
I would love to try the work test also!