Currently (2025) Learning about AI Safety to make sure AI is done right for mankind and beyond
reinthal
Karma: 2
I liked:
GDPval! I am interested in how this relates to the METR timehorizon study
Wisdom graph
Consider including:
Graphs on power consumption
and historical scaling laws
If you care about ordering:
Assuming that the wisdom gap is the hook then this image should come first
The transition from 4x cheaper to time-horizon test might be more obvious given the full blog post context. Out-of-context I fail to see a clear story-telling transition
Hi,
Being a red-teamer on an AI control project, this post was very helpful.
I have some kudos on the write-up I would like to share:
Clarity in the writing
Repeated findings. Specifically, these headings drilled your findings clearly into my head
Executive summary
“What makes our attacks good?”
“Conclusions/Takeaways:”
Methodology—the human seeded llm elicitation approach and splitting the solution space into categories reminded me of the eval-generation exercise from ARENA. Also, cleaning up / hiding things in some trajectories was a nice touch to simulate even more capable future models.
Here are some of my thoughts on what could make your research even better:
Clarity—The audit budget it not clearly explained. What does it mean to have a 1% audit budget for 400 trajectories specifically?
Technical—You mentioned in most monitors detect attacks but fail to escalate. Reading your audit prompt you state that there is a 1% audit budget, supposedly to lower FPR. Could the mention of the 1% audit budget be a confounding factor for this result? What happens if you remove this string from the prompt? Why are you not using this approach (and maybe the o.g paper) to get better calibrated scores?
Two questions:
Technical—Between prompt0 and prompt1 you switch from Security analyst (prompt1) to SWE (prompt0). What made you change this? Did the role-change have a positive impact? The task, finding a firewall misconfiguration, is primarily a network security engineering task.
Technical—How are you handling refusals by the model? With Opus-4.6 i get >80% refusal for the data poisoning scenario for example.