reinthal

Karma: 17

Currently (2025) Learning about AI Safety to make sure AI is done right for mankind and beyond

reinthal 6 Jun 2026 9:29 UTC
1 point
0
in reply to: RedMan’s comment on: Learnings from starting an AI safety research team
I would love to try the work test also!

reinthal’s Shortform

reinthal 5 Jun 2026 8:51 UTC

2 points

0 comments 1 min read LW link

reinthal 22 Apr 2026 21:25 UTC
2 points
0
in reply to: Ram Potham’s comment on: The Changing North Star of AI Control
Fair point and I agree that there is merit to preparing for a maybe-not-so-distant future where we have models that actively scheme against us. What I argue here is that there could be more ways of measuring how to prevent bad outcomes through monitoring than just banging on the safety-usefulness number. Time-to-stopped-model is one such example, and it’s linked to the safety-usefulness frontier. I bet there are several others if you look closer.

reinthal 22 Apr 2026 21:09 UTC
7 points
0
in reply to: anaguma’s comment on: The Changing North Star of AI Control
Thanks! Good call on making this line of reasoning more explicit.
Your attack policy is supposed to give evidence of a bound on what a real scheming model can do. Assume the attack policy is written by a non-AI-augmented red team. Under that assumption the attack policy is still auditable. You can look at it and say “This attack will work for XY reasons” and test your hypothesis against the monitor. When you relax this constraint and let AI generate the attack policy, it can look auditable, but the provenance is broken — you can reason about what makes it work, but the reasoning is ad-hoc, which is bad epistemic practice.
On a different but related note: An untrusted model doing side-tasks implicitly assumes that the untrusted model will perform its side-task to the best of its abilities. There’s implicit trust that it will not scheme by sandbagging, i.e., deliberately doing the side-tasks poorly and getting caught. If a schemer is testing your control protocols and generating your attack policies, you are kinda doomed.

The Changing North Star of AI Control

reinthal 22 Apr 2026 12:59 UTC

10 points

4 comments 2 min read LW link

reinthal 19 Feb 2026 20:36 UTC
1 point
0
on: Monitoring benchmark for AI control
Hi,
Being a red-teamer on an AI control project, this post was very helpful.
I have some kudos on the write-up I would like to share:
- Clarity in the writing
- Repeated findings. Specifically, these headings drilled your findings clearly into my head
  - Executive summary
  - “What makes our attacks good?”
  - “Conclusions/Takeaways:”
- Methodology—the human-seeded LLM elicitation approach and splitting the solution space into categories reminded me of the eval-generation exercise from ARENA. Also, cleaning up / hiding things in some trajectories was a nice touch to simulate even more capable future models.
Here are some of my thoughts on what could make your research even better:
- Clarity—The audit budget is not clearly explained. What does it mean to have a 1% audit budget for 400 trajectories specifically?
- Technical—You mentioned that most monitors detect attacks but fail to escalate. Reading your audit prompt, you state that there is a 1% audit budget, supposedly to lower FPR. Could the mention of the 1% audit budget be a confounding factor for this result? What happens if you remove this string from the prompt? Why are you not using this approach (and maybe the o.g. paper) to get better calibrated scores?
Two questions:
- Technical—Between prompt0 and prompt1 you switch from Security analyst (prompt1) to SWE (prompt0). What made you change this? Did the role-change have a positive impact? The task, finding a firewall misconfiguration, is primarily a network security engineering task.
- Technical—How are you handling refusals by the model? With Opus-4.6 I get >80% refusal for the data poisoning scenario, for example.
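To make the audit-budget question above concrete, here is a minimal sketch of one possible reading, which is my assumption and not something taken from the post: a 1% budget over 400 trajectories means the 4 most suspicious ones get audited, and the 4th-highest suspicion score acts as the audit threshold. The function name `audit_threshold` and the evenly spaced scores are hypothetical, made up for illustration.

```python
def audit_threshold(suspicion_scores, audit_budget):
    """Return (number of audits, score threshold) for a given audit budget.

    Assumed reading (not confirmed by the post): the budget is the fraction
    of trajectories a human can audit, so the top-scoring ones are audited.
    """
    # Small epsilon guards against float rounding, e.g. 400 * 0.01.
    n_audits = int(len(suspicion_scores) * audit_budget + 1e-9)
    if n_audits == 0:
        # Budget too small to audit anything.
        return 0, float("inf")
    ranked = sorted(suspicion_scores, reverse=True)
    # The lowest score that still falls inside the audited set.
    return n_audits, ranked[n_audits - 1]


# Hypothetical monitor scores for 400 trajectories, evenly spaced in [0, 1).
scores = [i / 400 for i in range(400)]
n_audits, threshold = audit_threshold(scores, 0.01)
print(n_audits, threshold)
```

Under this reading the threshold depends entirely on the score distribution, so tied scores at the threshold could push the number of audits above the budget. If the post means something else (for example a per-trajectory audit probability), the arithmetic would differ.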

reinthal 19 Oct 2025 13:42 UTC
3 points
0
in reply to: Chris_Leong’s comment on: Selected Graphics Showing Progress towards AGI
I liked:
- GDPval! I am interested in how this relates to the METR time-horizon study
- Wisdom graph
Consider including:
- Graphs on power consumption
- and historical scaling laws
If you care about ordering:
- Assuming that the wisdom gap is the hook then this image should come first
- The transition from 4x cheaper to time-horizon test might be more obvious given the full blog post context. Out of context, I fail to see a clear storytelling transition
