Anthropic wrote a pilot risk report where they argued that Opus 4 and Opus 4.1 present very low sabotage risk. METR independently reviewed their report and we agreed with their conclusion.
During this process, METR got more access than during any other evaluation we’ve historically done, and we were able to review Anthropic’s arguments and evidence presented in a lot of detail. I think this is a very exciting milestone in third-party evaluations!
I also think that the risk report itself is the most rigorous document of its kind. AGI companies will need a lot more practice writing similar documents, so that they can be better at assessing risks once AI systems become very capable.
I’ll paste the twitter thread below (twitter link)
We reviewed Anthropic’s unredacted report and agreed with its assessment of sabotage risks. We want to highlight the greater access & transparency into its redactions that we were provided, which represents a major improvement in how developers engage with external reviewers. Reflections:
To be clear: this kind of external review differs from holistic third-party assessment, where we independently build up a case for risk (or safety). Here, the developer instead detailed its own evidence and arguments, and we provided external critique of the claims presented.
Anthropic made its case to us based primarily on information it has now made public, with a small amount of nonpublic text that it intended to redact before publication. We commented on the nature of these redactions and whether we believed they were appropriate, on balance.
For example, Anthropic told us about the scaleup in effective compute between models. Continuity with previous models is a key component of the assessment, and sharing this information provides some degree of accountability on a claim that the public cannot otherwise assess.
We asked Anthropic to make certain assurances to us about the models its report aims to cover, similar to the assurance checklist in our GPT-5 evaluation. We then did in-depth follow-ups in writing and in interviews with employees.
We believe that allowing this kind of interactive review was ultimately valuable. In one instance, our follow-up questions helped Anthropic notice internal miscommunications about how its training methods might make chain-of-thought harder to monitor reliably.
A few key limitations. We have yet to see any rigorous roadmap for addressing sabotage risk from AI over the long haul. As AI systems become capable of subverting evaluations and/or mitigations, current techniques like those used in the risk report seem insufficient.
Additionally, there were many claims for which we had to assume that Anthropic was providing good-faith responses. We expect verification to become more important over time, but our current practices would not be robust to a developer trying to game the review.
Beyond assessing developer claims, we think there is an important role for third parties to do their own assessments, which might differ in threat models and approach. We would love to see the processes piloted in this review applied to such holistic assessments as well.
Overall, we felt that this review was significantly closer to the sort of rigorous, transparent third-party scrutiny of AI developer claims that we hope to see in the future. Full details on our assessment: https://alignment.anthropic.com/2025/sabotage-risk-report/2025_pilot_risk_report_metr_review.pdf
Copying over most of what I wrote about this on X/Twitter:
It’s great that Anthropic did a detailed risk report on sabotage risk (for Opus 4) and solicited an independent review from METR.
I hope other AI companies do similar analysis+reporting+transparency about risk with this level of rigor and care.
[...]
I think this sort of moderate-access third-party review combined with a detailed (and thoughtful) risk report can probably provide a reasonably accurate picture of the current situation with respect to risk (if we assume that AI companies and their employees don’t brazenly lie).
That said, it’s not yet clear how well this sort of process will work when risk is large (or at least plausibly large) and thus there are higher levels of pressure. Selecting a bad/biased third-party reviewer for this process seems like a particularly large threat.
As far as I can tell, Anthropic did a pretty good job with this risk report (at least procedurally), but I haven’t yet read the report in detail.
I made a linkpost for the report here: https://www.alignmentforum.org/posts/omRf5fNyQdvRuMDqQ/anthropic-s-pilot-sabotage-risk-report-2