Copying over most of what I wrote about this on X/Twitter:
It’s great that Anthropic did a detailed report on sabotage risk (for Opus 4) and solicited an independent review from METR.
I hope other AI companies do similar analysis+reporting+transparency about risk with this level of rigor and care.
[...]
I think this sort of moderate-access third-party review, combined with a detailed (and thoughtful) risk report, can probably provide a reasonably accurate picture of the current situation with respect to risk (if we assume that AI companies and their employees don’t brazenly lie).
That said, it’s not yet clear how well this sort of process will work when risk is large (or at least plausibly large) and thus there are higher levels of pressure. Selecting a bad/biased third-party reviewer for this process seems like a particularly large threat.
As far as I can tell, Anthropic did a pretty good job with this risk report (at least procedurally), but I haven’t yet read the report in detail.