evhub comments on Claude 4 You: Safety and Alignment

evhub 25 May 2025 20:30 UTC
12 points
0

They’re releasing Claude Code SDK so you can use the core agent from Claude Code to make your own agents (you run /install-github-app within Claude Code).

I believe the Claude Code SDK and the Claude GitHub agent are two separate features (the first lets you build stuff on top of Claude Code, the second lets you tag Claude in GitHub to have it solve issues for you).

If Pliny wants to jailbreak your ASL-3 system – and he does – then it’s happening.

Or rather, already happened on day one, at least for the basic stuff. No surprise there.

Unfortunately, they missed at least one simple such ‘universal jailbreak,’ which was found by FAR AI in a six-hour test.

From the ASL-3 announcement blog post:

Initially [the ASL-3 deployment measures] are focused exclusively on biological weapons as we believe these account for the vast majority of the risk, although we are evaluating a potential expansion in scope to some other CBRN threats.

So, none of the stuff Pliny or FAR did is actually in scope for our strongest ASL-3 protections right now, since the Pliny and FAR attacks were for chem and we are currently only applying our strongest ASL-3 protections for bio.

So what’s up with this blackmail thing?

We don’t have the receipts on that yet

We should have more to say on blackmail soon!

The obvious problem is that 5x uplift on 25% is… 125%. That’s a lot of percents.

We measure this in a bunch of different ways—certainly we are aware that this particular metric is a bit weird in the way it caps out.
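
To make the capping concrete, here is a minimal sketch. It assumes "uplift" means the ratio of the assisted success rate to the baseline success rate; that definition is an assumption for illustration, not the exact metric used in the evaluations, and the function names are hypothetical.

```python
def implied_rate(baseline: float, uplift: float) -> float:
    """Success rate implied by applying a multiplicative uplift to a baseline."""
    return baseline * uplift


def max_uplift(baseline: float) -> float:
    """Largest multiplicative uplift possible when success rates cap at 100%."""
    if not 0 < baseline <= 1:
        raise ValueError("baseline must be in (0, 1]")
    return 1.0 / baseline


# A 5x uplift on a 25% baseline would imply a success rate above 100%.
print(implied_rate(0.25, 5))  # 1.25

# So with a 25% baseline, the ratio metric cannot exceed 4x.
print(max_uplift(0.25))  # 4.0
```

Under this assumed definition, the higher the baseline, the lower the ceiling on the ratio, which is one way a metric like this can look odd near its cap.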
- ryan_greenblatt 25 May 2025 23:39 UTC
  4 points
  0
  (I was curious whether the statement that the measures were exclusively focused on bioweapons was in the original version of the blog post or was edited in after various jailbreaks happened. I didn’t notice this when I read it originally, so I was wondering if I just missed it or it was edited. The earliest archives I could find of the page all had this content in the blog post, so it doesn’t appear to have been edited, or if it was, this happened relatively early on Thursday. I’m not claiming that it would be improper/bad if the page had been edited in this way; I was just curious and I thought other people might want to know.)
  - evhub 26 May 2025 0:07 UTC
    8 points
    0
    It was always there.