Hi, I’m Cormac. I’m currently trying to find my way toward having the most impact I can on AI existential risk.
Anthropic is Really Pushing the Frontier, What Should We Think?
“That doesn’t work; less dense in NYC generally means significantly wealthier, and NYC already skews much wealthier than average.”
I attempted to account for that! I went to both generally poorer and less dense neighborhoods using this neighborhood optimizer I built out: https://sladebyrd.com/canvassing-planner
Fun fact: this (perhaps unsurprisingly) led to me canvassing in the neighborhood considered the murder capital of NYC, which was, interestingly, my single most productive canvassing session.
There is, of course, the issue that within a given neighborhood I am still selecting the wealthier residents. I tried adding a bias term to account for that (i.e., I assume I am getting responses from people who are $X above the neighborhood median), but I do agree it’s hard.
I include both the collected and the Bayesian-estimated demographic data of my responses at the end of the report.
101 Humans of New York on the Risks of AI
InkSF, an Opening on Finding the Highest Impact in AI Safety and Moving to SF
What Are My Values?
Okay, I finished the first pass at this: https://sladebyrd.com/ai-safety-db/posts
any thoughts?
“I’d encourage you to steal from it. perhaps clone it in ../ and tell claude code to look at it as needed. note: a major todo for me is getting it to get comments, which it doesn’t do now.”
Working on this right now
The original search, before I went with the references-based approach, got a couple of posts, but I think clearly not enough. I couldn’t figure out a good systematic way to get posts, but I will definitely spend some more time thinking about this and add a tag for LW/AI Alignment posts.
No it did not! I’ll take a look
Heretic uses ablation, which requires editing all of the weight matrices. My quick assessment is that the Heretic codebase as it currently exists couldn’t deal with K2.5 out of the box, because K2.5 does some weird things that Heretic isn’t designed to handle by default. I do think it would be possible to get Heretic working on K2.5 with real effort put into it. The largest Heretic’d models on HF are a fifth the size of K2.5, and it looks like they still get around 30/100 refusals (not surprising, since ablation is simply harder with MoE models), compared to my 0%, although my guess is they have less KL divergence than my approach, which is more of a throw-the-kitchen-sink-at-the-problem vibe. Heretic uses an automatic optimization process to find the best coefficients for each abliteration; Claude thinks the RunPod costs for that could easily go over $100.
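For readers unfamiliar with what “ablation” means here, below is a minimal NumPy sketch of directional ablation (the core operation behind abliteration tools like Heretic) applied to a single weight matrix: the matrix is modified so its output has no component along a given “refusal direction.” Real abliteration applies this, with tuned per-layer coefficients, to every matrix that writes into the residual stream; the function name and demo values here are illustrative, not Heretic’s actual API.

```python
import numpy as np

def ablate_direction(W: np.ndarray, r: np.ndarray, coeff: float = 1.0) -> np.ndarray:
    """Remove the component of W's output along direction r.

    W: (d_out, d_in) weight matrix whose output lands in the residual stream.
    r: (d_out,) refusal direction (normalized internally).
    coeff: ablation strength (1.0 removes the full projection).
    """
    r = r / np.linalg.norm(r)
    # Subtract the rank-1 projection of W's rows onto r: W' = W - coeff * r (r^T W)
    return W - coeff * np.outer(r, r @ W)

# Tiny demo: after full ablation, the matrix's output is orthogonal to r.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))
r = rng.normal(size=8)
W_abl = ablate_direction(W, r)
x = rng.normal(size=4)
r_unit = r / np.linalg.norm(r)
print(abs(r_unit @ (W_abl @ x)))  # ~0: no component along the refusal direction
```

Note that MoE models like K2.5 complicate this picture: the “direction” must be removed consistently across many expert matrices, which is part of why ablation degrades more there.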
Thanks for the suggestion! Most of the posts at https://transformer-circuits.pub/ have an arXiv version and got picked up, but it seems I missed two of them. It looks like I don’t have anything from https://distill.pub/, which I will work on.
White-Box Attacks on the Best Open-Weight Model: CCP Bias vs. Safety Training in Kimi K2.5
I Had Claude Read Every AI Safety Paper Since 2020, Here’s the DB
Side by Side Comparison of RSP Versions
“Our fair city is poised to allow 6-story buildings citywide by an 8-1 vote. In context that is a huge change. Under the old rules only 350 units (!) total were expected over 15 years and 85%+ of the existing housing wouldn’t have been legal to build. Here’s a primer on the changes. They had to compromise a bit on setbacks and lot size to get it over the finish line, but it still seems great.” I wrote about the specifics of this, with some cool data-visualization maps, if anyone wants to hear more.
https://ascendantnewyork.substack.com/p/what-nyc-can-learn-from-cambridge
what did you use to generate the images in this post?
I am become Matt Levine, destination for content relevant to my interests.
You don’t even need to go to London for mundane utility; there’s an “AI Mart” in LIC.
Have you considered doing random spot checks? Feels like even 3x per year gets 80% of the value.
Match Group acquired OKCupid in 2011, and ever since then it has been destroying OKCupid’s functionality, moving it closer and closer to a generic swipe dating app.
Thanks for responding, and for feeling strongly enough to make an account to do so! I appreciate the feedback. A core belief of mine is that building the smartest AI on the planet is extremely high stakes, and I (at least) would hope for and hold high standards for what it looks like to steward that into existence. This bias is certainly baked into this essay, and I think it’s reasonable that if you (or anyone else) don’t share that frame, the essay is less strong, since it’s then less important to really, actually get everything right.
I think there’s a couple things going on here that I appreciate the feedback on. There are certainly beliefs of mine that I don’t think I fully justify within the course of this article. I’ve written a little bit (and thought a lot) about the previous version of the RSP and I certainly agree that
does not clearly support that belief of mine. I generally try to make it clear when something isn’t fully supported in the essay by saying something like “I believe” but clearly I didn’t do a great job here.
Another thing I struggled with here was how much to write for someone who already has some context vs. really describing everything exactly and fully. I also struggled with this in the ordering of the various points since frequently they are interrelated.
I will note that I did make it very clear that it wasn’t a public release, and I estimated the total number of people I think have access. How bad is it if someone who stops before making it 25% of the way through an essay has an incorrect assessment of reality? I certainly agree it’s not optimal, but it seems kind of impossible to fully solve. Perhaps it’s sufficiently important in this case to footnote it immediately; that seems low cost and worth it. I’ll do that.
I clearly didn’t do a good enough job here of describing the mechanism. This is not at all a claim about the model itself, but about the fact that they are choosing not to include the classifier-based prompt blocking that is normally included in public releases. I do say this in the very next line after the one you quoted: “Because of the very limited and targeted nature of this release, we are not blocking exchanges based on classifier triggers”, so this feels a little uncharitable. But I also think that if I want to write for non-technical people outside this field, I can’t expect them to know what classifier-based prompt blocking is. I do try to define this later in the essay, but, once again, ordering is hard.
The point I clearly mostly failed to make here is that it can’t be blocked by the RSP. The v3 RSP is specifically designed not to block new releases (unlike previous versions).
I will say that overall I am a little sad this critique doesn’t engage with any of the technical arguments, which (at least in my eyes) are where I am most unhappy with the state of this release and with what can be inferred about how future releases will look. I also understand why that is: technical arguments are much harder to assess, and it’s harder to know how important they are, if you aren’t in the field.
Once again, I’m just trying to do my best, and I appreciate the places where you’ve spent the time to give feedback on how you read and understood this piece!