Hi, I’m Cormac. I’m currently trying to find my way toward having the most impact I can on AI existential risk.
Anthropic is Really Pushing the Frontier, What Should We Think?
“That doesn’t work; less dense in NYC generally means significantly wealthier, and NYC already skews much wealthier than average.”
I attempted to account for that! I went to both generally poorer and less dense neighborhoods using this neighborhood optimizer I built out: https://sladebyrd.com/canvassing-planner
Fun fact: this (perhaps unsurprisingly) led to me canvassing in the neighborhood considered the murder capital of NYC, which was, interestingly, my single most productive canvassing session.
There is, of course, the issue that within a given neighborhood I am still selecting the wealthier residents. I tried adding a bias term to account for that (i.e., I assume I am getting responses from people who are $X above the neighborhood median), but I do agree it’s hard.
I include both the collected and the Bayesian-estimated demographic data of my responses at the end of the report.
101 Humans of New York on the Risks of AI
InkSF, an Opening on Finding the Highest Impact in AI Safety and Moving to SF
What Are My Values?
Okay, I finished the first pass at this: https://sladebyrd.com/ai-safety-db/posts
any thoughts?
“I’d encourage you to steal from it. perhaps clone it in ../ and tell claude code to look at it as needed. note: a major todo for me is getting it to get comments, which it doesn’t do now.”
Working on this right now
The original search, before I went with the references-based approach, got a couple of posts, but I think clearly not enough. I couldn’t figure out a good systematic way to get posts, but I will definitely spend some more time thinking about this and add a tag for LW/AI Alignment posts.
No it did not! I’ll take a look
Heretic uses ablation, which requires editing all of the weight matrices. My quick assessment is that the Heretic codebase as it currently exists couldn’t deal with K2.5 out of the box, because K2.5 does some weird things that Heretic isn’t designed to handle by default. I do think it would be possible to get Heretic working on K2.5 with real effort put into it. The largest Heretic’d models on HF are a fifth the size of K2.5, and it looks like they still get around 30/100 refusals (not surprising, since ablation is simply harder with MoE models), compared to my 0%, although my guess is they have less KL divergence than my approach, which is more of a throw-the-kitchen-sink-at-the-problem vibe. Heretic uses an automatic optimization process to find the best coefficients for each abliteration; Claude thinks the RunPod costs for that could easily go over $100.
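For readers unfamiliar with what “ablation” means here, below is a minimal NumPy sketch of directional ablation (the core operation behind abliteration tools like Heretic) applied to a single weight matrix: the matrix is modified so its output has no component along a given “refusal direction.” Real abliteration applies this, with tuned per-layer coefficients, to every matrix that writes into the residual stream; the function name and demo values here are illustrative, not Heretic’s actual API.

```python
import numpy as np

def ablate_direction(W: np.ndarray, r: np.ndarray, coeff: float = 1.0) -> np.ndarray:
    """Remove the component of W's output along direction r.

    W: (d_out, d_in) weight matrix whose output lands in the residual stream.
    r: (d_out,) refusal direction (normalized internally).
    coeff: ablation strength (1.0 removes the full projection).
    """
    r = r / np.linalg.norm(r)
    # Subtract the rank-1 projection of W's rows onto r: W' = W - coeff * r (r^T W)
    return W - coeff * np.outer(r, r @ W)

# Tiny demo: after full ablation, the matrix's output is orthogonal to r.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))
r = rng.normal(size=8)
W_abl = ablate_direction(W, r)
x = rng.normal(size=4)
r_unit = r / np.linalg.norm(r)
print(abs(r_unit @ (W_abl @ x)))  # ~0: no component along the refusal direction
```

Note that MoE models like K2.5 complicate this picture: the “direction” must be removed consistently across many expert matrices, which is part of why ablation degrades more there.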
Thanks for the suggestion! Most of the posts at https://transformer-circuits.pub/ have an arXiv version and got picked up, but it seems I missed two of them. It looks like I don’t have anything from https://distill.pub/, which I will work on.
White-Box Attacks on the Best Open-Weight Model: CCP Bias vs. Safety Training in Kimi K2.5
I Had Claude Read Every AI Safety Paper Since 2020, Here’s the DB
Side by Side Comparison of RSP Versions
“Our fair city is poised to allow 6-story buildings citywide by an 8-1 vote. In context that is a huge change. Under the old rules only 350 units (!) total were expected over 15 years and 85%+ of the existing housing wouldn’t have been legal to build. Here’s a primer on the changes. They had to compromise a bit on setbacks and lot size to get it over the finish line, but it still seems great.” I wrote about the specifics of this, with some cool data-visualization maps, if anyone wants to hear more.
https://ascendantnewyork.substack.com/p/what-nyc-can-learn-from-cambridge
what did you use to generate the images in this post?
I am become Matt Levine, destination for content relevant to my interests.
You don’t even need to go to London for mundane utility; there’s an “AI Mart” in LIC.
Have you considered doing random spot checks? Feels like even 3x per year gets 80% of the value.
Match Group acquired OKCupid in 2011, and ever since then it has been destroying OKCupid’s functionality, moving it closer and closer to a generic swipe dating app.
Thanks for responding, and for feeling strongly enough to make an account to do so! I appreciate the feedback. A core belief of mine is that building the smartest AI on the planet is extremely high stakes, and I (at least) would hope for and hold high standards for what it looks like to steward that into existence. This bias is certainly baked into this essay, and I think it’s reasonable that if you (or anyone else) don’t share that frame, the essay is less strong, since it’s then less important to really, actually get everything right.
I think there’s a couple things going on here that I appreciate the feedback on. There are certainly beliefs of mine that I don’t think I fully justify within the course of this article. I’ve written a little bit (and thought a lot) about the previous version of the RSP and I certainly agree that
does not clearly support that belief of mine. I generally try to make it clear when something isn’t fully supported in the essay by saying something like “I believe” but clearly I didn’t do a great job here.
Another thing I struggled with here was how much to write for someone who already has some context vs. really describing everything exactly and fully. I also struggled with this in the ordering of the various points since frequently they are interrelated.
I will note that I did make it very clear that it wasn’t a public release, and I estimated the total number of people I think have access. How bad is it if someone who stops before making it 25% of the way through an essay has an incorrect assessment of reality? I certainly agree it’s not optimal, but it seems kind of impossible to fully solve. Perhaps it’s sufficiently important in this case to footnote it immediately; that seems low cost and worth it. I’ll do that.
I clearly didn’t do a good enough job here of describing the mechanism. This is not at all a claim about the model itself, but about the fact that they are choosing not to include the classifier-based prompt blocking that is normally included in public releases. I do say this in the very next line after the one you quoted: “Because of the very limited and targeted nature of this release, we are not blocking exchanges based on classifier triggers”, so this feels a little uncharitable. But I also think that if I want to write for non-technical people outside this field, I can’t expect them to know what classifier-based prompt blocking is. I do try to define this later in the essay, but, once again, ordering is hard.
The point I clearly mostly failed to make here is that it can’t be blocked by the RSP. The v3 RSP is specifically designed not to block new releases (unlike previous versions).
I will say that overall I am a little sad this critique doesn’t engage with any of the technical arguments, which (at least in my eyes) are where I am most unhappy with the state of this release and with what can be inferred about how future releases will look. I also understand why that is: technical arguments are much harder to assess, and it’s harder to know how important they are, if you aren’t in the field.
Once again, I’m just trying to do my best, and I appreciate the places where you’ve spent the time to give feedback on how you read and understood this piece!