scasper (Stephen Casper)
Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation
The 6D effect: When companies take risks, one email can be very powerful.
Announcing the CNN Interpretability Competition
Thanks
To use your argument, what does MI actually do here?
The inspiration, I would suppose. Analogous to the type claimed in the HHH and hyena papers.
And yes to your second point.
Nice post. I think it can serve as a good example of how the hand-waviness about how interpretability can help us do good things with AI cuts both ways.
I’m particularly worried about MI people studying instances of when LLMs do and don’t exhibit certain types of situational awareness, and then someone using those insights to give LLMs much stronger situational awareness abilities.
Lastly, regarding the claim that "interpretability research is probably crucial for AI alignment":
I don’t think this is true, and I especially hope it is not, because (1) mechanistic interpretability has still failed to do impressive things by reverse-engineering networks, and (2) from a safety standpoint it is entirely fungible with other techniques that often do better at various things.
Several people seem to be coming to similar conclusions recently (e.g., this recent post).
I’ll add that I have as well and wrote a sequence about it :)
Thanks for the reply. This sounds reasonable to me. On the last point, I tried my best to do that here, and I think there is a relatively high ratio of solid explanations to shaky ones.
Overall, I think that the hopes you have for interpretability research are good, and I hope it works out. One of my biggest concerns, though, is that people seem to have been offering similar takes with little change for 7+ years, and I just don’t think the wins from this research have been commensurate with the effort put into it. I assume this is expected under your views, so it’s probably not a crux.
I get the impression of a certain motte-and-bailey quality in this comment and similar arguments. At a high level, the notion of better understanding what neural networks are doing would be great. The problem, though, is that most state-of-the-art interpretability research does not seem to be doing this in a way that will be useful for safety anytime soon. In that sense, I think this comment talks past the points that this post is trying to make.
I think this is very exciting, and I’ll look forward to seeing how it goes!
Thanks, we will consider adding each of these. We appreciate you taking a look and taking the time to suggest them!
No, I don’t think the core advantages of transparency are really unique to RLHF, but in the paper, we list certain things that are specific to RLHF which we think should be disclosed. Thanks.
Thanks, and +1 to adding the resources. Also, Charbel-Raphael, who authored the in-depth post, is one of the authors of this paper! That post in particular was something we paid attention to when designing the paper.
Open Problems and Fundamental Limitations of RLHF
This is exciting to see. I think this solution is impressive, and I think the case for the structure you find is compelling. It’s also nice that this solution goes a little further in one aspect than the previous one. The analysis with the bars seems to get a little closer to a question I have had since the last solution.
My one critique of this solution is that I would have liked to see an explanation of why the transformer only seems to make mistakes near the parts of the domain where there are curved boundaries between regimes (see the figure above with the colored curves), while the network did a great job of learning the periodic part of the solution that produces the irregularly spaced horizontal bars. Understanding why this is the case seems interesting but remains unsolved.
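For what it’s worth, a quick way to probe this question would be to compare the network’s predictions against the labeling function on a dense grid and plot where the disagreements fall, to see whether they really do cluster along the curved boundaries rather than the bars. The sketch below is only illustrative: `true_label` and `model_predict` are stand-ins I made up, not the actual challenge function or the trained transformer.

```python
# Rough sketch (with placeholder functions, not the actual challenge setup) of
# visualizing where a model's mistakes fall over a 2D input domain.
import numpy as np
import matplotlib.pyplot as plt

def true_label(x, y):
    """Stand-in for the challenge's labeling function (a periodic part plus a curved boundary)."""
    return ((np.sin(6 * y) > 0) ^ (y > np.cos(3 * x))).astype(int)

def model_predict(x, y):
    """Stand-in for the trained network's predictions on grid points (noisy near the curve)."""
    noise = 0.1 * np.random.randn(*x.shape)
    return ((np.sin(6 * y) > 0) ^ (y + noise > np.cos(3 * x))).astype(int)

# Evaluate both on a dense grid over an assumed [0, 1] x [0, 1] domain.
xs, ys = np.meshgrid(np.linspace(0, 1, 400), np.linspace(0, 1, 400))
errors = model_predict(xs, ys) != true_label(xs, ys)

# Red pixels mark disagreements; clustering along the curve would support the
# observation that mistakes concentrate near the curved regime boundaries.
plt.imshow(errors, origin="lower", extent=(0, 1, 0, 1), cmap="Reds")
plt.title("Where the model disagrees with the labeling function")
plt.xlabel("x")
plt.ylabel("y")
plt.show()
```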
I think this work gives a somewhat more granular idea of what might be happening, and it’s an interesting foil to the other solution: the two came up with fairly different pictures of the same process. The differences between these two projects seem like an interesting case study in MI. I’ll probably refer to this a lot in the future.
Overall, I think this is great, and although the challenge is over, I’m adding this to the GitHub readme. And if you let me know a high-impact charity you’d like to support, I’ll send $500 to it as a similar prize for the challenge :)
A Short Memo on AI Interpretability Rainbows
Examples of Prompts that Make GPT-4 Output Falsehoods
Sounds right, but the problem seems to be semantic. If understanding is taken to mean a human’s comprehension, then I think this is perfectly right. But since the method is mechanistic, it seems difficult nonetheless.
Thanks—I agree that this seems like an approach worth doing. I think that at CHAI and/or Redwood there is a little bit of work at least related to this, but don’t quote me on that. In general, it seems like if you have a model and then a smaller distilled/otherwise-compressed version of it, there is a lot you can do with them from an alignment perspective. I am not sure how much work has been done in the anomaly detection literature that involves distillation/compression.
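To make that concrete, here is a minimal sketch (my own, not from any of the work mentioned above) of one simple thing you could do with a model and a compressed copy of it: treat their disagreement on an input as an anomaly score. The modules, sizes, and threshold below are placeholders; in practice you would load a real model and a student actually distilled from it on the training distribution.

```python
# Minimal sketch: score inputs by how much a model and a smaller distilled copy
# of it disagree. The two modules here are random stand-ins for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 10))
student = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 10))  # compressed copy

@torch.no_grad()
def disagreement_score(x: torch.Tensor) -> torch.Tensor:
    """Per-example KL(teacher || student) over the output distributions.

    Intuition: distillation makes the student match the teacher on-distribution,
    so unusually large disagreement is some evidence that an input is anomalous
    or relies on capacity the student lacks.
    """
    t_logp = F.log_softmax(teacher(x), dim=-1)
    s_logp = F.log_softmax(student(x), dim=-1)
    return (t_logp.exp() * (t_logp - s_logp)).sum(dim=-1)

# Flag the most-disagreed-upon inputs in a batch as candidate anomalies.
batch = torch.randn(32, 16)
scores = disagreement_score(batch)
threshold = scores.mean() + 2 * scores.std()  # arbitrary illustrative cutoff
print(torch.nonzero(scores > threshold).flatten())
```

Whether this kind of signal is actually useful would depend on how the student was trained and what counts as anomalous for the setting, which is exactly the sort of thing the anomaly detection literature I mentioned would speak to.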
A good critical paper about potentially risky industry norms is this one.