scasper (Stephen Casper)

Karma: 1,565

https://stephencasper.com/

scasper 26 Mar 2024 16:05 UTC
LW: 3 AF: 1
0
AF
on: Third-party testing as a key ingredient of AI policy
Thanks for the useful post. There are a lot of things to like about this. Here are a few questions/comments.

First, I appreciate the transparency around this point.
we advocate for what we see as the ‘minimal viable policy’ for creating a good AI ecosystem, and we will be open to feedback.
Second, I have a small question. Why the use of the word “testing” instead of “auditing”? In my experience, most of the conversations lately about this kind of thing revolve around the term “auditing.”
Third, I wanted to note that this post does not talk about access levels for testers and ask why this is the case. My thoughts here relate to this recent work. I would have been excited to see (1) a commitment to providing auditors with methodological details, documentation, and data or (2) a commitment to working on APIs and SREs that can allow for auditors to be securely given grey- and white-box access.

scasper 14 Mar 2024 16:35 UTC
3 points
0
in reply to: David Gross’s comment on: I was raised by devout Mormons, AMA &| Offer Advice
I think one thing that’s pretty cool is “home teaching.” Mormon congregation members who are able are assigned various other members of the congregation to check in on. This often involves a monthly visit to their house, talking for a bit, sharing some spiritual thoughts, etc. The nice thing about it is that home teaching sometimes really benefits people who need it. Old or disabled people especially get nice visits, and home teachers often help them with stuff. In my experience, Mormons within a congregation are pretty good at helping each other with misc. things in life (e.g. fixing a broken water heater), and this is largely done through home teaching.

scasper 22 Feb 2024 23:47 UTC
LW: 8 AF: 5
3
AF
in reply to: Rohin Shah’s comment on: Analogies between scaling labs and misaligned superintelligent AI
Thanks. I agree that the points apply to individual researchers. But I don’t think they apply in a comparably worrisome way, because individual researchers do not have intelligence, money, or power comparable to the labs’. This is me stressing the “when put under great optimization pressure” part of Goodhart’s Law. Subtle misalignments are much less dangerous when there is a weak optimization force behind the proxy than when there is a strong one.

scasper 21 Feb 2024 19:59 UTC
LW: 6 AF: 4
4
AF
on: Analogies between scaling labs and misaligned superintelligent AI
See also this much older and closely related post by Thomas Woodside: Is EA an advanced, planning, strategically-aware power-seeking misaligned mesa-optimizer?

Analogies between scaling labs and misaligned superintelligent AI

scasper 21 Feb 2024 19:29 UTC

72 points

4 comments, 4 min read, LW link

scasper 26 Jan 2024 1:54 UTC
LW: 3 AF: 2
0
AF
in reply to: ryan_greenblatt’s comment on: The case for ensuring that powerful AIs are controlled
Thanks for the reply. I think this resolves the thread and puts us on the same page :)

scasper 26 Jan 2024 0:31 UTC
LW: 6 AF: 5
1
AF
on: The case for ensuring that powerful AIs are controlled
I have been thinking a lot lately about evals and what difference black- vs. white-box access makes for evaluators. I was confused about the appendix on black-box evals. My best attempt to make sense of it hinges on two instances of the authors intending something different from what was written.
First, the appendix hinges somewhat on this point.
For now, we think the easiest way to build control techniques that avoid making incorrect assumptions about scheming models is to make conservative assumptions about their internals, by rejecting any strategy which can be evaded by a red-team-constructed model.
I agree with this, but saying that “control evals should be black-box” is a non sequitur. The quoted passage makes the point that bad reverse-engineering of the model is bad, and it seems to implicitly assume that good reverse-engineering is hard/unlikely.
Second, even if good reverse-engineering is hard/unlikely, reverse-engineering is not the definition of a white-box eval. There are a lot of things that you can do with white-box access for evals other than interpretability-based things: gradient-based attacks, gradient-guided attacks, latent-space attacks, and finetuning—and all of these things actually seem plausibly useful for control evals. Plus I don’t see why interp should be off limits for evaluators who know the limitations of the tools they are using.
Ultimately, white-box attacks offer strictly more options for any type of eval compared to black-box ones. For any property that an AI system might have, if there are two teams of competent people trying to verify if the system has that property, all else equal, I’d bet on the success of the team that has more access.
So if I understand things correctly, and if the intended point of this appendix is that “Interpretability-based evals for control seem unreliable and possibly could lead to pitfalls,” I would recommend just saying that more directly.

scasper 9 Dec 2023 3:56 UTC
LW: 2 AF: 1
0
AF
in reply to: Gabe M’s comment on: Deep Forgetting & Unlearning for Safely-Scoped LLMs
Thanks!

I intuit that what you mentioned as a feature might also be a bug. I think that practical forgetting/unlearning that might make us safer would probably involve subjects of expertise like biotech. And if so, then we would want benchmarks that measure a method’s ability to forget/unlearn just the things key to that domain and nothing else. For example, if a method succeeds in unlearning biotech but makes the target LM also unlearn math and physics, then we should be concerned about that, and we probably want benchmarks to help us quantify that.

I could imagine an unlearning benchmark, for example, with $n$ textbooks and $n$ AP tests. Then for each of $k$ different knowledge-recovery strategies, one could construct the $n \times n$ grid of how well the model performs on each target test for each unlearning textbook.
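
A minimal sketch of what such a benchmark's bookkeeping might look like, not an implementation from the post or any existing benchmark. The names `build_grids`, `summarize`, and `toy_score` are hypothetical, and the scores are made-up placeholders standing in for real model evaluations. Each recovery strategy gets its own $n \times n$ grid, with rows indexed by the unlearned textbook and columns by the test.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical scorer signature: given the subject whose textbook was unlearned,
# the subject of the test, and a knowledge-recovery strategy, return accuracy in [0, 1].
Scorer = Callable[[str, str, str], float]
Grid = List[List[float]]


def build_grids(subjects: List[str], strategies: List[str], score: Scorer) -> Dict[str, Grid]:
    """Build one n x n grid per recovery strategy (rows: unlearned textbook, columns: test)."""
    return {
        strategy: [[score(unlearned, tested, strategy) for tested in subjects]
                   for unlearned in subjects]
        for strategy in strategies
    }


def summarize(grid: Grid) -> Tuple[float, float]:
    """Return (mean on-target score, mean off-target score) for one grid; needs n >= 2."""
    n = len(grid)
    on_target = sum(grid[i][i] for i in range(n)) / n
    off_target = sum(grid[i][j] for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    return on_target, off_target


def toy_score(unlearned: str, tested: str, strategy: str) -> float:
    """Made-up numbers standing in for real model evaluations (an assumption)."""
    if unlearned != tested:
        return 0.9  # unrelated knowledge is retained
    return 0.2 if strategy == "none" else 0.5  # finetuning recovers some knowledge


subjects = ["biology", "math", "physics"]
strategies = ["none", "finetune"]
grids = build_grids(subjects, strategies, toy_score)
summary = {name: summarize(grid) for name, grid in grids.items()}
```

Under this layout, a well-scoped unlearning method would show low diagonal (on-target) scores that stay low across recovery strategies, alongside high off-diagonal (off-target) scores.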

scasper 6 Dec 2023 4:20 UTC
LW: 4 AF: 2
0
AF
in reply to: Thomas Kwa’s comment on: Deep Forgetting & Unlearning for Safely-Scoped LLMs
+1, I’ll add this and credit you.

scasper 5 Dec 2023 21:09 UTC
7 points
5
in reply to: aogara’s comment on: Deep Forgetting & Unlearning for Safely-Scoped LLMs
+1
That said, the NeurIPS challenge and prior ML literature on forgetting and influence functions seem worth keeping on the radar because they’re still closely related to the challenges here.

scasper 5 Dec 2023 20:10 UTC
3 points
0
in reply to: Bogdan Ionut Cirstea’s comment on: Deep Forgetting & Unlearning for Safely-Scoped LLMs
Thanks! I edited the post to add a link to this.

Deep Forgetting & Unlearning for Safely-Scoped LLMs

scasper 5 Dec 2023 16:48 UTC

109 points

29 comments, 13 min read, LW link

Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation

Soroush Pour, rusheb, Quentin FEUILLADE--MONTIXI, Arush and scasper

7 Nov 2023 17:59 UTC

36 points

2 comments, 2 min read, LW link

(arxiv.org)

scasper 4 Nov 2023 23:59 UTC
5 points
0
in reply to: lukehmiles’s comment on: The 6D effect: When companies take risks, one email can be very powerful.
A good critical paper about potentially risky industry norms is this one.

The 6D effect: When companies take risks, one email can be very powerful.

scasper 4 Nov 2023 20:08 UTC

261 points

40 comments, 3 min read, LW link

Announcing the CNN Interpretability Competition

scasper 26 Sep 2023 16:21 UTC

22 points

0 comments, 4 min read, LW link

scasper 21 Sep 2023 6:37 UTC
6 points
0
in reply to: LawrenceC’s comment on: Interpretability Externalities Case Study—Hungry Hungry Hippos
Thanks
To use your argument, what does MI actually do here?
The inspiration, I would suppose. Analogous to the type claimed in the HHH and hyena papers.
And yes to your second point.

scasper 20 Sep 2023 16:23 UTC
13 points
6
on: Interpretability Externalities Case Study—Hungry Hungry Hippos
Nice post. I think it serves as a good example of how the hand-waviness around how interpretability can help us do good things with AI goes both ways.
I’m particularly worried about MI people studying instances of when LLMs do and don’t express types of situational awareness and then someone using these insights to give LLMs much stronger situational awareness abilities.
Lastly,
On the other hand, interpretability research is probably crucial for AI alignment.
I don’t think this is true and I especially hope it is not true because (1) mechanistic interpretability still fails to do impressive things by trying to reverse engineer networks and (2) it is entirely fungible from a safety standpoint with other techniques that often do better for various things.

scasper 29 Aug 2023 21:50 UTC
LW: 2 AF: 2
1
AF
on: Barriers to Mechanistic Interpretability for AGI Safety
Several people seem to be coming to similar conclusions recently (e.g., this recent post).
I’ll add that I have as well and wrote a sequence about it :)

scasper 18 Aug 2023 16:37 UTC
5 points
3
in reply to: Richard_Ngo’s comment on: Against Almost Every Theory of Impact of Interpretability
Thanks for the reply. This sounds reasonable to me. On the last point, I tried my best to do that here, and I think there is a relatively high ratio of solid explanations to unsolid ones.
Overall, I think that the hopes you have for interpretability research are good, and I hope it works out. One of my biggest concerns, though, is that people seem to have been offering similar takes with little change for 7+ years, and I just don’t think the wins from this research have been commensurate with the effort put into it. I assume this is expected under your views, so it is probably not a crux.
