scasper

Karma: 2,034

https://stephencasper.com/

scasper Jul 31, 2023, 9:41 PM
6 points
6
in reply to: Stephen McAleese’s comment on: Open Problems and Fundamental Limitations of RLHF
Thanks, and +1 to adding the resources. Also Charbel-Raphael who authored the in-depth post is one of the authors of this paper! That post in particular was something we paid attention to during the design of the paper.

Open Problems and Fundamental Limitations of RLHF

scasper Jul 31, 2023, 3:31 PM

66 points

6 comments, 2 min read, LW link

(arxiv.org)

scasper Jul 30, 2023, 11:55 PM
5 points
1
in reply to: RGRGRG’s comment on: Thoughts about the Mechanistic Interpretability Challenge #2 (EIS VII #2)

scasper Jul 29, 2023, 10:58 PM
4 points
2
on: Thoughts about the Mechanistic Interpretability Challenge #2 (EIS VII #2)
This is exciting to see. I think this solution is impressive, and I think the case for the structure you find is compelling. It’s also nice that this solution goes a little further in one aspect than the previous one. The analysis with bars seems to get a little closer to a question I have had since the last solution:
My one critique of this solution is that I would have liked to see an understanding of why the transformer only seems to make mistakes near the parts of the domain where there are curved boundaries between regimes (see fig above with the colored curves). Meanwhile, the network did a great job of learning the periodic part of the solution that led to irregularly-spaced horizontal bars. Understanding why this is the case seems interesting but remains unsolved.
I think this work gives a bit more of a granular idea of what might be happening. And I think it’s an interesting foil to the other one. Both came up with some fairly different pictures for the same process. The differences between these two projects seem like an interesting case study in MI. I’ll probably refer to this a lot in the future.
Overall, I think this is great, and although the challenge is over, I’m adding this to the github readme. And if you let me know a high-impact charity you’d like to support, I’ll send $500 to it as a similar prize for the challenge :)

A Short Memo on AI Interpretability Rainbows

scasper Jul 27, 2023, 11:05 PM

18 points

0 comments, 2 min read, LW link

Examples of Prompts that Make GPT-4 Output Falsehoods

scasper and Luke Bailey

Jul 22, 2023, 8:21 PM

21 points

5 comments, 6 min read, LW link

scasper Jul 22, 2023, 7:11 PM
LW: 1 AF: 1
2
AF
in reply to: davidad’s comment on: Seven Strategies for Tackling the Hard Part of the Alignment Problem
Sounds right, but the problem seems to be semantic. If understanding is taken to mean a human’s comprehension, then I think this is perfectly right. But since the method is mechanistic, it seems difficult nonetheless.

scasper Jul 21, 2023, 4:58 PM
LW: 2 AF: 2
1
AF
in reply to: lberglund’s comment on: Seven Strategies for Tackling the Hard Part of the Alignment Problem
Thanks—I agree that this seems like an approach worth doing. I think that at CHAI and/or Redwood there is a little bit of work at least related to this, but don’t quote me on that. In general, it seems like if you have a model and then a smaller distilled/otherwise-compressed version of it, there is a lot you can do with them from an alignment perspective. I am not sure how much work has been done in the anomaly detection literature that involves distillation/compression.

scasper Jul 13, 2023, 5:03 PM
1 point
0
in reply to: Oliver Daniels’s comment on: Seven Strategies for Tackling the Hard Part of the Alignment Problem
I don’t work on this, so grain of salt.
But wouldn’t this take the formal out of formal verification? If so, I am inclined to think about this as a form of ambitious mechanistic interpretability.

scasper Jul 12, 2023, 3:44 PM
LW: 1 AF: 1
0
AF
in reply to: Charlie Steiner’s comment on: Seven Strategies for Tackling the Hard Part of the Alignment Problem
I think this is a good point, thanks.

Eight Strategies for Tackling the Hard Part of the Alignment Problem

scasper Jul 8, 2023, 6:55 PM

42 points

11 comments, 7 min read, LW link

scasper Jun 10, 2023, 6:53 PM
1 point
0
in reply to: dr_s’s comment on: Takeaways from the Mechanistic Interpretability Challenges
There are existing tools like lucid/lucent, captum, transformerlens, and many others that make it easy to use certain types of interpretability tools. But there is no standard, broad interpretability coding toolkit. Given the large number of interpretability tools and how quickly methods become obsolete, I don’t expect one.

scasper Jun 10, 2023, 6:50 PM
5 points
−1
in reply to: Sheikh Abdur Raheem Ali’s comment on: Takeaways from the Mechanistic Interpretability Challenges
Thoughts of mine on this are here. In short, I have argued that toy problems, cherry-picking models/tasks, and a lack of scalability have contributed to mechanistic interpretability being relatively unproductive.

Takeaways from the Mechanistic Interpretability Challenges

scasper Jun 8, 2023, 6:56 PM

94 points

5 comments, 6 min read, LW link

scasper Jun 2, 2023, 9:16 PM
6 points
4
in reply to: Shmi’s comment on: Advice for Entering AI Safety Research
I think not. Maybe circuits-style mechanistic interpretability is though. I generally wouldn’t try dissuading people from getting involved in research on most AIS things.

Advice for Entering AI Safety Research

scasper Jun 2, 2023, 8:46 PM

26 points

2 comments, 5 min read, LW link

scasper May 25, 2023, 4:26 PM
LW: 1 AF: 1
0
AF
in reply to: Xander Davies’s comment on: EIS IX: Interpretability and Adversaries
We talked about this over DMs, but I’ll post a quick reply for the rest of the world. Thanks for the comment.
A lot of how this is interpreted depends on the exact definition of superposition that one uses and whether it applies to entire networks or single layers. But a key thing I want to highlight is that if a layer represents a certain set amount of information about an example, then the layer must have more information per neuron if it’s thin than if it’s wide. And that is the point I think that the Huang paper helps to make. The fact that deep and thin networks tend to be more robust suggests that representing information more densely w.r.t. neurons in a layer does not make these networks less robust than wide shallow nets.

scasper May 25, 2023, 4:20 PM
LW: 1 AF: 1
0
AF
in reply to: Adam Jermyn’s comment on: EIS IX: Interpretability and Adversaries
Thanks, +1 to the clarification value of this comment. I appreciate it. I did not have the tied weights in mind when writing this.

scasper May 5, 2023, 7:15 PM
LW: 2 AF: 2
1
AF
in reply to: Richard_Ngo’s comment on: EIS V: Blind Spots In AI Safety Interpretability Research
Thanks for the comment.
In general I think that having a deep understanding of small-scale mechanisms can pay off in many different and hard-to-predict ways.
This seems completely plausible to me. But I think that it’s a little hand-wavy. In general, I perceive the interpretability agendas that don’t involve applied work to be this way. Also, few people would dispute that basic insights, to the extent that they are truly explanatory, can be valuable. But I think it is at least very non-obvious that it would be differentially useful for safety.
there are a huge number of cases in science where solving toy problems has led to theories that help solve real-world problems.
No qualms here. But (1) the point about program synthesis/induction/translation suggests that the toy problems are fundamentally more tractable than real ones. Analogously, imagine proposing that humans writing and studying simple algorithms for search, modular addition, etc. be part of an agenda for program synthesis. (2) At some point the toy work should lead to competitive engineering work. I think that there has not been a clear trend toward this in the past 6 years with the circuits agenda.
I can kinda see the intuition here, but could you explain why we shouldn’t expect this to generalize?
Thanks for the question. It might generalize. My intended point with the Ramanujan paper is that a subnetwork seeming to do something in isolation does not mean that it does that thing in context. Ramanujan et al. weren’t interpreting networks; they were just training the networks. So the underlying subnetworks may generalize well, but in this case, this is not interpretability work any more than just gradient-based training of a sparse network is.

scasper Apr 14, 2023, 8:03 PM
2 points
1
in reply to: M. Y. Zuo’s comment on: GPT-4 is easily controlled/exploited with tricky decision theoretic dilemmas.
I just went by what it said. But I agree with your point. It’s probably best modeled as a predictor in this case—not an agent.
