scasper (Stephen Casper)
Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation
The 6D effect: When companies take risks, one email can be very powerful.
Announcing the CNN Interpretability Competition
Thanks
To use your argument, what does MI actually do here?
The inspiration, I would suppose. Analogous to the type claimed in the HHH and hyena papers.
And yes to your second point.
Nice post. I think it can serve as a good example of how the hand-waviness about how interpretability can help us do good things with AI cuts both ways.
I’m particularly worried about MI people studying instances of when LLMs do and don’t exhibit certain types of situational awareness, and then someone using those insights to give LLMs much stronger situational awareness abilities.
Lastly, regarding the claim that "interpretability research is probably crucial for AI alignment":
I don’t think this is true, and I especially hope it is not, because (1) mechanistic interpretability has still failed to do impressive things by reverse-engineering networks, and (2) from a safety standpoint it is entirely fungible with other techniques that often do better at various things.
Several people seem to be coming to similar conclusions recently (e.g., this recent post).
I’ll add that I have as well and wrote a sequence about it :)
Thanks for the reply. This sounds reasonable to me. On the last point, I tried my best to do that here, and I think there is a relatively high ratio of solid explanations to shaky ones.
Overall, I think that the hopes you have for interpretability research are good, and I hope it works out. One of my biggest concerns, though, is that people seem to have been offering similar takes with little change for 7+ years, and I just don’t think the wins from this research have been commensurate with the effort put into it. I assume this is expected under your views, so it’s probably not a crux.
I get the impression of a certain motte-and-bailey quality in this comment and similar arguments. At a high level, the notion of better understanding what neural networks are doing would be great. The problem, though, is that most state-of-the-art interpretability research does not seem to be doing this in a way that will be useful for safety anytime soon. In that sense, I think this comment talks past the points that this post is trying to make.
I think this is very exciting, and I’ll look forward to seeing how it goes!
Thanks, we will consider adding each of these. We appreciate you taking a look and taking the time to suggest them!
No, I don’t think the core advantages of transparency are really unique to RLHF, but in the paper, we list certain things that are specific to RLHF which we think should be disclosed. Thanks.
Thanks, and +1 to adding the resources. Also, Charbel-Raphael, who authored the in-depth post, is one of the authors of this paper! That post in particular was something we paid attention to when designing the paper.
Open Problems and Fundamental Limitations of RLHF
This is exciting to see. I think this solution is impressive, and I think the case for the structure you find is compelling. It’s also nice that this solution goes a little further in one aspect than the previous one. The analysis with the bars seems to get a little closer to a question I have had since the last solution.
My one critique of this solution is that I would have liked to see an explanation of why the transformer only seems to make mistakes near the parts of the domain where there are curved boundaries between regimes (see the figure above with the colored curves), while the network did a great job of learning the periodic part of the solution that produces the irregularly spaced horizontal bars. Understanding why this is the case seems interesting but remains unsolved.
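For what it’s worth, a quick way to probe this question would be to compare the network’s predictions against the labeling function on a dense grid and plot where the disagreements fall, to see whether they really do cluster along the curved boundaries rather than the bars. The sketch below is only illustrative: `true_label` and `model_predict` are stand-ins I made up, not the actual challenge function or the trained transformer.

```python
# Rough sketch (with placeholder functions, not the actual challenge setup) of
# visualizing where a model's mistakes fall over a 2D input domain.
import numpy as np
import matplotlib.pyplot as plt

def true_label(x, y):
    """Stand-in for the challenge's labeling function (a periodic part plus a curved boundary)."""
    return ((np.sin(6 * y) > 0) ^ (y > np.cos(3 * x))).astype(int)

def model_predict(x, y):
    """Stand-in for the trained network's predictions on grid points (noisy near the curve)."""
    noise = 0.1 * np.random.randn(*x.shape)
    return ((np.sin(6 * y) > 0) ^ (y + noise > np.cos(3 * x))).astype(int)

# Evaluate both on a dense grid over an assumed [0, 1] x [0, 1] domain.
xs, ys = np.meshgrid(np.linspace(0, 1, 400), np.linspace(0, 1, 400))
errors = model_predict(xs, ys) != true_label(xs, ys)

# Red pixels mark disagreements; clustering along the curve would support the
# observation that mistakes concentrate near the curved regime boundaries.
plt.imshow(errors, origin="lower", extent=(0, 1, 0, 1), cmap="Reds")
plt.title("Where the model disagrees with the labeling function")
plt.xlabel("x")
plt.ylabel("y")
plt.show()
```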
I think this work gives a somewhat more granular idea of what might be happening, and it’s an interesting foil to the other solution: the two came up with fairly different pictures of the same process. The differences between these two projects seem like an interesting case study in MI. I’ll probably refer to this a lot in the future.
Overall, I think this is great, and although the challenge is over, I’m adding this to the GitHub readme. And if you let me know a high-impact charity you’d like to support, I’ll send $500 to it as a similar prize for the challenge :)
A Short Memo on AI Interpretability Rainbows
Examples of Prompts that Make GPT-4 Output Falsehoods
Sounds right, but the problem seems to be semantic. If understanding is taken to mean a human’s comprehension, then I think this is perfectly right. But since the method is mechanistic, it seems difficult nonetheless.
Thanks—I agree that this seems like an approach worth doing. I think that at CHAI and/or Redwood there is a little bit of work at least related to this, but don’t quote me on that. In general, it seems like if you have a model and then a smaller distilled/otherwise-compressed version of it, there is a lot you can do with them from an alignment perspective. I am not sure how much work has been done in the anomaly detection literature that involves distillation/compression.
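To make that concrete, here is a minimal sketch (my own, not from any of the work mentioned above) of one simple thing you could do with a model and a compressed copy of it: treat their disagreement on an input as an anomaly score. The modules, sizes, and threshold below are placeholders; in practice you would load a real model and a student actually distilled from it on the training distribution.

```python
# Minimal sketch: score inputs by how much a model and a smaller distilled copy
# of it disagree. The two modules here are random stand-ins for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 10))
student = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 10))  # compressed copy

@torch.no_grad()
def disagreement_score(x: torch.Tensor) -> torch.Tensor:
    """Per-example KL(teacher || student) over the output distributions.

    Intuition: distillation makes the student match the teacher on-distribution,
    so unusually large disagreement is some evidence that an input is anomalous
    or relies on capacity the student lacks.
    """
    t_logp = F.log_softmax(teacher(x), dim=-1)
    s_logp = F.log_softmax(student(x), dim=-1)
    return (t_logp.exp() * (t_logp - s_logp)).sum(dim=-1)

# Flag the most-disagreed-upon inputs in a batch as candidate anomalies.
batch = torch.randn(32, 16)
scores = disagreement_score(batch)
threshold = scores.mean() + 2 * scores.std()  # arbitrary illustrative cutoff
print(torch.nonzero(scores > threshold).flatten())
```

Whether this kind of signal is actually useful would depend on how the student was trained and what counts as anomalous for the setting, which is exactly the sort of thing the anomaly detection literature I mentioned would speak to.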
A good critical paper about potentially risky industry norms is this one.