Progress Report 5: tying it together

Previous: Progress report 4

First, an example of what not to do if you want humanity to survive: make an even foom-ier and less interpretable version of neural nets. On the spectrum of good idea to bad idea, this one is way worse than neuromorphic computing. In fact, the authors even include a paragraph in their paper contrasting their method with Hebbian learning and showing that their method is more volatile and unpredictable. Great. https://arxiv.org/abs/2202.05780

On the flip side of the coin, here’s some good stuff that I’m happy to see being worked on.

Some other work that has been done with nostalgebraist’s logit-lens: Looking for Grammar in all the Right Places, by Alethea Power

Tom Frederik is working on a project called ‘Unseal’ https://github.com/TomFrederik/unseal (to unseal the mystery of transformers), expanding on the PySvelte library to include aspects of the logit-lens idea.
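For anyone who hasn't run into the logit-lens idea mentioned above, here is a minimal toy sketch of it in plain Python: take the hidden state after each layer, project it through the unembedding matrix, and see which token the model would predict if it stopped there. Everything here is an assumption for illustration (the `logit_lens` function, the tiny vocabulary, the hand-made matrices), it skips the final layer norm that a real implementation would usually apply, and it is not the Unseal or PySvelte API.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_lens(hidden_states, unembed, vocab):
    """For each layer's hidden state, return (top token, its probability).

    hidden_states: list of vectors, one per layer (hypothetical toy data)
    unembed: one row per vocabulary entry, each of length d_model
    """
    readout = []
    for h in hidden_states:
        # Project the intermediate hidden state straight into vocabulary space.
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembed]
        probs = softmax(logits)
        best = max(range(len(vocab)), key=probs.__getitem__)
        readout.append((vocab[best], probs[best]))
    return readout

# Toy example: 3 "layers", 2-dimensional residual stream, 3-token vocabulary.
vocab = ["cat", "dog", "fish"]
unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
hidden_states = [[0.2, 0.1], [0.5, 1.5], [0.1, 3.0]]

lens = logit_lens(hidden_states, unembed, vocab)
for layer, (token, prob) in enumerate(lens):
    print(f"layer {layer}: top token = {token!r}, p = {prob:.3f}")
```

In a real model the hidden states would come from hooks on the residual stream, and the per-layer readout is the kind of thing that could be shown as one 'view' among several.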

Inspired by Unseal, I started my own fork of the library to work on an interactive visualization that creates an open-ended framework for tying together multiple ways of interpreting models. I’m hoping to use it to put together the stuff I’ve found or made so far, as well as whatever I find or make next. It just seems generally useful to have some way to put a bunch of different ‘views’ of the model side-by-side, so that I can scan across all of them and get a more complete picture of what the model is doing.

Once I get this tool a bit more fleshed out, I plan to start trying to play the audit game with it, using toy models that have been edited.

Here’s a presentation of my ideas I made for the conclusion of the AGI safety fundamentals course. https://youtu.be/FarRgPwBpGU

Among the things I’m thinking of putting in are ideas related to these papers.

One: A paper mentioned by jsteinhardt here https://www.lesswrong.com/posts/qAhT2qvKXboXqLk4e/early-2022-paper-round-up : Summarizing Differences between Text Distributions with Natural Language (w/ Ruiqi Zhong, Charlie Snell, Dan Klein)

This paper discusses their language data summarization technique, which I think will be cool to use someday. Along the way to building it, the authors also had to do some clustering, which sounds like a useful thing to include in my visualization and to contrast with the neuron-importance topic clustering. I also hope to revisit and improve on the neuron-importance clustering. I think that if I repeat the neuron-importance sampling on topic-labeled samples, I’ll be able to tag the resulting clusters with the topics they most strongly relate to. That will make them more useful for interpretation.
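A rough sketch of the cluster-tagging step, assuming the clustering has already been run and each sample has both a cluster id and a topic label. The function name `tag_clusters` and the toy data are made up for illustration; this just picks the majority topic per cluster and reports how pure the cluster is, which is one simple way to do the tagging and not necessarily the one I'll end up using.

```python
from collections import Counter, defaultdict

def tag_clusters(cluster_ids, topic_labels):
    """Map each cluster id to (most common topic, fraction of members with it)."""
    by_cluster = defaultdict(Counter)
    for cid, topic in zip(cluster_ids, topic_labels):
        by_cluster[cid][topic] += 1

    tags = {}
    for cid, counts in by_cluster.items():
        topic, n = counts.most_common(1)[0]
        # The fraction acts as a crude purity score for the tag.
        tags[cid] = (topic, n / sum(counts.values()))
    return tags

# Hypothetical toy data: six samples, already assigned to three clusters.
cluster_ids = [0, 0, 0, 1, 1, 2]
topic_labels = ["sports", "sports", "law", "law", "law", "cooking"]

tags = tag_clusters(cluster_ids, topic_labels)
for cid in sorted(tags):
    topic, purity = tags[cid]
    print(f"cluster {cid}: {topic} (purity {purity:.2f})")
```

A low purity score would be a hint that a cluster mixes several topics and that its tag shouldn't be trusted much.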

Two: https://arxiv.org/abs/2203.14680 Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space Mor Geva, Avi Caciularu, Kevin Ro Wang, Yoav Goldberg

Three: What does BERT dream of? Deep dream with text https://www.gwern.net/docs/www/pair-code.github.io/c331351a690011a2a37f7ee1c75bf771f01df3a3.html

Seems neat and sorta related. I can probably figure out some way to add a version of this text-deep-dreaming to the laundry list of ‘windows into interpretability’ I’m accumulating.