I wonder if this is simply the result of the generally bad SWE/CS job market right now. People who would otherwise be in big tech or other AI work will be more inclined to do something with alignment.
This is roughly my situation. Waymo froze hiring and had layoffs while continuing to raise output expectations, so I/we had more work. I left in March to explore AI and landed on mechanistic interpretability research.
For any potential funders reading this: I'd be open to starting an interpretability lab and would love to chat. I've been full-time on mechanistic interpretability for about 4 months; here is some of my work: https://www.lesswrong.com/posts/vGCWzxP8ccAfqsrS3/thoughts-about-the-mechanistic-interpretability-challenge-2
I have a few PhD friends working in software jobs they don't like who would be interested in joining me for a year or longer if funding were in place (even just for the trial period Marius proposes).
My very quick take: interpretability has yet to fully understand even small language models, and that is a valuable direction to focus on next. (More details here: https://www.lesswrong.com/posts/ESaTDKcvGdDPT57RW/seeking-feedback-on-my-mechanistic-interpretability-research )
For any potential cofounders reading this: I have applied to a few incubators and VC funds, so far without success. I think some of those applications would be stronger with a cofounder. If you are potentially interested in cofounding an interpretability startup and live in the Bay Area, I'd love to meet for coffee, see whether we share a vision, and potentially apply to some of these incubators together.