Karma: 1,009

# Causal scrubbing: results on induction heads

3 Dec 2022 0:59 UTC
30 points

# Causal scrubbing: results on a paren balance checker

3 Dec 2022 0:59 UTC
24 points

# Causal scrubbing: Appendix

3 Dec 2022 0:58 UTC
13 points

# Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]

3 Dec 2022 0:58 UTC
76 points
• 1 Dec 2022 20:57 UTC
LW: 1 AF: 1

Well, I’d keep everything in log space and do the whole thing with log_sum_exp for numerical stability, but yeah.

EDIT: e.g. something like:

```python
import torch
import torch.nn.functional as F

def cross_entropy_loss(Z, C):
    # log_softmax works in log space (log_sum_exp under the hood),
    # which is what keeps this numerically stable.
    return -torch.sum(F.log_softmax(Z, dim=-1) * C)
```
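To see why staying in log space matters, here is a minimal NumPy sketch of the same idea (the function names and example values here are illustrative, not from the comment's torch code): a naive softmax of large logits overflows, while the log-sum-exp form stays finite.

```python
import numpy as np

def log_softmax(z):
    # Subtract the max first so exp() never overflows, then apply the
    # log-sum-exp identity: log_softmax(z) = z - logsumexp(z).
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def cross_entropy(z, c):
    # c is a target distribution over the same support as the logits z.
    return -(log_softmax(z) * c).sum()

logits = np.array([1000.0, 1001.0, 1002.0])  # naive np.exp(logits) overflows
target = np.array([0.0, 0.0, 1.0])
loss = cross_entropy(logits, target)  # finite, ~0.4076
```

Computing `np.exp(logits)` directly would produce `inf`, so any loss built on the naive softmax would be garbage for logits of this size.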

• 1 Dec 2022 20:42 UTC
LW: 1 AF: 1

The plateau corresponds to a period with a lot of bump formation. If bumps really are a sign of vectors competing to represent different chunks of subspace then maybe this says that Adam produces more such competition (maybe by making different vectors learn at more similar rates?).

I caution against over-interpreting the results of single runs—I think there’s a good chance the number of bumps varies significantly by random seed.

• 1 Dec 2022 20:38 UTC
LW: 1 AF: 1

What happens in a cross-entropy loss style setup, rather than MSE loss? IMO cross-entropy loss is a better analogue to real networks. Though I’m confused about the right way to model an internal sub-circuit of the model. I think the exponential decay term just isn’t there?

There are lots of ways to do this, but the obvious one is to flatten C and Z and treat them as logits.
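For concreteness, the "flatten and treat as logits" recipe might look like the following sketch. The shapes and the normalization of C are assumptions for illustration (in the toy-model setup they would be something like the target matrix and the model's reconstruction), not a prescription from the comment.

```python
import numpy as np

def log_softmax(z):
    # Stable log-softmax via the log-sum-exp identity.
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

# Hypothetical toy setup: C is a fixed nonnegative target matrix,
# Z is the model's output of the same shape.
rng = np.random.default_rng(0)
C = rng.random((4, 4))
Z = rng.standard_normal((4, 4))

# Flatten both; normalize C into a distribution and treat Z as logits.
c = (C / C.sum()).ravel()
loss = -(log_softmax(Z.ravel()) * c).sum()
```

Unlike MSE, this loss only cares about Z up to an additive constant, which is one way the "exponential decay toward the target" picture could change.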

• 1 Dec 2022 20:22 UTC
LW: 2 AF: 2

Oh, huh, that makes a lot of sense! I’ll see if I can reproduce these results.

For example, from a rank-4 case with a bump, here’s the inner product of two learned vectors with a single target vector.

I’m not sure this explains the grokking bumps from the mod add stuff—I’m not sure what the “competition” should be, given that we see the bumps on every key frequency.

• 1 Dec 2022 20:21 UTC
LW: 2 AF: 2

How does this interact with Adam? In particular, Adam gets super messy because you can’t just disentangle things. Even worse, how does it interact with AdamW?

Should be trivial to modify my code to use AdamW: just replace SGD with AdamW on line 33.

EDIT: ran the experiments for rank 1, they seem a bit different than Adam Jermyn’s results—it looks like AdamW just accelerates things?

• 1 Dec 2022 20:20 UTC
LW: 2 AF: 2
in reply to: Neel Nanda’s comment

(Adam Jermyn ninja’ed my rank 2 results as I forgot to refresh, lol)

Weight decay just means the gradient becomes $\nabla_w L + \lambda w$, which effectively “extends” the exponential phase. It’s pretty easy to confirm that this is the case:

You can see the other figures from the main post here:
https://imgchest.com/p/9p4nl6vb7nq

(Lighter color shows loss curve for each of 10 random seeds.)

Here’s my code for the weight decay experiments if anyone wants to play with them or check that I didn’t mess something up: https://gist.github.com/Chanlaw/e8c286629e0626f723a20cef027665d1
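A minimal sketch of what the extra $\lambda w$ term does (this is an illustrative quadratic toy, deliberately simpler than the rank-factorization setup in the linked gist, and none of these names come from it): with weight decay, gradient descent converges to a shrunken version of the target rather than the target itself.

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.standard_normal(16)

def train(weight_decay, lr=0.1, steps=2000):
    # Plain gradient descent on L(w) = 0.5 * ||w - target||^2,
    # with weight decay adding lambda * w to the gradient.
    w = 0.01 * rng.standard_normal(16)
    for _ in range(steps):
        grad = (w - target) + weight_decay * w  # dL/dw plus lambda * w
        w = w - lr * grad
    return w

w_plain = train(weight_decay=0.0)  # converges to target
w_decay = train(weight_decay=0.1)  # converges to target / (1 + lambda)
```

Setting the gradient to zero gives the fixed point $w^* = \text{target}/(1+\lambda)$, so the decayed run lands strictly closer to the origin.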

• In a sense, I think all math papers that focus on definitions (as opposed to proofs) feel like this.

I suspect one of the reasons OP feels dissatisfied about the corrigibility paper is that it is not the equivalent of Shannon’s seminal results, which generally gave the correct definitions of terms, but instead merely gestures at a problem (“we have no idea how to formalize corrigibility!”).

That being said, I resonate a lot with this part of the reply:

> Proofs [in conceptual/definition papers] are correct but trivial, so definitions are the real contribution, but applicability of definitions to the real world seems questionable. Proof-focused papers feel different because they are about accepted definitions whose applicability to the real world is not in question.

• 25 Nov 2022 11:03 UTC
3 points
in reply to: Rohin Shah’s comment

I feel like most of the barrier in practice for people not “coordinating” in the relevant ways is people not knowing what other people are doing. And a big reason for this is that writeups are really hard to write, especially if you have high standards and don’t want to ship.

And yeah, better communication tech in general would be good, but I’m not sure how to start on that (whereas it’s pretty obvious what a few candidate steps toward making posts/papers easier to write/ship would look like).

• If you have that belief, I imagine this paper should update you more towards AI capabilities.

I do believe David Chapman’s tweet though! I don’t think you can just hotwire together a bunch of modules that are superhuman only in narrow domains, and get a powerful generalist agent, without doing a lot of work in the middle.

(That being said, I don’t count gluing a Python interpreter and a retrieval mechanism onto a fine-tuned GPT-3 or whatever as falling into this category; here the work is done by GPT-3 (a generalist agent) and the other parts primarily augment its capabilities.)

• I don’t see why it should update my beliefs a non-negligible amount? I expected techniques like this to work for a wide variety of specific tasks given enough effort (indeed, stacking together 5 different techniques into a specialist agent is what a lot of academic work in robotics looks like). I also think that the way people can compose text-davinci-002 or other LMs with themselves into more generalist agents basically should screen off this evidence, even if you weren’t expecting to see it.