Master Student in AI at University of Amsterdam
I wrote a small library for transformer interpretability: https://unseal.readthedocs.io/en/latest/
Right now I’m really interested in grokking and mechanistic interpretability more generally.
There is definitely something out there, just can’t recall the name. A keyword you might want to look for is “disentangled representations”.
One start would be the beta-VAE paper https://openreview.net/forum?id=Sy2fzU9gl
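In case the pointer helps: the beta-VAE objective is just the standard VAE ELBO with the KL term up-weighted by a factor beta > 1, which is what encourages disentangled latents. A minimal sketch of the loss, assuming a Gaussian encoder and a Bernoulli decoder (the beta=4 default is only illustrative):

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    """Negative ELBO with the KL term scaled by beta (beta > 1 pushes toward
    disentangled latents). Assumes q(z|x) = N(mu, exp(logvar)) and a Bernoulli
    decoder; swap the reconstruction term for other data types."""
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
    # KL( N(mu, sigma^2) || N(0, I) ) in closed form.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```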
Considering you get at least one free upvote from posting/commenting itself, you just have to be faster than the downvoters to generate money :P
Small nitpick:
The PCA plot uses the smallest version of GPT2, not the 1.5B parameter model (that would be GPT2-XL). The small model is significantly worse than the large one, so I would be hesitant to draw conclusions from that experiment alone.
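If anyone wants to re-run that PCA on the actual 1.5B model, swapping in the XL checkpoint is a small change with huggingface transformers. A rough sketch (the original post's exact pipeline likely differs, and the input text here is just a placeholder):

```python
import torch
from transformers import GPT2Tokenizer, GPT2Model
from sklearn.decomposition import PCA

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")  # the 1.5B model
model = GPT2Model.from_pretrained("gpt2-xl", output_hidden_states=True)

inputs = tokenizer("placeholder input text", return_tensors="pt")
with torch.no_grad():
    # Last layer's hidden states, shape (seq_len, 1600) for GPT2-XL.
    hidden = model(**inputs).hidden_states[-1][0]

projected = PCA(n_components=2).fit_transform(hidden.numpy())
```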
I want to second your first point. Texting frequently with significant others lets me feel like I'm part of their life and vice versa, which a weekly call does not accomplish, partly because it is weekly and partly because I am pretty averse to calls.
In one relationship, this led to significant misery on my part because my partner was pretty strict about their phone usage, batching messages for the mornings and evenings. For my current primary relationship, I'm convinced that frequent texting is what kept it alive while we were long-distance.
To reconcile the two viewpoints, I think it is still true that superficial relationships via social media likes or retweets are not worth that much if they are all there is to the relationship. But direct text messages are a significant improvement on that.
Re your blog post:
Maybe that's me being introverted, but there are probably significant differences in whether people feel comfortable with (or like) texting versus calling. For me, the instantaneousness of calling makes it much more stressful, and I do have a problem with people generalizing, in either direction, that one way of interacting over distance is superior in general. I do concede the point that calling is of course much higher bandwidth, but it also requires more time commitment and coordination.
I tried increasing weight decay and batch size, but so far no real success compared to 5x lr. Not going to investigate this further atm.
Oh, I thought figure 1 was S5, but it's actually modular division. I'll give that a go.
Here are the results for modular division. Not super sure what to make of them. Small increases in learning rate work, but so does just choosing a larger learning rate from the beginning. In fact, increasing the lr to 5x from the beginning works super well, but switching to 5x once grokking arguably starts just destroys any progress. 10x lr from the start does not work (nor does switching to it later).
So maybe the initial observation is more a general/global property of the loss landscape for the task and not of the particular region during grokking?
So I ran some experiments for the permutation group S_5 with the task x o y = ?
Interestingly, increasing the learning rate here just never works. I'm very confused.
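For concreteness, here is roughly how the composition data for S_5 can be generated (a sketch, not the exact code from my runs):

```python
from itertools import permutations

# All 120 elements of S_5 as tuples of images; composition (x o y)(i) = x(y(i)).
elems = list(permutations(range(5)))
index = {p: i for i, p in enumerate(elems)}

dataset = [
    (index[x], index[y], index[tuple(x[y[i]] for i in range(5))])
    for x in elems
    for y in elems
]
# 120 * 120 = 14400 (x, y, x o y) triples, to be split into train/val.
```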
I updated the report with the training curves. Under default settings, 100% training accuracy is reached after 500 steps.
There is actually an overlap between the train and val curves while they go up. That might be an artifact of the simplicity of the task, or of me not splitting the dataset properly (e.g. x+y being in train and y+x being in val). I might run it again on a harder task to verify.
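To be concrete about the split issue: splitting on unordered pairs would rule it out entirely. A sketch (the modulus and split fraction are placeholders, not necessarily what I used):

```python
import random

P = 97  # modulus for the modular-addition task (placeholder value)

# Split on unordered pairs so x+y and y+x can never end up on opposite sides.
pairs = [(x, y) for x in range(P) for y in range(x, P)]
random.seed(0)
random.shuffle(pairs)

cut = len(pairs) // 2
train = [(x, y, (x + y) % P) for x, y in pairs[:cut]]
val = [(x, y, (x + y) % P) for x, y in pairs[cut:]]

# If the model should still see both orderings, add the swapped copies
# within each split rather than across splits.
train += [(y, x, z) for x, y, z in train if x != y]
val += [(y, x, z) for x, y, z in val if x != y]
```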
Yep I used my own re-implementation, which somehow has slightly different behavior.
I’ll also note that the task in the report is modular addition while figure 1 from the paper (the one with the red and green lines for train/val) is the significantly harder permutation group task.
I’m not sure I understand.
I chose the grokking starting point as 300 steps, based on the yellow plot. I'd say it's reasonable to say that 'grokking is complete' by the 2000 step mark in the default setting, whereas it is complete by the 450 step mark in the 10x setting (assuming appropriate LR decay to avoid overshooting). Also note that the plots in the report are not log-scale.
It would be interesting to see if, once grokking had clearly started, you could just 100x the learning rate and speed up the convergence to zero validation loss by 100x.
I ran a quick-and-dirty experiment, and it does in fact look like you can just crank up the learning rate at the point where grokking starts happening and speed up convergence significantly. See the wandb report:
I set the LR to 5x the normal value (100x tanked the accuracy, 10x still works though). Of course you would want to anneal it after grokking was finished.
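The change itself is tiny. A sketch of how such a bump could be applied to a PyTorch optimizer (the helper is illustrative; the multiplier and trigger step come from eyeballing the curves, nothing principled):

```python
import torch

def bump_lr(optimizer: torch.optim.Optimizer, multiplier: float = 5.0) -> None:
    """Scale the learning rate of every param group in place.

    Call this once at the step where grokking appears to start (~300 in my
    runs); 5x worked best, 10x still worked, 100x tanked accuracy. Anneal
    the LR back down once validation accuracy saturates to avoid overshooting.
    """
    for group in optimizer.param_groups:
        group["lr"] *= multiplier
```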
Ah yes that makes sense to me. I’ll modify the post accordingly and probably write it in the basis formulation.
ETA: Fixed now; the computation takes a tiny bit longer but should hopefully still be readable to everyone.
Seems like this could be circumvented relatively easily by freezing gametes now.
Are you asking exclusively about "Machine Learning" systems, or also about GOFAI? E.g. I notice that you didn't include ELIZA in your database, but that was a hard-coded program, so maybe it doesn't match your criteria.
Trying to summarize your viewpoint, lmk if I'm missing something important:
1. Training self-organizing models on multi-modal input will lead to increased modularization and, in turn, to more interpretability.
2. Existing interpretability techniques might more or less transfer to self-organizing systems.
3. There is low-hanging fruit in applied interpretability that we could exploit, should we need it, in order to understand self-organizing systems.
(Not going into the specific proposals for the sake of brevity and clarity)
Out of curiosity, are you willing to share the papers you improved upon?
Interpretability will fail—future DL descendant is more of a black box, not less
It certainly makes interpretability harder, but it seems like the possible gain is also larger, making it a riskier bet overall. I'm not convinced that it decreases the expected value of interpretability research, though. Do you have a good intuition for why it would make interpretability less valuable, or at least lower its expected value relative to the increased risk of failure?
IRL/Value Learning is far more difficult than first appearances suggest, see #2
That’s not immediately clear to me. Could you elaborate?
Small nitpick: I would cite The Bitter Lesson in the beginning.
I can't speak to the option for remote work, but as a counterpoint, it seems very straightforward to get a UK visa for you and your spouse/children (at least straightforward relative to the US). The relevant visa to google is the Skilled Worker / Tier 2 visa if you want to know more.
ETA: Of course, there are still legitimate reasons for not wanting to move. Just wanted to point out that the legal barrier is lower than you might think.