“We are computer scientists. We do not lack in faith.” (Ketan Mulmuley)
MadHatter
Various thoughts:
It would make a lot of sense to me if norepinephrine acted as a Q-like signal for negative rewards. I don’t have any neuroscience evidence for this, but it makes sense to me that negative rewards and positive rewards are very different for animals and would benefit from different approaches. I once ran some Q-learning experiments on the classic Taxi environment to see if I could make a satisficing agent (one that achieves a certain reward less than the maximum achievable and then rests). The agent responded by taking illegal actions that give highly negative rewards in the Taxi environment and hustling as hard as possible the rest of the time to achieve the reward specified. So I had to add a Q-function solely for negative rewards to get the desired behavior. Given that actual animals need to rest in a way that RL agents don’t have to in most environments, it makes sense to me that Q-learning on its own is not a good brain architecture.
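For concreteness, here is a minimal tabular sketch of the two-Q-function idea (not the original experiment; the names and the reward split are illustrative):

```python
# Sketch: separate Q-functions for positive and negative rewards, so a
# satisficing agent can stop chasing reward without walking into penalties.
import numpy as np

def update(Q_pos, Q_neg, s, a, r, s2, alpha=0.1, gamma=0.99):
    # Split the reward into its positive and negative parts and run an
    # ordinary Q-learning backup on each.
    r_pos, r_neg = max(r, 0.0), min(r, 0.0)
    Q_pos[s, a] += alpha * (r_pos + gamma * Q_pos[s2].max() - Q_pos[s, a])
    Q_neg[s, a] += alpha * (r_neg + gamma * Q_neg[s2].max() - Q_neg[s, a])

def act(Q_pos, Q_neg, s, satisfied):
    # Once the reward target is met, ignore positive prospects and act only
    # to avoid predicted punishment ("rest"); otherwise combine both signals.
    scores = Q_neg[s] if satisfied else Q_pos[s] + Q_neg[s]
    return int(np.argmax(scores))
```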
Dopamine receptors in V1 kind of make sense if you want to visually predict reward-like properties of objects in the environment. Like something could look tasty or not tasty, maybe.
It’s very, very rough, but: https://github.com/epurdy/hand
It’s a pretty similar style of work, but I haven’t communicated at all with those authors and I started my work before they published.
I have definitely not thought about that before. Feedback from people I have shown this work to has ranged from (literally) “you are a madman” to “that looks cool” (and then never engaging with it).
Added an example sentence and its embeddings. Will add more examples overall. Thanks for commenting!
Thanks for throwing it up there!!!
Thanks for your comments/questions, they’re very insightful.
In general, there are as many encoding spaces in a Transformer as there are computational nodes, and a traditional Transformer will have little incentive to use the same semantics for any two of the spaces. (There's a little bit of an incentive because of the residual connections, which will (I think?) kind of tie together the semantics of the various hidden-size-sized embedding spaces.)
In particular, the middle layer of the dense-relu-dense feedforward layer is usually chosen to be significantly larger (typically 4x) than the hidden size, so it's not even theoretically possible to represent it using the same basis. I've found that it sometimes makes sense to use anonymous seme names like x1, x2, x3, etc. in the feed-forward layer for this reason. In my experience so far, the feed-forward layers have been most useful for conjunctions and disjunctions, and there are quadratically many possible conjunctions and disjunctions of even two neurons, let alone three or four. So it seems to me that this might give a tiny hint as to why people have found that the intermediate embedding space of the feed-forward layer needs to be so large.
Of course, there is a potentially huge gap between what I am clever enough to think of as a use for them and what good old gradient descent is clever enough to think of. We can only easily lower-bound the potential uses of them; upper-bounding the capabilities of a component will prove much more challenging.
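To make the conjunction/disjunction point concrete, here is a toy numpy illustration (mine, not from the post) of a single feed-forward unit acting as a gate over two semes:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

w, b = np.array([1.0, 1.0]), -1.0         # conjunction: fires only if both semes are on
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, relu(np.dot(w, x) + b))      # outputs 0, 0, 0, 1
# With b = 0.0 the same unit acts as a disjunction (any active seme fires it),
# and there are quadratically many such pairwise gates over the semes.
```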
There are a number of ways to combine this approach with learning, but I haven't had time to try any of them yet. Some ideas I have thought of:
- Use hard-coded weights, plus some random noise, to initialize the weights of a transformer that you then train in the traditional fashion (see the sketch after this list).
  - Doesn't really help with interpretability or alignment, but might(???) help with performance.
- Write out all the weight and bias parameters as combinations of semes and outer products of semes, then learn seme embeddings by gradient descent.
  - Semantic seme embeddings could be initialized from something like WordNet relationships, or learned with word2vec, to automate those guys.
- You could do smallish amounts of gradient descent to suggest new rules to add, but then add them by hand.
  - Still would be very slow.
- Perhaps it is possible to start with a strong learned transformer, gradually identify human-legible rules that it is using, and replace those specific parts with hard-coding.
  - Could prove very difficult!!!
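A rough sketch of the first idea, assuming a PyTorch model whose parameter names line up with a hand-coded weight dictionary (both hypothetical):

```python
import torch

def init_from_hardcoded(model, hardcoded, noise_scale=0.01):
    """Initialize a transformer from hand-coded weights plus Gaussian noise,
    then train it in the traditional fashion afterwards."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in hardcoded:
                param.copy_(hardcoded[name] + noise_scale * torch.randn_like(param))
            # parameters without a hand-coded value keep their default init
```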
It seems almost certain to me that hard-coding weights would at least help us build the muscles needed to recognize what is going on, to the extent that we are able to
Agree with this.
Thanks! Enjoy your holidays!
Well now I feel kind of dumb (for misremembering how LayerNorm works). I’ve actually spent the past day since making the video wondering why information leakage of the form you describe doesn’t occur in most transformers, so it’s honestly kind of a relief to realize this.
It seems to me that ReLU is a reasonable approximation of GELU, even for networks that are actually using GELU. Since GELU(x) = x·Φ(x), one can think of the GELU as just having a slightly messy mask function (the Gaussian CDF Φ) that is sort-of-well-approximated by the ReLU's binary mask function 1[x > 0].
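As a quick illustration of how close the two masks are:

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-3, 3, 7)
print(np.round(norm.cdf(x), 3))    # GELU's soft mask Phi(x)
print((x > 0).astype(float))       # ReLU's binary mask 1[x > 0]
```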
I'm going to try to port this to python, just to see how it works, and to make it easier for other people to try variations on it. I'll post a repo link under this comment when I have it in any sort of decent state.
Started working on a python version here:
https://github.com/epurdy/dpis_spiking_network
As of now I have a (probably buggy) full translation that uses python for-loops (so it can't be made fast), and I have started on a more pythonic translation that can probably be put on a GPU relatively easily.
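For anyone curious what "more pythonic" buys here, this is the general shape of the rewrite involved (illustrative only, not dpi's actual model):

```python
# From a per-neuron Python loop to a vectorized leaky integrate-and-fire
# step that whole-array numpy (and hence a GPU backend) can run fast.
import numpy as np

def step_loop(v, inputs, threshold=1.0, decay=0.9):
    spikes = np.zeros_like(v)
    for i in range(len(v)):              # slow: pure-Python loop
        v[i] = decay * v[i] + inputs[i]
        if v[i] >= threshold:
            spikes[i], v[i] = 1.0, 0.0
    return spikes

def step_vectorized(v, inputs, threshold=1.0, decay=0.9):
    v *= decay                           # fast: whole-array operations
    v += inputs
    spikes = (v >= threshold).astype(v.dtype)
    v[v >= threshold] = 0.0
    return spikes
```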
Dpi, I welcome any contributions or corrections you have to this repository. Since you don’t know python it will probably be hard to contribute to the python versions, but even just uploading the C version would be helpful.
Let me know what license I should use for this repository, if any.
Approach:
I split the problem into two parts: first, modeling how much noise a given Who child will produce with given presents, and second, optimizing the present assignment against that model.
I declined to use the names of the Who children, since my intuition said that those shouldn’t be predictive of anything. Also, there were Who children with the same name and same ID who lived years apart, which seemed like a bug.
I tried several models (random forest, gradient boosted forest) but got the best cross-validation performance from a ridge regression with product features. I ended up using the following features:
['Age', 'BlumBlooper__Age', 'BlumBlooper', 'FumFoozler__Age', 'FumFoozler__BlumBlooper', 'FumFoozler', 'GahGinka__Age', 'GahGinka__BlumBlooper', 'GahGinka__FumFoozler', 'GahGinka', 'SlooSlonker__Age', 'SlooSlonker__BlumBlooper', 'SlooSlonker__FumFoozler', 'SlooSlonker__GahGinka', 'SlooSlonker', 'SlooSlonker__GenderDummy_F', 'SlooSlonker__GenderDummy_M', 'TrumTroopa__Age', 'TrumTroopa__BlumBlooper', 'TrumTroopa__FumFoozler', 'TrumTroopa__GahGinka', 'TrumTroopa__SlooSlonker', 'TrumTroopa', 'TrumTroopa__GenderDummy_F', 'TrumTroopa__GenderDummy_M', 'WhoWhonker__Age', 'WhoWhonker__BlumBlooper', 'WhoWhonker__FumFoozler', 'WhoWhonker__GahGinka', 'WhoWhonker__SlooSlonker', 'WhoWhonker__TrumTroopa', 'WhoWhonker', 'WhoWhonker__GenderDummy_F', 'WhoWhonker__GenderDummy_M', 'GenderDummy_F__Age', 'GenderDummy_F__BlumBlooper', 'GenderDummy_F__FumFoozler', 'GenderDummy_F__GahGinka', 'GenderDummy_F', 'GenderDummy_M__Age', 'GenderDummy_M__BlumBlooper', 'GenderDummy_M__FumFoozler', 'GenderDummy_M__GahGinka', 'GenderDummy_M__GenderDummy_F', 'GenderDummy_M']
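The modeling step was roughly of this shape (my reconstruction, not the original code; the filename and the "Noise" target column are hypothetical):

```python
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

PRESENTS = ["BlumBlooper", "FumFoozler", "GahGinka",
            "SlooSlonker", "TrumTroopa", "WhoWhonker"]

df = pd.read_csv("who_data.csv")                     # hypothetical filename
X = pd.get_dummies(df[["Age", "Gender"] + PRESENTS],
                   columns=["Gender"], prefix="GenderDummy")
# all pairwise products of the base features, as in the list above
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
model = Ridge(alpha=1.0)
model.fit(poly.fit_transform(X), df["Noise"])
```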
To optimize the noise, I assigned the presents randomly, checking that each assignment was unique. Then I ran a Markov-chain optimization procedure (essentially simulated annealing, sketched below) in which I swapped presents whenever the swap improved the score or made it worse by less than a random threshold. This procedure could probably be improved; I'm thinking about applying a quadratic programming library to the optimization step, but that seems kind of difficult.
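In sketch form (the score function and the assignment representation are stand-ins for the actual code):

```python
import math
import random

def optimize(assignment, score, n_steps=100_000, temperature=1.0, maximize=True):
    # Swapping two entries preserves uniqueness of the present assignment.
    current = score(assignment)
    for _ in range(n_steps):
        i, j = random.sample(range(len(assignment)), 2)
        assignment[i], assignment[j] = assignment[j], assignment[i]
        new = score(assignment)
        delta = (new - current) if maximize else (current - new)
        # accept improvements, or worsenings smaller than a random threshold
        if delta >= 0 or random.random() < math.exp(delta / temperature):
            current = new
        else:
            assignment[i], assignment[j] = assignment[j], assignment[i]  # undo
    return assignment
```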
Maximum noise proposal
Estimated noise: 195.72749659660874
Andy Sue Who WhoWhonker SlooSlonker
Betty Drew Who FumFoozler SlooSlonker
Sally Sue Who FumFoozler SlooSlonker
Phoebe Drew Who BlumBlooper FumFoozler
Freddie Lou Who TrumTroopa WhoWhonker
Eddie Sue Who TrumTroopa WhoWhonker
Cindy Drew Who GahGinka FumFoozler
Mary Lou Who BlumBlooper GahGinka
Ollie Lou Who BlumBlooper WhoWhonker
Johnny Drew Who TrumTroopa BlumBlooper
Minimum noise proposal
Estimated noise: 129.9544674398252
Andy Sue Who TrumTroopa GahGinka
Betty Drew Who BlumBlooper WhoWhonker
Sally Sue Who BlumBlooper WhoWhonker
Phoebe Drew Who BlumBlooper WhoWhonker
Freddie Lou Who FumFoozler GahGinka
Eddie Sue Who FumFoozler TrumTroopa
Cindy Drew Who SlooSlonker WhoWhonker
Mary Lou Who BlumBlooper SlooSlonker
Ollie Lou Who FumFoozler TrumTroopa
Johnny Drew Who FumFoozler SlooSlonker
Ah, I got confused by Phoebe Drew Who, who shows up with ids 1533 and 1553.
After I posted my first post, but before reading the other answers, it occurred to me that I was probably leaving noise on the table by not modeling the individual Who children. Reading the other answers, it seems like doing that is key.
Revised results below when taking individual idiosyncrasies into account in the ridge regression:
MIN SOLUTION
130.6603587239382
Andy Sue Who TrumTroopa FumFoozler
Betty Drew Who WhoWhonker BlumBlooper
Sally Sue Who BlumBlooper WhoWhonker
Phoebe Drew Who WhoWhonker BlumBlooper
Freddie Lou Who TrumTroopa GahGinka
Eddie Sue Who GahGinka FumFoozler
Cindy Drew Who SlooSlonker BlumBlooper
Mary Lou Who SlooSlonker WhoWhonker
Ollie Lou Who SlooSlonker FumFoozler
Johnny Drew Who TrumTroopa FumFoozler
MAX SOLUTION
210.90134871092357
Andy Sue Who SlooSlonker WhoWhonker
Betty Drew Who SlooSlonker FumFoozler
Sally Sue Who TrumTroopa FumFoozler
Phoebe Drew Who SlooSlonker FumFoozler
Freddie Lou Who WhoWhonker BlumBlooper
Eddie Sue Who BlumBlooper WhoWhonker
Cindy Drew Who GahGinka FumFoozler
Mary Lou Who TrumTroopa GahGinka
Ollie Lou Who WhoWhonker BlumBlooper
Johnny Drew Who BlumBlooper TrumTroopa
Thanks for organizing!
Feedback: I was a little bit surprised to see a perfectly regular solution. (And I did relatively poorly because of my assumption that there would not be one.) I feel like real-world data is never as clean as this; on the other hand, all data benefits from a closer look and from trying to understand whether there are regularities in the failure modes of your modeling toolkit, so maybe this is just a lesson for me. Hard to say!
Very cool stuff! Do you have the notebook on colab or something? Kind of want to find out how the story ends, whether that's in a second-half video or just from playing around with the code. At the end of this video you had what looked like fairly clean positional embeddings coming out of MLP0. Also, the paying-attention-to-self in the second attention layer could plausibly have something to do with erasing the information that comes in on that token, since that's something all transformer decoders have to do in some fashion or another.
Pretty sure the loss spikes were coming from using max rather than min when defining the learning rate schedule. Your learning rate multiplier starts at 1 and then linearly increases as step/100 once step reaches 100, which explains why it behaves itself for a while and then ultimately diverges at large step counts.
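To spell out the bug, assuming the schedule looked something like this:

```python
buggy = lambda step: max(step / 100, 1.0)  # constant 1.0, then grows without bound
fixed = lambda step: min(step / 100, 1.0)  # linear warmup, then constant 1.0
for step in (0, 50, 100, 500, 5000):
    print(step, buggy(step), fixed(step))
```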
Oops, did not read the post carefully enough, you’ve already linked to the colab!
Yeah, just changing the max to a min produces this much smoother loss curve from your notebook.
This was an amazing article, thank you for posting it!
Side tangent: There’s an annoying paradox that: (1) In RL, there’s no “zero of reward”, you can uniformly add 99999999 to every reward signal and it makes no difference whatsoever; (2) In life, we have a strong intuition that experiences can be good, bad, or neutral; (3) …Yet presumably what our brain is doing has something to do with RL! That “evolutionary prior” I just mentioned is maybe relevant to that? Not sure … food for thought …
The above isn't quite true in all senses for all RL algorithms. For example, in policy gradient algorithms (see http://www.scholarpedia.org/article/Policy_gradient_methods for a good but fairly technical introduction) it is quite important in practice to subtract a baseline value from the reward that is fed into the policy gradient update. (Note that the baseline can be, and most profitably is, dynamic: it's a function of the state the agent is in, usually chosen to be the state-value function V(s), i.e. the expected Q-value under the current policy, which makes the update proportional to the advantage Q(s,a) − V(s).) The algorithm will in theory converge to the right value without the baseline, but subtracting the baseline speeds convergence up significantly. If one guesses that the brain is using a policy-gradients-like algorithm, a similar principle would presumably apply. This actually dovetails quite nicely with observed human psychology: good/bad/neutral is a thing, but it seems to be defined largely with respect to our expectation of what was going to happen in the situation we were in. For example, many people get shitty when it turns out they aren't going to end up having sex that they thought they were going to have. Here the theory would be that the baseline value was quite high (they were anticipating a peak experience), so the policy gradients update will essentially treat the outcome as an aversive stimulus, which makes no sense without the existence of the baseline.
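Schematically (names illustrative), the baseline-subtracted update looks like:

```python
def reinforce_update(theta, grad_logp, reward, baseline, lr=0.01):
    # Subtracting b(s) leaves the expected gradient unchanged (since
    # E[grad log pi] = 0) but recenters the update: a below-expectation
    # outcome now pushes the action's probability down, acting like an
    # aversive stimulus even if the raw reward was positive.
    advantage = reward - baseline
    return theta + lr * advantage * grad_logp
```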
It's closer to being true of Q-learning algorithms, but here too there is a catch: whatever value you assign to never-before-seen states can have a pretty dramatic effect on exploration dynamics, at least in tabular environments (i.e. environments with negligible generalization). So here too one would expect that there is an evolutionarily appropriate level of optimism to apply to genuinely novel situations about which it is difficult to form an a priori judgment, and the difference between this and the value you assign to known situations is at least probably known to evolution.
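A tiny tabular example of how that initial value shapes exploration (sizes chosen to match the Taxi environment; the optimism constant is the knob in question):

```python
import numpy as np

n_states, n_actions = 500, 6   # Taxi's state/action counts
optimism = 20.0                # value assigned to never-before-seen state-actions
Q = np.full((n_states, n_actions), optimism)
# Compared to Q = np.zeros(...), a greedy agent now keeps trying untried
# actions until their estimates get pulled down toward observed returns,
# so this single constant directly controls the exploration dynamics.
```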