“We are computer scientists. We do not lack in faith.” (Ketan Mulmuley)
MadHatter
Various thoughts:
It would make a lot of sense to me if norepinephrine acted as a Q-like signal for negative rewards. I don’t have any neuroscience evidence for this, but it makes sense to me that negative rewards and positive rewards are very different for animals and would benefit from different approaches. I once ran some Q-learning experiments on the classic Taxi environment to see if I could make a satisficing agent (one that achieves a certain reward less than the maximum achievable and then rests). The agent responded by taking illegal actions that give highly negative rewards in the Taxi environment and hustling as hard as possible the rest of the time to achieve the reward specified. So I had to add a Q-function solely for negative rewards to get the desired behavior. Given that actual animals need to rest in a way that RL agents don’t have to in most environments, it makes sense to me that Q-learning on its own is not a good brain architecture.
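For concreteness, here is a minimal tabular sketch of the two-Q-function idea (not the original experiment; the names and the reward split are illustrative):

```python
# Sketch: separate Q-functions for positive and negative rewards, so a
# satisficing agent can stop chasing reward without walking into penalties.
import numpy as np

def update(Q_pos, Q_neg, s, a, r, s2, alpha=0.1, gamma=0.99):
    # Split the reward into its positive and negative parts and run an
    # ordinary Q-learning backup on each.
    r_pos, r_neg = max(r, 0.0), min(r, 0.0)
    Q_pos[s, a] += alpha * (r_pos + gamma * Q_pos[s2].max() - Q_pos[s, a])
    Q_neg[s, a] += alpha * (r_neg + gamma * Q_neg[s2].max() - Q_neg[s, a])

def act(Q_pos, Q_neg, s, satisfied):
    # Once the reward target is met, ignore positive prospects and act only
    # to avoid predicted punishment ("rest"); otherwise combine both signals.
    scores = Q_neg[s] if satisfied else Q_pos[s] + Q_neg[s]
    return int(np.argmax(scores))
```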
Dopamine receptors in V1 kind of make sense if you want to visually predict reward-like properties of objects in the environment. Like something could look tasty or not tasty, maybe.
It’s very, very rough, but: https://github.com/epurdy/hand
It’s a pretty similar style of work, but I haven’t communicated at all with those authors and I started my work before they published.
I have definitely not thought about that before. Feedback from people I have shown this work to has ranged from (literally) “you are a madman” to “that looks cool” (and then never engaging with it).
Added an example sentence and its embeddings. Will add more examples overall. Thanks for commenting!
Thanks for throwing it up there!!!
Thanks for your comments/questions, they’re very insightful.
In general, there are as many encoding spaces in a Transformer as there are computational nodes, and a traditional Transformer will have little incentive to use the same semantics for any two of the spaces. (There's a little bit of an incentive because of the residual connections, which will (I think?) kind of tie together the semantics of the various hidden-size-sized embedding spaces.)
In particular, the middle layer of the dense-relu-dense feedforward layer is usually chosen to be significantly larger (typically 4x) than the hidden size, so it's not even theoretically possible to represent it using the same basis. I've found that it sometimes makes sense to use anonymous seme names like x1, x2, x3, etc. in the feed-forward layer for this reason. In my experience so far, the feed-forward layers have been most useful for conjunctions and disjunctions, and there are quadratically many possible conjunctions and disjunctions of even two neurons, let alone three or four. So it seems to me that this might give a tiny hint as to why people have found that the intermediate embedding space of the feed-forward layer needs to be so large.
Of course, there is a potentially huge gap between what I am clever enough to think of as a use for them and what good old gradient descent is clever enough to think of. We can only easily lower-bound the potential uses of them; upper-bounding the capabilities of a component will prove much more challenging.
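To make the conjunction/disjunction point concrete, here is a toy numpy illustration (mine, not from the post) of a single feed-forward unit acting as a gate over two semes:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

w, b = np.array([1.0, 1.0]), -1.0         # conjunction: fires only if both semes are on
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, relu(np.dot(w, x) + b))      # outputs 0, 0, 0, 1
# With b = 0.0 the same unit acts as a disjunction (any active seme fires it),
# and there are quadratically many such pairwise gates over the semes.
```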
There are a number of ways to combine this approach with learning, but I haven't had time to try any of them yet. Some ideas I have thought of:
- Use hard-coded weights, plus some random noise, to initialize the weights of a transformer that you then train in the traditional fashion (see the sketch after this list).
  - Doesn't really help with interpretability or alignment, but might(???) help with performance.
- Write out all the weight and bias parameters as combinations of semes and outer products of semes, then learn seme embeddings by gradient descent.
  - Semantic seme embeddings could be initialized from something like WordNet relationships, or learned with word2vec, to automate those guys.
- You could do smallish amounts of gradient descent to suggest new rules to add, but then add them by hand.
  - Still would be very slow.
- Perhaps it is possible to start with a strong learned transformer, gradually identify human-legible rules that it is using, and replace those specific parts with hard-coding.
  - Could prove very difficult!!!
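A rough sketch of the first idea, assuming a PyTorch model whose parameter names line up with a hand-coded weight dictionary (both hypothetical):

```python
import torch

def init_from_hardcoded(model, hardcoded, noise_scale=0.01):
    """Initialize a transformer from hand-coded weights plus Gaussian noise,
    then train it in the traditional fashion afterwards."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in hardcoded:
                param.copy_(hardcoded[name] + noise_scale * torch.randn_like(param))
            # parameters without a hand-coded value keep their default init
```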
It seems almost certain to me that hard-coding weights would at least help us build the muscles needed to recognize what is going on, to the extent that we are able to
Agree with this.
Thanks! Enjoy your holidays!
Well now I feel kind of dumb (for misremembering how LayerNorm works). I’ve actually spent the past day since making the video wondering why information leakage of the form you describe doesn’t occur in most transformers, so it’s honestly kind of a relief to realize this.
It seems to me that ReLU is a reasonable approximation of GELU, even for networks that are actually using GELU. Since GELU(x) = x·Φ(x), one can think of the GELU as just having a slightly messy mask function (the Gaussian CDF Φ) that is sort-of-well-approximated by the ReLU's binary mask function 1[x > 0].
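As a quick illustration of how close the two masks are:

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-3, 3, 7)
print(np.round(norm.cdf(x), 3))    # GELU's soft mask Phi(x)
print((x > 0).astype(float))       # ReLU's binary mask 1[x > 0]
```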
I'm going to try to port this to python, just to see how it works, and to make it easier for other people to try variations on it. I'll post a repo link under this comment when I have it in any sort of decent state.
Started working on a python version here:
https://github.com/epurdy/dpis_spiking_network
As of now I have a (probably buggy) full translation that uses python for-loops (so it can't be made fast), and I have started on a more pythonic translation that can probably be put on a GPU relatively easily.
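For anyone curious what "more pythonic" buys here, this is the general shape of the rewrite involved (illustrative only, not dpi's actual model):

```python
# From a per-neuron Python loop to a vectorized leaky integrate-and-fire
# step that whole-array numpy (and hence a GPU backend) can run fast.
import numpy as np

def step_loop(v, inputs, threshold=1.0, decay=0.9):
    spikes = np.zeros_like(v)
    for i in range(len(v)):              # slow: pure-Python loop
        v[i] = decay * v[i] + inputs[i]
        if v[i] >= threshold:
            spikes[i], v[i] = 1.0, 0.0
    return spikes

def step_vectorized(v, inputs, threshold=1.0, decay=0.9):
    v *= decay                           # fast: whole-array operations
    v += inputs
    spikes = (v >= threshold).astype(v.dtype)
    v[v >= threshold] = 0.0
    return spikes
```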
Dpi, I welcome any contributions or corrections you have to this repository. Since you don’t know python it will probably be hard to contribute to the python versions, but even just uploading the C version would be helpful.
Let me know what license I should use for this repository, if any.
Approach:
I split the problem into two parts: first, modeling how much noise a given Who child will produce with given presents, and second, optimizing the present assignment against that model.
I declined to use the names of the Who children, since my intuition said that those shouldn’t be predictive of anything. Also, there were Who children with the same name and same ID who lived years apart, which seemed like a bug.
I tried several models (random forest, gradient boosted forest) but got the best cross-validation performance from a ridge regression with product features. I ended up using the following features:
['Age', 'BlumBlooper__Age', 'BlumBlooper', 'FumFoozler__Age', 'FumFoozler__BlumBlooper', 'FumFoozler', 'GahGinka__Age', 'GahGinka__BlumBlooper', 'GahGinka__FumFoozler', 'GahGinka', 'SlooSlonker__Age', 'SlooSlonker__BlumBlooper', 'SlooSlonker__FumFoozler', 'SlooSlonker__GahGinka', 'SlooSlonker', 'SlooSlonker__GenderDummy_F', 'SlooSlonker__GenderDummy_M', 'TrumTroopa__Age', 'TrumTroopa__BlumBlooper', 'TrumTroopa__FumFoozler', 'TrumTroopa__GahGinka', 'TrumTroopa__SlooSlonker', 'TrumTroopa', 'TrumTroopa__GenderDummy_F', 'TrumTroopa__GenderDummy_M', 'WhoWhonker__Age', 'WhoWhonker__BlumBlooper', 'WhoWhonker__FumFoozler', 'WhoWhonker__GahGinka', 'WhoWhonker__SlooSlonker', 'WhoWhonker__TrumTroopa', 'WhoWhonker', 'WhoWhonker__GenderDummy_F', 'WhoWhonker__GenderDummy_M', 'GenderDummy_F__Age', 'GenderDummy_F__BlumBlooper', 'GenderDummy_F__FumFoozler', 'GenderDummy_F__GahGinka', 'GenderDummy_F', 'GenderDummy_M__Age', 'GenderDummy_M__BlumBlooper', 'GenderDummy_M__FumFoozler', 'GenderDummy_M__GahGinka', 'GenderDummy_M__GenderDummy_F', 'GenderDummy_M']
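The modeling step was roughly of this shape (my reconstruction, not the original code; the filename and the "Noise" target column are hypothetical):

```python
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

PRESENTS = ["BlumBlooper", "FumFoozler", "GahGinka",
            "SlooSlonker", "TrumTroopa", "WhoWhonker"]

df = pd.read_csv("who_data.csv")                     # hypothetical filename
X = pd.get_dummies(df[["Age", "Gender"] + PRESENTS],
                   columns=["Gender"], prefix="GenderDummy")
# all pairwise products of the base features, as in the list above
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
model = Ridge(alpha=1.0)
model.fit(poly.fit_transform(X), df["Noise"])
```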
To optimize the noise, I assigned the presents randomly, checking that each assignment was unique. Then I ran a Markov-chain optimization procedure (essentially simulated annealing, sketched below) in which I swapped presents whenever the swap improved the score or made it worse by less than a random threshold. This procedure could probably be improved; I'm thinking about applying a quadratic programming library to the optimization step, but that seems kind of difficult.
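In sketch form (the score function and the assignment representation are stand-ins for the actual code):

```python
import math
import random

def optimize(assignment, score, n_steps=100_000, temperature=1.0, maximize=True):
    # Swapping two entries preserves uniqueness of the present assignment.
    current = score(assignment)
    for _ in range(n_steps):
        i, j = random.sample(range(len(assignment)), 2)
        assignment[i], assignment[j] = assignment[j], assignment[i]
        new = score(assignment)
        delta = (new - current) if maximize else (current - new)
        # accept improvements, or worsenings smaller than a random threshold
        if delta >= 0 or random.random() < math.exp(delta / temperature):
            current = new
        else:
            assignment[i], assignment[j] = assignment[j], assignment[i]  # undo
    return assignment
```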
Maximum noise proposal
Estimated noise: 195.72749659660874
Andy Sue Who WhoWhonker SlooSlonker
Betty Drew Who FumFoozler SlooSlonker
Sally Sue Who FumFoozler SlooSlonker
Phoebe Drew Who BlumBlooper FumFoozler
Freddie Lou Who TrumTroopa WhoWhonker
Eddie Sue Who TrumTroopa WhoWhonker
Cindy Drew Who GahGinka FumFoozler
Mary Lou Who BlumBlooper GahGinka
Ollie Lou Who BlumBlooper WhoWhonker
Johnny Drew Who TrumTroopa BlumBlooper
Minimum noise proposal
Estimated noise: 129.9544674398252
Andy Sue Who TrumTroopa GahGinka
Betty Drew Who BlumBlooper WhoWhonker
Sally Sue Who BlumBlooper WhoWhonker
Phoebe Drew Who BlumBlooper WhoWhonker
Freddie Lou Who FumFoozler GahGinka
Eddie Sue Who FumFoozler TrumTroopa
Cindy Drew Who SlooSlonker WhoWhonker
Mary Lou Who BlumBlooper SlooSlonker
Ollie Lou Who FumFoozler TrumTroopa
Johnny Drew Who FumFoozler SlooSlonker
Ah, I got confused by Phoebe Drew Who, who shows up with ids 1533 and 1553.
After I posted my first post, but before reading the other answers, it occurred to me that I was probably leaving noise on the table by not modeling the individual Who children. Reading the other answers, it seems like doing that is key.
Revised results below when taking individual idiosyncrasies into account in the ridge regression:
MIN SOLUTION
130.6603587239382
Andy Sue Who TrumTroopa FumFoozler
Betty Drew Who WhoWhonker BlumBlooper
Sally Sue Who BlumBlooper WhoWhonker
Phoebe Drew Who WhoWhonker BlumBlooper
Freddie Lou Who TrumTroopa GahGinka
Eddie Sue Who GahGinka FumFoozler
Cindy Drew Who SlooSlonker BlumBlooper
Mary Lou Who SlooSlonker WhoWhonker
Ollie Lou Who SlooSlonker FumFoozler
Johnny Drew Who TrumTroopa FumFoozler
MAX SOLUTION
210.90134871092357
Andy Sue Who SlooSlonker WhoWhonker
Betty Drew Who SlooSlonker FumFoozler
Sally Sue Who TrumTroopa FumFoozler
Phoebe Drew Who SlooSlonker FumFoozler
Freddie Lou Who WhoWhonker BlumBlooper
Eddie Sue Who BlumBlooper WhoWhonker
Cindy Drew Who GahGinka FumFoozler
Mary Lou Who TrumTroopa GahGinka
Ollie Lou Who WhoWhonker BlumBlooper
Johnny Drew Who BlumBlooper TrumTroopa
Thanks for organizing!
Feedback: I was a little bit surprised to see a perfectly regular solution. (And I did relatively poorly because of my assumption that there would not be one.) I feel like real-world data is never as clean as this; on the other hand, all data benefits from a closer look and from trying to understand whether there are regularities in the failure modes of your modeling toolkit, so maybe this is just a lesson for me. Hard to say!
Very cool stuff! Do you have the notebook on colab or something? Kind of want to find out how the story ends, whether that's in a second-half video or just from playing around with the code. At the end of this video you had what looked like fairly clean positional embeddings coming out of MLP0. Also, the paying-attention-to-self in the second attention layer could plausibly have something to do with erasing the information that comes in on that token, since that's something all transformer decoders have to do in some fashion or another.
Pretty sure the loss spikes were coming from using max rather than min when defining the learning rate schedule. Your learning rate multiplier starts at 1 and then linearly increases as step/100 once step reaches 100, which explains why it behaves itself for a while and then ultimately diverges at large step counts.
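To spell out the bug, assuming the schedule looked something like this:

```python
buggy = lambda step: max(step / 100, 1.0)  # constant 1.0, then grows without bound
fixed = lambda step: min(step / 100, 1.0)  # linear warmup, then constant 1.0
for step in (0, 50, 100, 500, 5000):
    print(step, buggy(step), fixed(step))
```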
Oops, did not read the post carefully enough, you’ve already linked to the colab!
Yeah, just changing the max to a min produces this much smoother loss curve from your notebook.
This was an amazing article, thank you for posting it!
Side tangent: There’s an annoying paradox that: (1) In RL, there’s no “zero of reward”, you can uniformly add 99999999 to every reward signal and it makes no difference whatsoever; (2) In life, we have a strong intuition that experiences can be good, bad, or neutral; (3) …Yet presumably what our brain is doing has something to do with RL! That “evolutionary prior” I just mentioned is maybe relevant to that? Not sure … food for thought …
The above isn't quite true in all senses for all RL algorithms. For example, in policy gradient algorithms (see http://www.scholarpedia.org/article/Policy_gradient_methods for a good but fairly technical introduction) it is quite important in practice to subtract a baseline value from the reward that is fed into the policy gradient update. (Note that the baseline can be, and most profitably is, dynamic: it's a function of the state the agent is in, usually chosen to be the state-value function V(s), i.e. the expected Q-value under the current policy, which makes the update proportional to the advantage Q(s,a) − V(s).) The algorithm will in theory converge to the right value without the baseline, but subtracting the baseline speeds convergence up significantly. If one guesses that the brain is using a policy-gradients-like algorithm, a similar principle would presumably apply. This actually dovetails quite nicely with observed human psychology: good/bad/neutral is a thing, but it seems to be defined largely with respect to our expectation of what was going to happen in the situation we were in. For example, many people get shitty when it turns out they aren't going to end up having sex that they thought they were going to have. Here the theory would be that the baseline value was quite high (they were anticipating a peak experience), so the policy gradients update will essentially treat the outcome as an aversive stimulus, which makes no sense without the existence of the baseline.
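Schematically (names illustrative), the baseline-subtracted update looks like:

```python
def reinforce_update(theta, grad_logp, reward, baseline, lr=0.01):
    # Subtracting b(s) leaves the expected gradient unchanged (since
    # E[grad log pi] = 0) but recenters the update: a below-expectation
    # outcome now pushes the action's probability down, acting like an
    # aversive stimulus even if the raw reward was positive.
    advantage = reward - baseline
    return theta + lr * advantage * grad_logp
```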
It's closer to being true of Q-learning algorithms, but here too there is a catch: whatever value you assign to never-before-seen states can have a pretty dramatic effect on exploration dynamics, at least in tabular environments (i.e. environments with negligible generalization). So here too one would expect that there is an evolutionarily appropriate level of optimism to apply to genuinely novel situations about which it is difficult to form an a priori judgment, and the difference between this and the value you assign to known situations is at least probably known to evolution.
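A tiny tabular example of how that initial value shapes exploration (sizes chosen to match the Taxi environment; the optimism constant is the knob in question):

```python
import numpy as np

n_states, n_actions = 500, 6   # Taxi's state/action counts
optimism = 20.0                # value assigned to never-before-seen state-actions
Q = np.full((n_states, n_actions), optimism)
# Compared to Q = np.zeros(...), a greedy agent now keeps trying untried
# actions until their estimates get pulled down toward observed returns,
# so this single constant directly controls the exploration dynamics.
```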