Very nice post. It is certainly useful to do this exercise of manually encoding language rules into the weights of a transformer in order to better understand the machinery involved.
“The ultimate ambition of this work would be to go toe-to-toe with a comparably-sized Transformer model trained in the traditional way on a modern-sized data set. This might require several people-years of focused effort though.”
There is a long history of attempting to parse natural language with hand-designed rules and heuristics. The general consensus now is that hand engineering alone is insufficient and that some learning from data is necessary. To me, this direction seems to inherit the problems of those old-fashioned language systems, since you are codifying your own hand-designed heuristics and rules into the network weights.
Do you see a way to introduce learning from data without sacrificing the interpretability that your approach provides?