I think the problem is that many things along the rough lines of what you’re describing have been attempted in the past, and have turned out to work not-so-well (TBF, they also were attempted with older systems: not even sure if anyone’s tried to make something like an expert system with a full-fledged transformer). The common wisdom the field has derived from those experiences seems to have been “stochastic gradient descent knows best, just throw your data into a function and let RNJesus sort it out”. Which is… not the best lesson IMO. I think there might be worth in the things you suggest and they are intuitively appealing to me too. But as it turns out when an industry is driven by racing to the goal rather than a genuine commitment to proper scientific understanding they end up taking all sorts of shortcuts. Who could have guessed.
I think the problem is that many things along the rough lines of what you’re describing have been attempted in the past, and have turned out to work not-so-well (TBF, they also were attempted with older systems: not even sure if anyone’s tried to make something like an expert system with a full-fledged transformer). The common wisdom the field has derived from those experiences seems to have been “stochastic gradient descent knows best, just throw your data into a function and let RNJesus sort it out”. Which is… not the best lesson IMO. I think there might be worth in the things you suggest and they are intuitively appealing to me too. But as it turns out when an industry is driven by racing to the goal rather than a genuine commitment to proper scientific understanding they end up taking all sorts of shortcuts. Who could have guessed.