While I am optimistic about simple algorithmic changes improving the interpretability situation (the difference between L1 and L2 regularization is a big source of hope here, for example), I think the difficulty floor is determined by the complexity of the underlying subject matter that needs to be encoded, and for LLMs / natural language that's going to be very complex. (And if you use an architecture that can't support things as complex as the underlying subject matter, the optimal model for that architecture will correspondingly have high loss.)
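(As a minimal sketch of the L1 vs L2 point, and not anything from the original comment: the example below uses scikit-learn's Lasso and Ridge on synthetic data where only a few features matter. The assumed setup is just for illustration, but it shows why L1 is the "hopeful" case for interpretability: it zeroes out irrelevant weights entirely, while L2 merely shrinks them.)

```python
# Sketch (assumed setup): L1 vs L2 regularization and weight sparsity.
# L1 tends to drive irrelevant weights exactly to zero, leaving a small,
# readable set of active features; L2 shrinks everything but keeps it dense.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n_samples, n_features = 200, 50

# Only the first 5 features actually matter; the rest are noise.
X = rng.normal(size=(n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:5] = [3.0, -2.0, 1.5, 4.0, -1.0]
y = X @ true_coef + 0.1 * rng.normal(size=n_samples)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)   # L2 penalty

# L1 typically leaves only a handful of nonzero weights; L2 leaves all 50.
print("nonzero weights under L1:", int(np.sum(np.abs(lasso.coef_) > 1e-6)))
print("nonzero weights under L2:", int(np.sum(np.abs(ridge.coef_) > 1e-6)))
```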