AFAIK the smoothness can add useful properties at training time, because the gradient is better behaved around zero. And ReLUs won over sigmoids because they only saturate on one side, which lets gradients propagate across many layers (whereas a sigmoid saturates on both sides, so as soon as a unit gets pushed into either flat region its gradient goes to ~0 and it becomes very hard to dislodge it; stack a few layers and those near-zero gradients multiply into nothing).
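To make that concrete, here's a toy numpy sketch. It's not a real network, just the scalar chain rule through a stack of identical activations, with the depth and pre-activation value made up for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 at x=0, ~0 once saturated

def relu_grad(x):
    return 1.0 if x > 0 else 0.0  # flat on one side only

x = 3.0     # a pre-activation pushed into sigmoid's saturated region
depth = 10  # layers the backward signal has to pass through

# chaining the local derivatives through `depth` layers:
print(sigmoid_grad(x) ** depth)  # ~4e-14: gradient effectively vanished
print(relu_grad(x) ** depth)     # 1.0: passes through untouched
```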
NNs are weird functions, and I don’t think you can really describe most of what you do in ML with smooth manifolds. Kolmogorov–Arnold function approximators, which are related to NNs (really NNs are a subset of K–A approximators), are known to involve weird inner functions that are continuous but not necessarily smooth. And lots of problems, like classification (which, btw, is essentially what text prediction is), aren’t smooth to begin with.
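On the "text prediction is classification" point, a minimal sketch (the vocabulary size, logits, and target index here are all made up for illustration): the model emits one score per vocabulary item, and training is just N-way cross-entropy against the actual next token:

```python
import numpy as np

logits = np.array([2.0, -1.0, 0.5, 0.1, -3.0])  # hypothetical model output,
                                                # one logit per vocab token
next_token = 2                                  # index of the true next token

# softmax over the vocabulary -> a probability per candidate token
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# standard cross-entropy loss: exactly the classification setup,
# with one "correct class" per position
loss = -np.log(probs[next_token])
print(loss)
```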
There is some intuition that you have to enforce some sort of smoothness-like property as a way of generalizing the knowledge and combating overfitting; that’s what regularization is for. But it’s all very vibey. What you would really need is a proper universal prior over your function that you then update with your training data, and we have no idea what that looks like—only empirical knowledge that some shit seems to work better for whatever reason.
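For a concrete (if crude) instance of regularization-as-prior: ridge regression, where the L2 penalty on the weights is literally a Gaussian prior centered at zero, pulling the fit toward something smoother. Everything below (the data, the polynomial degree, the lambda values) is made up for illustration:

```python
import numpy as np

# Noisy samples of a smooth function, fit with a degree-11 polynomial:
# plenty of capacity to overfit the noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(20)
X = np.vander(x, 12)  # polynomial feature matrix

for lam in [0.0, 1e-3]:
    # closed-form ridge solution: w = (X^T X + lam * I)^{-1} X^T y
    w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    print(lam, np.abs(w).max())  # the penalty tames the wild coefficients
```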