Are there papers backing the claims on why mixed activation functions perform worse?
No, there are no papers on that topic that I know of. There are relatively few papers that work on mixed activation functions at all. You should understand that papers that don’t show at least a marginal increase on some niche benchmark tend not to get published. So, much of the work on mixed activation functions went unpublished.
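To be clear about what "mixed activation functions" means here: a minimal sketch, assuming the common definition where different units in the same layer apply different activation functions (the function names and per-unit assignment below are illustrative, not from any particular paper):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def swish(x):
    # swish(x) = x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def mixed_activation(z, fns):
    """Apply fns[i] to unit i of the pre-activation vector z,
    so one layer mixes several activation functions."""
    return np.array([f(v) for f, v in zip(fns, z)])

z = np.array([-1.0, 2.0, -0.5, 3.0])
fns = [relu, swish, relu, np.tanh]  # hypothetical per-unit assignment
out = mixed_activation(z, fns)
```

A homogeneous layer is the special case where `fns` repeats a single function.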
But I can link to papers on testing mixed activation functions. Here's a Bachelor's thesis from 2022 that did relatively extensive testing: it evolved activation function sets for a particular application and got slightly better performance than ReLU/Swish.
That's an unfair comparison on its own, because adapting activation functions to a particular task can improve performance regardless of mixing. However, the thesis also ran its evolutionary search on single functions, and that approach did about as well as the mixed sets.
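The kind of search described above can be sketched as simple mutation-based hill climbing over per-layer activation choices. This is a toy illustration under my own assumptions, not the thesis's actual algorithm; in particular, `fitness` is a placeholder where a real search would train a network with the candidate assignment and return validation accuracy:

```python
import random

ACTIVATIONS = ["relu", "swish", "tanh", "sigmoid"]
NUM_LAYERS = 4  # hypothetical network depth

def mutate(assignment):
    """Randomly reassign one layer's activation function."""
    child = list(assignment)
    child[random.randrange(NUM_LAYERS)] = random.choice(ACTIVATIONS)
    return child

def fitness(assignment):
    # Placeholder objective: a real search would train and evaluate
    # a network using this per-layer activation assignment.
    return sum(1.0 for a in assignment if a in ("relu", "swish"))

def evolve(generations=50, seed=0):
    random.seed(seed)
    best = [random.choice(ACTIVATIONS) for _ in range(NUM_LAYERS)]
    best_fit = fitness(best)
    for _ in range(generations):
        child = mutate(best)
        f = fitness(child)
        if f >= best_fit:  # accept ties to keep exploring plateaus
            best, best_fit = child, f
    return best, best_fit

best, best_fit = evolve()
```

Restricting every layer to the same function turns this into the "evolved single function" baseline the thesis compared against.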
So far so good. But when the network was scaled up from VGG-HE-2 to VGG-HE-4, the evolved activation sets all got worse, while ReLU and Swish got better. The best mixed activation set dropped from 80% to 10% accuracy as the network was scaled up; the evolved single functions held up better, but all ended up worse than Swish.
One of the issues I mentioned with mixed activation functions is specific to SGD training; there’s also been some work on using them with neuroevolution.