You are absolutely right—and the references are great. Do you happen to have access to copies that you can send? It’s a bit hard to know what’s proven and what’s not here since a lot of the papers are paywalled.
Sumio Watanabe actually emailed me and pointed this out as well. I had a cached memory of RLCT(0) being width/2 (so dim/4) in the analytic activation case, which was incorrect. In fact, the paper Watanabe sent me gives only an upper bound, so I wrote up a quick note giving a rough lower bound of the same order. I was planning to update this post as soon as the note is on arXiv, but if the paper you mentioned has a lower bound then that’s great, and I can cite it.
I think this doesn’t change the fundamental issue though. The free energy here is bounded by
Thanks a lot for this!
Yes, for a linear neural net the RLCT is much lower. In fact, you get a similarly low RLCT if your activation function has a “sparse” Taylor series, such as a theta function. If I’m not mistaken, in order to get a lower bound on the RLCT of type you need to assume that the Taylor series of the activation function has a positive density of nonzero terms.
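For a concrete toy case of how a linear model ends up below the regular value dim/2, here is a rough sketch. It is my own illustration, not taken from any of the papers mentioned. The two-parameter product model, the unit noise variance, the zero true function, and the uniform prior on a small box of half-width ε are all assumptions made to keep the computation short.

```latex
% Toy model (assumed for illustration): y = a b x + noise, noise ~ N(0,1),
% true function identically 0, parameters w = (a, b), so dim = 2.
% KL divergence from the truth, with c = E[x^2] / 2:
K(a,b) = c \, a^2 b^2
% Zeta function over the box [-\epsilon, \epsilon]^2 (uniform prior assumed):
\zeta(z) = \iint K(a,b)^z \, da \, db
         = c^z \Big( \int_{-\epsilon}^{\epsilon} |a|^{2z} \, da \Big)^{2}
         = c^z \Big( \frac{2 \, \epsilon^{2z+1}}{2z+1} \Big)^{2}
% The largest pole is at z = -1/2, with multiplicity 2, hence
\lambda = \tfrac{1}{2}, \qquad m = 2,
% compared with \lambda = \dim / 2 = 1 for a regular two-parameter model.
```

So even in this smallest linear case the RLCT is half of what a regular model with the same parameter count would give.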