> but it would benefit a lot from a more in depth review of the existing literature and methodologies
This is definitely not wrong. It probably depends a bit on what one sees as the goal here, though: in my case, after weeks of reading and exercises, I was in dire need of just playing around with things more freely. So it was less “how can I create the most valuable possible LessWrong post?” and more “I’m going to do this thing anyway, and documenting the project in a LessWrong post is probably better than not doing so”. I’m not totally confident about that, though.
Thanks a lot for the list! Very interesting points that give me some ideas of where to go next.
> the piecewise approximation thing is a pretty widely accepted opinion in ML (...) local approximation of the loss landscape as approximately quadratic is pretty widely accepted
Sounds like good news to me! Certainly not the worst thing if what I found happens to be widely accepted. (It also makes sense now that Claude predicted both of these outcomes: I only asked it after getting to the findings, but phrased the question as if I didn’t know what to expect. So I guess it was less about Claude being very smart and more about these things being relatively well established anyway.)
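For reference, the “approximately quadratic” picture is just the second-order Taylor expansion of the loss around a local minimum $\theta^*$, where the gradient term vanishes:

$$L(\theta) \;\approx\; L(\theta^*) + \tfrac{1}{2}\,(\theta - \theta^*)^\top \nabla^2 L(\theta^*)\,(\theta - \theta^*)$$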
> a factor of 3 loss difference is huge! if you want to claim that smooth actfn is better beyond what’s explained by the loss, you need to compare two models with the same loss but different actfn
Fair point! My coverage of that in the post wasn’t great. What I didn’t mention there is that, while rendering the two videos of model3 and model4 “learning Mandelbrot”, model4 already had the shape (including the “fractal depth”) down well before it got anywhere near the loss that model3 reached at the end. So even at a much lower loss, the ReLU Mandelbrot looked much worse than the SiLU one. But I’ll look into that again and make a proper comparison (see the sketch below for roughly what I have in mind).
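A minimal sketch of the matched-loss comparison I’m imagining (not the code from the post; the architecture, training domain, and hyperparameters are illustrative assumptions): train the same MLP once with ReLU and once with SiLU on Mandelbrot escape times, then find the earliest SiLU step whose loss already matches the ReLU model’s final loss, so renders can be compared at equal loss rather than equal training time.

```python
# Matched-loss comparison sketch (illustrative, not the code from the post):
# train the same MLP with ReLU vs. SiLU on Mandelbrot escape times and find
# the SiLU training step at which its loss matches the ReLU model's final loss.
import torch
import torch.nn as nn

def mandelbrot_target(xy, max_iter=50):
    """Normalized escape time in [0, 1] for points xy of shape (N, 2)."""
    c = torch.complex(xy[:, 0], xy[:, 1])
    z = torch.zeros_like(c)
    escape = torch.full(c.shape, float(max_iter))
    for i in range(max_iter):
        z = z * z + c
        newly_escaped = (escape == max_iter) & (z.abs() > 2)
        escape[newly_escaped] = float(i)
        z = torch.where(z.abs() > 2, torch.zeros_like(z), z)  # keep z bounded
    return (escape / max_iter).unsqueeze(1)

def make_mlp(act):
    return nn.Sequential(
        nn.Linear(2, 128), act(),
        nn.Linear(128, 128), act(),
        nn.Linear(128, 128), act(),
        nn.Linear(128, 1),
    )

def train(model, steps=5000, batch=4096, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    losses = []
    for _ in range(steps):
        # Sample points from x in [-2, 1], y in [-1.25, 1.25].
        xy = torch.rand(batch, 2) * torch.tensor([3.0, 2.5]) + torch.tensor([-2.0, -1.25])
        loss = nn.functional.mse_loss(model(xy), mandelbrot_target(xy))
        opt.zero_grad()
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return losses

relu_losses = train(make_mlp(nn.ReLU))
silu_losses = train(make_mlp(nn.SiLU))

# Earliest SiLU step whose loss is already at or below the ReLU model's final loss.
target = min(relu_losses[-100:])  # best recent ReLU loss as the reference
matched = next((i for i, l in enumerate(silu_losses) if l <= target), None)
print(f"SiLU matched the ReLU model's final loss at step {matched} of {len(silu_losses)}")
```

In practice one would smooth the per-batch losses and checkpoint the SiLU model at the matched step, so that both Mandelbrot renders can be compared side by side at the same loss.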
Thanks for the comment!