Can basically attest to all of these: I've been doing intensive ML upskilling for the last six months, and almost all of them have held true. Highlights include:
- Not properly setting up the attention mechanism in multiple experiments, resulting in the conclusion that attention didn't do much (lmao)
- So, so many off-by-one and off-by-two errors, especially in next-token prediction setups (first sketch after this list)
- Entire series of weeks-long experiments that turned out to be completely useless (usually based on a seemingly-reasonable intuition of some kind)
- Accidentally overwriting/resetting the residual (recurrent) state, so the RNN was just an NN with a funky hat on (second sketch below)
- I now hate shapes, reshaping, squeezing, unsqueezing, devices, torch.nn.functional.pad, and so many more functions
- Using the wrong loss function
- Using the right loss function but with the wrong reduction (third sketch below)
- Using the right loss function, but the learning rate is too aggressive/too low, or the optimiser is not initialised properly
- Using all the right things, but loading the model from the wrong checkpoint/not saving the weights properly
And also learning that Google Colab was forged in Mount Doom, a tool of great power crafted with malicious intent.
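For concreteness, here's a minimal sketch of the next-token off-by-one (toy tensors, not from any of the experiments above): inputs and targets have to be shifted relative to each other by exactly one position, and the plausible-looking variants silently train on the wrong objective.

```python
import torch

tokens = torch.randint(0, 100, (4, 16))  # toy batch: (batch, seq_len), vocab of 100

# Correct: each input position predicts the token one step ahead
inputs = tokens[:, :-1]   # positions 0 .. n-2
targets = tokens[:, 1:]   # positions 1 .. n-1

# Plausible-looking bugs:
# inputs, targets = tokens, tokens                 # no shift: learns to copy
# inputs, targets = tokens[:, :-2], tokens[:, 2:]  # off-by-two: skips a token

assert inputs.shape == targets.shape
```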
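And a sketch of the state-resetting bug (hypothetical module, assuming an nn.RNNCell, not my actual code): re-initialising `h` inside the time loop throws away all recurrence, leaving a per-step MLP.

```python
import torch
import torch.nn as nn

class TinyRNN(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.cell = nn.RNNCell(dim, dim)

    def forward(self, x):                      # x: (batch, seq, dim)
        h = x.new_zeros(x.size(0), x.size(2))  # initialise ONCE, before the loop
        outs = []
        for t in range(x.size(1)):
            # Bug version: h = x.new_zeros(...) here resets the state every
            # step, so the "RNN" never sees its own history
            h = self.cell(x[:, t], h)
            outs.append(h)
        return torch.stack(outs, dim=1)

print(TinyRNN(8)(torch.randn(2, 5, 8)).shape)  # torch.Size([2, 5, 8])
```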
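Finally, the reduction footgun: with reduction="sum" the loss (and hence the gradients) scales with batch size, so a learning rate tuned for the default "mean" is suddenly far too hot.

```python
import torch
import torch.nn as nn

logits = torch.randn(32, 100)
labels = torch.randint(0, 100, (32,))

loss_mean = nn.CrossEntropyLoss(reduction="mean")(logits, labels)
loss_sum = nn.CrossEntropyLoss(reduction="sum")(logits, labels)
print(loss_mean.item(), loss_sum.item())  # sum is ~32x larger here
```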
Are you using einops and einsum? I've hated these somewhat less since I started using them. See here for more details.
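Roughly what they buy you (illustrative sketch; einops is a separate pip install, torch.einsum is built in):

```python
import torch
from einops import rearrange

x = torch.randn(2, 3, 4, 5)  # e.g. (batch, heads, seq, head_dim)

# rearrange documents the shape change in the pattern string itself,
# replacing an opaque chain of permute/reshape calls:
y = rearrange(x, "b h s d -> b s (h d)")
print(y.shape)  # torch.Size([2, 4, 15])

# einsum makes the contraction explicit, e.g. attention scores:
q, k = torch.randn(2, 3, 4, 5), torch.randn(2, 3, 4, 5)
scores = torch.einsum("bhqd,bhkd->bhqk", q, k)
print(scores.shape)  # torch.Size([2, 3, 4, 4])
```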