I think this is a totally fine length. But then, I would :P
I still feel like this was a little gears-light. Do the proposed examples of gradient hacking really work if you make a toy neural network with them? (Or does gradient descent find a way around the apparent local minimum?)
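One way to poke at this empirically: below is a minimal sketch of one toy version of the question, where a "gradient hacker" couples task performance to a protected parameter `m` through a trainable coupling strength `c`. The construction, names, and constants are all my own toy assumptions, not anything from the post; the point is just to watch whether plain gradient descent respects the apparent pin at `M0` or dismantles the coupling instead.

```python
# Toy gradient-hacking probe (hypothetical construction, not the post's):
#   L(w, m, c) = task(w) * (1 + c^2 * (m - M0)^2)
# If the coupling c were fixed, the penalty would pull m toward the
# "protected" value M0. But c is itself trainable, so gradient descent
# is free to shrink c and weaken the hack instead.

M0 = 3.0   # value the "hacker" tries to pin m to
LR = 0.05  # learning rate

def grads(w, m, c):
    task = (w - 1.0) ** 2                  # base objective: drive w -> 1
    pen = 1.0 + c**2 * (m - M0) ** 2       # hacker's self-imposed penalty
    dw = 2.0 * (w - 1.0) * pen             # d/dw of task * pen
    dm = task * c**2 * 2.0 * (m - M0)      # d/dm: pulls m toward M0
    dc = task * 2.0 * c * (m - M0) ** 2    # d/dc: always pushes c down
    return dw, dm, dc

w, m, c = 0.0, 2.0, 2.0                    # m starts off the protected value
for _ in range(2000):
    dw, dm, dc = grads(w, m, c)
    w, m, c = w - LR * dw, m - LR * dm, c - LR * dc

print(f"w={w:.3f}  m={m:.3f}  c={c:.3f}")
```

In this particular setup the gradient on `c` is always non-negative while the task loss is nonzero, so descent monotonically erodes the coupling rather than treating `m = M0` as a hard local minimum; how far `m` actually gets pulled before the task loss (and hence all the hack's gradients) collapses to zero depends on the relative rates. That race seems like the crux of whether the proposed examples "really work" in a real network.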