My understanding is that the vanishing gradient problem has been largely mitigated by introducing skip connections (first in ResNet, now standard in CNN architectures), allowing networks with hundreds of layers.
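For concreteness, here is a minimal sketch of the kind of block I mean (assuming PyTorch; an identity shortcut in the style of the original ResNet):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: out = relu(x + f(x))."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        residual = self.relu(self.bn1(self.conv1(x)))
        residual = self.bn2(self.conv2(residual))
        # The identity skip means d(out)/dx contains an identity term,
        # so the gradient has a direct path around the conv layers no
        # matter how small their own gradients get.
        return self.relu(x + residual)
```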
Does this actually solve the problem, or just mask it? To a first approximation, skip connections turn the network into a collection of shallow networks running in parallel with deeper ones. If the shallow portions end up training faster and out-competing the deeper portions, hasn't the vanishing gradient problem just been hidden rather than fixed?
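To spell out the "parallel shallow networks" picture: writing each residual block as $x \mapsto x + f_i(x)$, two stacked blocks compute

$$x + f_1(x) + f_2\big(x + f_1(x)\big),$$

which unrolls into a path that skips both blocks, paths through exactly one block, and a path through both. With $n$ blocks there are on the order of $2^n$ such paths, most of them short; if I remember right, this is the "unraveled view" from Veit et al. (2016), "Residual Networks Behave Like Ensembles of Relatively Shallow Networks".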