Oam Patel comments on [missing post]

Oam Patel 28 Jun 2022 19:07 UTC
1 point
For a 2-component weighted average model with a scalar output, the output should always be between the outputs of each component model.
Hm, I see your point, and I retract my earlier claim: this model wouldn’t apply to that task. I’m struggling to come up with a concrete example where the loss would actually be a linear combination of the sub-models’ losses. However, I (tentatively) conjecture that in large networks trained on complex tasks, the loss can be roughly approximated as a linear combination of the losses of subnetworks. The caveats are weird correlations and tasks where partial combinations work well, like the function approximation above.
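A rough numerical sketch of both points, not taken from the original exchange: it assumes two hypothetical sub-models whose scalar outputs are the target plus independent noise, a fixed weight `w`, and mean squared error as the loss. All names (`weighted_average`, `mse`, `y1`, `y2`) are made up for illustration. It checks that the combined output lies between the two component outputs, and that the combined loss is not the same as the weighted average of the component losses in this setup.

```python
import numpy as np


def weighted_average(y1, y2, w):
    """Output of a 2-component weighted-average model, with w in [0, 1]."""
    return w * y1 + (1.0 - w) * y2


def mse(pred, target):
    """Mean squared error (an assumed choice of loss for this sketch)."""
    return float(np.mean((pred - target) ** 2))


rng = np.random.default_rng(0)
target = rng.normal(size=1000)

# Hypothetical sub-models: each predicts the target plus independent noise.
y1 = target + rng.normal(scale=1.0, size=1000)
y2 = target + rng.normal(scale=1.0, size=1000)
w = 0.5

combined = weighted_average(y1, y2, w)

# Point 1: the combined scalar output sits between the component outputs.
lower = np.minimum(y1, y2)
upper = np.maximum(y1, y2)
output_is_between = bool(
    np.all((combined >= lower - 1e-12) & (combined <= upper + 1e-12))
)

# Point 2: the combined loss versus the weighted average of component losses.
combined_loss = mse(combined, target)
linear_loss = w * mse(y1, target) + (1.0 - w) * mse(y2, target)

print("output between components:", output_is_between)
print("combined loss:", combined_loss)
print("weighted average of losses:", linear_loss)
```

Because squared error is convex, the combined loss can never exceed the weighted average of the component losses, and with independent errors it is strictly smaller. This is the kind of task where a partial combination works well, so the linear approximation of the loss fails there.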
I would expect under normal circumstances that gradient descent would also be optimizing the parameters within that part or those layers.
I agree, but the question of in what direction SGD changes the model (i.e. how it changes $f$) seems to have some recursive element analogous to the situation above. If the model is really close to the $f$ above, then I would imagine there’s some optimization pressure to update it towards $f$. That’s just a hunch, though. I don’t know how close it would have to be.