For a 2-component weighted average model with a scalar output, the output should always be between the outputs of the two component models.
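As a minimal sketch of that claim (the function name and weighting scheme are just illustrative, not from the discussion above): a convex combination with weight `w` in [0, 1] can never leave the interval spanned by the two components' outputs.

```python
import random

# Hypothetical 2-component weighted average: output = w*f1 + (1-w)*f2, 0 <= w <= 1.
def weighted_average(f1_out, f2_out, w):
    return w * f1_out + (1 - w) * f2_out

# Spot-check the bound on random scalar outputs and weights.
random.seed(0)
for _ in range(1000):
    a, b = random.uniform(-10, 10), random.uniform(-10, 10)
    w = random.random()
    out = weighted_average(a, b, w)
    assert min(a, b) <= out <= max(a, b)
```

Note the bound only holds because the weights are nonnegative and sum to 1; an unconstrained linear combination can overshoot either component.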

Hm, I see your point. I retract my earlier claim; this model wouldn’t apply to that task. I’m struggling to generate a concrete example where the loss would actually be a linear combination of the sub-models’ losses. However, I (tentatively) conjecture that in large networks trained on complex tasks, the loss can be roughly approximated as a linear combination of the losses of subnetworks, with the caveats of weird correlations and of tasks where partial combinations work well (like the function approximation above).
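One reason a concrete example is hard to construct, sketched here under the assumption of a squared-error loss on a weighted-average model (my choice of loss, not something stated above): since MSE is convex, the loss of the combined model is at most the weighted average of the component losses, with a strict gap whenever the components disagree, so exact linearity fails in general.

```python
import random

# Squared-error loss for a scalar prediction.
def mse(pred, target):
    return (pred - target) ** 2

random.seed(0)
w = 0.5
gap_seen = False
for _ in range(100):
    f1, f2, y = (random.uniform(-5, 5) for _ in range(3))
    combined = mse(w * f1 + (1 - w) * f2, y)            # loss of the mixed model
    linear = w * mse(f1, y) + (1 - w) * mse(f2, y)      # linear combination of losses
    # Jensen's inequality: the mixture's loss never exceeds the linear combination.
    assert combined <= linear + 1e-12
    if linear - combined > 1e-6:
        gap_seen = True
assert gap_seen  # the gap is strict whenever f1 != f2
```

The gap works out to w·(1−w)·(f1−f2)², so the linear approximation is good exactly when the subnetworks nearly agree, which fits the caveat about tasks where partial combinations already work well.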

I would expect under normal circumstances that gradient descent would also be optimizing the parameters within that part or those layers.

I agree, but the question of in what direction SGD changes the model (i.e. how it changes f) seems to have some recursive element analogous to the situation above. If the model is really close to the f above, then I would imagine there’s some optimization pressure to update it towards f. That’s just a hunch, though. I don’t know how close it would have to be.
