Re: uniform convergence bounds, you say:
“I’m confused: aren’t properties of h different from properties of the data?”
Yeah, good question. I think the word “data-dependent” has different connotations (even if it is standard terminology).
Using the sketch definition:
“With high probability over possible training sets S, for all h in the hypothesis class, we have |expected test error of hypothesis h − empirical error of h on S| ≤ (some bound involving the size of the training data and high-level properties of h).”[2]
You’re right that properties of h are, in general, different from properties of the data. The “data-dependent” part enters this inequality when the right-hand side depends on properties of the learned hypothesis, which in turn depend on the training data S you sampled. In classical bounds, the RHS depends only on properties of the class H (VC dimension, Rademacher complexity of the whole class), not on any particular h. Those give the same number for every S. Meanwhile, the spectral-norm bounds described in that section of the post will depend on the weights of the learned network (and are, as a rule, higher on memorizing solutions than generalizing ones).
(Of course, a sufficiently nitpicky person might argue that the data-dependent bounds are uniform-convergence bounds over an implicit, S-indexed sub-class: “all h′ with ‖W′‖_spec ≤ ‖W(S)‖_spec”. But given this sub-class is S-indexed, I think it’s still fair to call the bound data-dependent.)
I think this is a reasonable confusion, and I’ll expand the footnote to clarify.
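As a rough illustration of the distinction (a toy sketch, not the bound from the post): the snippet below compares a quantity fixed by the architecture with one computed from the weights of a particular network. The function names `spectral_complexity` and `parameter_count`, the two weight lists, and the use of parameter count as a stand-in for a class-level property are all assumptions made for this example. The product of layer spectral norms is only the weight-dependent ingredient of such bounds, not a full bound.

```python
import numpy as np

def spectral_complexity(weights):
    """Product of layer spectral norms (largest singular value of each matrix)."""
    return float(np.prod([np.linalg.norm(W, ord=2) for W in weights]))

def parameter_count(weights):
    """Class-level stand-in (an assumption): fixed by the architecture alone."""
    return sum(W.size for W in weights)

# Two hypothetical "learned" networks with the same architecture,
# i.e. two different h drawn from the same class H.
net_a = [np.diag([1.0, 0.5]), np.diag([2.0, 1.0])]
net_b = [np.diag([3.0, 0.5]), np.diag([4.0, 1.0])]

# Same class-level number for both networks.
print(parameter_count(net_a), parameter_count(net_b))

# Different hypothesis-level numbers: about 2.0 versus about 12.0.
print(spectral_complexity(net_a), spectral_complexity(net_b))
```

Which weights you end up with depends on the sampled training set S, so any bound built from the second quantity changes with S, while one built only from the first does not.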
So is that h not part of the universal quantification over h in H?
Oh, I currently think what’s going on is that it’s a hypothesis-dependent bound that you then apply to the hypothesis learned from the data.
Yep
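To spell out the reading the thread settles on, here is a sketch of the two kinds of bound side by side. The notation is assumed for illustration and is not taken from the post: L(h) is the expected test error, \hat{L}_S(h) the empirical error on S, n = |S|, c(h) a hypothesis-level property such as a spectral-norm product, comp(H) a class-level complexity, and A(S) the hypothesis learned from S.

```latex
% Class-level bound: the right-hand side is the same for every h and every S.
\Pr_{S}\Big[\,\forall h \in H:\ \big|L(h) - \hat{L}_S(h)\big|
    \le \varepsilon\big(n, \mathrm{comp}(H), \delta\big)\Big] \ge 1 - \delta

% Hypothesis-dependent bound: still universally quantified over h in H,
% but the right-hand side now varies with h through c(h).
\Pr_{S}\Big[\,\forall h \in H:\ \big|L(h) - \hat{L}_S(h)\big|
    \le \varepsilon\big(n, c(h), \delta\big)\Big] \ge 1 - \delta

% Because the second statement holds for all h simultaneously, it holds in
% particular for the learned hypothesis h = A(S), with probability >= 1 - delta:
\big|L(A(S)) - \hat{L}_S(A(S))\big| \le \varepsilon\big(n, c(A(S)), \delta\big)
```

So h stays inside the universal quantifier. The dependence on the data comes only from plugging in h = A(S), which makes the right-hand side a function of S through c(A(S)).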