One of the big surprises of the deep learning revolution has been the universality of gradient descent optimization.
How large is the class of optimization problems that we can transform into a gradient descent problem of some kind? My suspicion is that it’s a very large class; perhaps there is even a general way to transform any problem into a gradient descent optimization problem?
The natural thing that comes to mind is to consider gradient descent on Lagrangian energy functionals in (optimal) control theory.
I’ll conjecture the following in a VERY SPECULATIVE, inflammatory, riff-on-vibes statement:
Gradient descent solves problems in the complexity class P[1]. It is P-complete.
Learning theory (and complexity theory) have for decades been pushing two analogous bad narratives about the weakness of gradient descent (and P).
These narratives dominate because it is easy to prove impossibility results like “Problem X can’t be solved by gradient descent” (or “Problem Y is NP-Hard”). It’s academically fecund—it’s a subject aspiring academics can write a lot of papers about. Results about what gradient descent (and polynomial time) can’t do compose a fair portion of the academic canon.
In practice, these impossibility results are corner cases that don’t actually come up. The “vibes” of these impossibility results run counter to the “vibes” of reality.
For example, gradient descent solves most problems, even though in theory it can get trapped in local minima. (SAT is in practice fast to solve, even though it’s theoretical computer science’s canonical Hard-Problem-You-Say-Is-Impossible-To-Solve-Quickly.)
The vibe of reality is “local (greedy) algorithms usually work”
Stoner-vibes-based reasoning: I’m guessing you can reduce a problem like Horn Satisfiability[2] to gradient descent. Horn Satisfiability is a P-complete problem—you can transform any polynomial-time decision problem into a Horn Satisfiability problem using a log-space transformation. Therefore, gradient descent is “at least as big as P” (P-hard). And I’m guessing you can place your formalization of gradient descent in P as well (hence “P-complete”). That would mean gradient descent would not be able to solve harder problems in, e.g., NP unless P=NP.
Horn Satisfiability is about finding true/false values that satisfy a bunch of logic clauses of the form a∧c∧d→e or b∧e→⊥ (that second clause means “don’t set both b and e to true—at least one of them has to be false”). In the algorithm for solving it, you figure out a variable that must be set to true or false, then propagate that information forward to the other clauses. I bet you can do this with a loss function, turning it into a greedy search on a hypercube.
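Here’s a rough sketch of that bet in Python (the clause encoding, the `loss` function, and the variable names are my own illustrative choices, not anything standard). The loss counts violated clauses; starting at the all-false corner of the hypercube and only flipping a variable to true when a violated definite clause forces it is exactly unit propagation, reframed as coordinate-wise descent:

```python
# Sketch: Horn-SAT as greedy search on the Boolean hypercube.
# A clause is (body, head): a frozenset of variables implying head;
# head=None encodes the "goal" clause b∧e→⊥.

def loss(assignment, clauses):
    """Number of clauses whose body is all true but whose head is not."""
    return sum(
        1
        for body, head in clauses
        if all(assignment[v] for v in body)
        and (head is None or not assignment[head])
    )

def horn_sat(variables, clauses):
    x = {v: False for v in variables}  # start at the all-false corner
    changed = True
    while changed:
        changed = False
        for body, head in clauses:
            # A violated definite clause forces its head true:
            # flipping that one coordinate satisfies the clause.
            if head is not None and all(x[v] for v in body) and not x[head]:
                x[head] = True
                changed = True
    # At the fixed point, satisfiable iff no clause is still violated
    # (only goal clauses b∧e→⊥ can remain violated).
    return loss(x, clauses) == 0, x

# Example: facts →a, a→b, and the goal b∧c→⊥ (satisfiable: c stays false).
clauses = [(frozenset(), "a"), (frozenset({"a"}), "b"),
           (frozenset({"b", "c"}), None)]
sat, model = horn_sat(["a", "b", "c"], clauses)
```

Note the moves are monotone—variables only ever flip false→true—so the search walks up the hypercube lattice and terminates after at most n flips, which is the usual argument that Horn-SAT is in P.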
I’m actually curious about a related problem.