I’ll conjecture the following as VERY SPECULATIVE, inflammatory, riff-on-vibes statements:
Gradient descent solves problems in the complexity class P[1]. It is P-complete.
Learning theory (and complexity theory) have for decades been pushing two analogous bad narratives about the weakness of gradient descent (and P).
These narratives dominate because it is easy to prove impossibility results like “Problem X can’t be solved by gradient descent” (or “Problem Y is NP-Hard”). It’s academically fecund—it’s a subject aspiring academics can write a lot of papers about. Results about what gradient descent (and polynomial time) can’t do compose a fair portion of the academic canon.
In practice, these impossibility results are corner cases that don’t actually come up. The “vibes” of these impossibility results run counter to the “vibes” of reality.
For example, gradient descent solves most problems, even though in theory it can get trapped in local minima. (Likewise, SAT is fast to solve in practice, even though in theory it’s theoretical computer science’s canonical Hard-Problem-You-Say-Is-Impossible-To-Solve-Quickly.)
The vibe of reality is “local (greedy) algorithms usually work”
Stoner-vibes-based reason: I’m guessing you can reduce a problem like Horn Satisfiability[2] to gradient descent. Horn Satisfiability is a P-complete problem—you can transform any polynomial-time decision problem into a Horn Satisfiability problem using a log-space transformation. Therefore, gradient descent is “at least as big as P” (P-hard). And I’m guessing you can put your formalization of gradient descent in P as well (hence “P-complete”). That would mean gradient descent would not be able to solve harder problems in, e.g., NP unless P=NP.
Horn Satisfiability is about finding true/false values that satisfy a bunch of logic clauses of the form a∧c∧d→e or b∧e→⊥ (that second clause means “don’t set both b and e to true—at least one of them has to be false”). In the algorithm for solving it, you figure out a variable that must be set to true, then propagate that information forward to other clauses. I bet you can do this with a loss function, turning it into a greedy search on a hypercube (I sketch both below).
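To make the propagation algorithm concrete, here’s a minimal Horn-SAT solver in Python. The clause representation ((body, head) pairs, with head=None standing for ⊥) and the name horn_sat are my own choices for illustration, not anything standard:

```python
# Minimal Horn-SAT via unit propagation. A clause (body, head) means
# "if every variable in body is true, head must be true"; head=None
# encodes body→⊥. Sketch of the textbook algorithm, with a made-up
# representation.

def horn_sat(clauses, variables):
    forced = set()                        # variables forced to be true
    changed = True
    while changed:                        # propagate to a fixed point
        changed = False
        for body, head in clauses:
            if body <= forced:            # every premise already forced true
                if head is None:
                    return None           # a body→⊥ clause fired: unsatisfiable
                if head not in forced:
                    forced.add(head)      # head is forced true; keep propagating
                    changed = True
    # everything not forced true can safely be set false
    return {v: (v in forced) for v in sorted(variables)}

# a∧c∧d→e, plus facts a, c, d, plus b∧e→⊥
clauses = [({"a", "c", "d"}, "e"),
           (set(), "a"), (set(), "c"), (set(), "d"),
           ({"b", "e"}, None)]
print(horn_sat(clauses, {"a", "b", "c", "d", "e"}))
# -> {'a': True, 'b': False, 'c': True, 'd': True, 'e': True}
```

Each pass either forces a new variable true or halts, so the whole thing runs in polynomial time; that’s the “Horn-SAT is in P” half of the story.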
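And here’s the hypercube version of the same idea: give each clause a 0/1 penalty, sum them into a loss, and greedily flip whichever single bit lowers the loss most. To be clear, this is coordinate-wise greedy descent rather than literal gradient descent, the loss is one I made up, and I’m not claiming it always tracks unit propagation; it’s just the vibe made executable.

```python
# Each Horn clause becomes a 0/1 penalty on the hypercube {0,1}^n:
# body→head costs 1 when the body is all-true but the head is false;
# body→⊥ costs 1 when the body is all-true. Loss 0 ⇔ satisfying assignment.

def loss(x, clauses):
    total = 0
    for body, head in clauses:
        fired = all(x[v] for v in body)           # empty body fires vacuously
        if head is None:
            total += int(fired)                   # body→⊥ violated when fired
        else:
            total += int(fired and not x[head])   # body→head violated
    return total

def greedy_descent(clauses, variables):
    x = {v: False for v in variables}             # start at the all-false corner
    while True:
        current = loss(x, clauses)
        if current == 0:
            return x                              # satisfying assignment found
        best_var, best_loss = None, current
        for v in variables:                       # try every single-bit flip
            x[v] = not x[v]
            trial = loss(x, clauses)
            x[v] = not x[v]
            if trial < best_loss:
                best_var, best_loss = v, trial
        if best_var is None:
            return None                           # stuck in a local minimum
        x[best_var] = not x[best_var]             # take the steepest single flip

clauses = [({"a", "c", "d"}, "e"),
           (set(), "a"), (set(), "c"), (set(), "d"),
           ({"b", "e"}, None)]
print(greedy_descent(clauses, {"a", "b", "c", "d", "e"}))
```

On this instance the greedy flips replay unit propagation: the solver turns on a, c, and d (the facts), then e, never touches b, and lands at loss 0.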