I think Eliezer has often made the meta-observation you are making now: that simple logical inferences can take shockingly long to find in the space of possible inferences. I am reminded of him talking about how long backprop took.
In 1969, Marvin Minsky and Seymour Papert pointed out that Perceptrons couldn’t learn the XOR function because it wasn’t linearly separable. This killed off research in neural networks for the next ten years.
[...]
Then along came this brilliant idea, called “backpropagation”:
You handed the network a training input. The network classified it incorrectly. So you took the partial derivative of the output error (in layer N) with respect to each of the individual nodes in the preceding layer (N − 1). Then you could calculate the partial derivative of the output error with respect to any single weight or bias in the layer N − 1. And you could also go ahead and calculate the partial derivative of the output error with respect to each node in the layer N − 2. So you did layer N − 2, and then N − 3, and so on back to the input layer. (Though backprop nets usually had a grand total of 3 layers.) Then you just nudged the whole network a delta—that is, nudged each weight or bias by delta times its partial derivative with respect to the output error.
It says a lot about the nonobvious difficulty of doing math that it took years to come up with this algorithm.
I find it difficult to put into words just how obvious this is in retrospect. You’re just taking a system whose behavior is a differentiable function of continuous parameters, and sliding the whole thing down the slope of the error function. There are much more clever ways to train neural nets, taking into account more than the first derivative, e.g. conjugate gradient optimization, and these take some effort to understand even if you know calculus. But backpropagation is ridiculously simple. Take the network, take the partial derivative of the error function with respect to each weight in the network, slide it down the slope.
If I didn’t know the history of connectionism, and I didn’t know scientific history in general—if I had needed to guess without benefit of hindsight how long it ought to take to go from Perceptrons to backpropagation—then I would probably say something like: “Maybe a couple of hours? Lower bound, five minutes—upper bound, three days.”
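For concreteness, here is a minimal sketch of the procedure the quote describes: a tiny three-layer network (input, hidden, output) trained on XOR, the very function a single-layer Perceptron cannot represent, by sliding every weight down the slope of the squared error. The network size, learning rate, and random seed below are arbitrary illustrative choices, not anything specified in the quoted post.

```python
# Minimal backprop sketch: a 2-4-1 sigmoid network trained on XOR by plain
# gradient descent on the squared error. All hyperparameters are arbitrary.
import numpy as np

rng = np.random.default_rng(0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # inputs
y = np.array([[0], [1], [1], [0]], dtype=float)              # XOR targets

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer of 4 units, one output unit (a "grand total of 3 layers").
W1 = rng.normal(size=(2, 4))
b1 = np.zeros((1, 4))
W2 = rng.normal(size=(4, 1))
b2 = np.zeros((1, 1))
lr = 2.0  # how hard each parameter gets nudged down its slope

for _ in range(20_000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)    # hidden-layer activations
    out = sigmoid(h @ W2 + b2)  # network output

    # Backward pass: chain rule, from the output layer back towards the input.
    d_out = (out - y) * out * (1 - out)  # dE/d(pre-activation), output layer
    d_h = (d_out @ W2.T) * h * (1 - h)   # dE/d(pre-activation), hidden layer

    # Nudge each weight and bias by lr times its partial derivative of the error.
    W2 -= lr * (h.T @ d_out)
    b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * (X.T @ d_h)
    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), 2))
# Should end up close to [[0], [1], [1], [0]] once training has converged.
```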
But at the same time humans are able to construct intricate logical artifacts like the general number field sieve, which seems to require many more steps, each of much longer inferential distance, and each step could only have been made by one of the small number of specialists in number theory or algebraic number theory who were available and thinking about factoring algorithms at the time. (Unlike the step in the OP, which seemingly anyone could have made.)
Can you make sense of this?
Here’s a crack at it:
The space of possible inferential steps is very high-dimensional, most steps are difficult, and there’s no known way to strongly bias your policy towards making simple-but-useful steps. Human specialists, therefore, could at best pick a rough direction that leads to accomplishing some goal they have, and then attempt random steps roughly pointed in that direction. Most of those random steps are difficult. A human succeeds if the step’s difficulty is below some threshold, and fails and goes back to square one otherwise. Over time, this results in a biased-random-walk process that stumbles upon a useful application once in a while. If one then looks back, one often sees a sequence of very difficult steps that led to this application (with a bias towards steps at the very upper end of what humans can tackle).
In other words: the space of steps is more high-dimensional than human specialists are numerous, and our motion through it is fairly random. Pick some state of human knowledge and consider all the directions in which anyone has ever attempted to move from that state: that still wouldn’t amount to a comprehensive map of the state’s neighbourhood. There’s therefore no reason to expect that all the low-hanging fruit has been picked, because locating a low-hanging fruit is often harder than picking some high-hanging one.
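A toy simulation of this picture, just to make the hindsight claim concrete. Everything in it (the difficulty distribution, the threshold, the chain length) is an arbitrary choice of mine rather than anything from the comment above: step difficulties are skewed towards "hard", a researcher completes a step only if its difficulty is below their threshold and otherwise goes back to square one, and we then look at the difficulty profile of the steps on the chains that succeeded.

```python
# Toy model of the biased random walk described above. All parameters are
# arbitrary illustrative choices.
import random

random.seed(0)
THRESHOLD = 0.5  # a human can only complete steps easier than this
CHAIN_LEN = 4    # a "useful application" needs this many consecutive steps

def attempt_chain():
    """Try CHAIN_LEN steps in a row; return their difficulties, or None on failure."""
    steps = []
    for _ in range(CHAIN_LEN):
        difficulty = random.random() ** 0.5  # skewed towards 1.0: most steps are hard
        if difficulty >= THRESHOLD:
            return None                      # too hard; back to square one
        steps.append(difficulty)
    return steps

chains = [c for c in (attempt_chain() for _ in range(200_000)) if c is not None]
completed_steps = [d for chain in chains for d in chain]

print(f"successful chains: {len(chains)} out of 200000 attempts")
print(f"mean difficulty of completed steps: "
      f"{sum(completed_steps) / len(completed_steps):.2f} (threshold {THRESHOLD})")
# In hindsight the successful chains are biased towards the upper end of what was
# achievable (mean difficulty ~2/3 of the threshold, vs. 1/2 for a uniform draw
# below it), even though easier steps existed and were simply never sampled.
```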
Generally agree, with the caveat that the difficulty of a step is generally somewhat dependent on some contingent properties of a given human mind.
At this point, I am not surprised by this sort of thing at all, only semi-ironically amused, though I’m not sure I can convey why it isn’t surprising to me (although I surely would have been surprised by it if somebody had made it salient to me some 5 or 10 years ago).
Perhaps I just got inoculated by reading about people making breakthroughs with concepts that are simple or obvious in hindsight, or by hearing ideas from people that I immediately thought were obviously relevant and valuable to have in one’s portfolio of models, even though for some reason I hadn’t had them until then, or at least they had been less salient to me than they should have been.
Anders Sandberg said that he had had all the pieces of the Grabby Aliens model on the table and had simply failed to think of an obvious way to put them together.
One frame (of unclear value) I have for this kind of thing is that the complexity/salience/ease-of-finding of an idea is different before and after it has been found because, well, a bunch of stuff in the mind is different.
A quick side note: in the 17 years since the post you cite was written, the historiography of connectionism has moved on, and we now know that modern backpropagation was invented as early as 1970 and first applied to neural nets in 1982 (technology transfer was much harder before web search!); see https://en.wikipedia.org/wiki/Backpropagation#Modern_backpropagation and the references therein.