One very important observation related to this issue is the fact that we often observe specific cognitive deficits (e.g. people who can’t use nouns) but those specific deficits are almost always related to a brain trauma (stroke, etc.) If there were significant cognitive logic coded into the genome, we should see specific cognitive deficits in otherwise healthy young people caused by mutations.
I’m not sure exactly what you mean, but I’ll guess you mean “how do you deal with the problem that there are an infinite number of tests for randomness that you could apply?”
I don’t have a principled answer. My practical answer is just to use good intuition and/or taste to define a nice suite of tests, and then let the algorithm find the ones that show the biggest randomness deficiencies. There’s probably a better way to do this with differentiable programming—I finished my Phd in 2010, before the deep learning revolution.
In my Phd thesis I explored an extension of the compression/modeling equivalence that’s motivated by Algorithmic Information Theory. AIT says that if you have a “perfect” model of a data set, then the bitstream created by encoding the data using the model will be completely random. Every statistical test for randomness applied to the bitstream will return the expected value. For example, the proportion of 1s should be 0.5, the proportion of 1s following the prefix 010 should be 0.5, etc etc. Conversely, if you find a “randomness deficiency”, you have found a shortcoming of your model. And it turns out you can use this info to create an improved model.
That gives us an alternative conceptual approach to modeling/optimization. Instead of maximizing a log-likelihood, take an initial model, encode the dataset, and then search the resulting bitstream for randomness deficiencies. This is very powerful because there is an infinite number of randomness tests that you can apply. Once you find a randomness deficiency, you can use it to create an improved model, and repeat the process until the bitstream appears completely random.
The key trick that made the idea practical is that you can use “pits” instead of bits. Bits are tricky, because as your model gets better, the number of bits goes down—that’s the whole point—so the relationship between bits and the original data samples gets murky. A “pit” is a [0,1) value calculated by applying the Probability Integral Transform to the data samples using the model. The same randomness requirements hold for the pitstream as for the bitstream, and there are always as many pits as data samples. So now you can define randomness tests based on intuitive contexts functions, like “how many pits are in the [0.2,0.4] interval when the previous word in the original text was a noun?”
Cool concepts! What tech stack did you use? Was it painful to get the Facebook API working?
Not a stupid question, this issue is actually addressed in the essay, in the section about interior modeling vs unsupervised learning. The latter is very vague and general, while the former is much more specific and also intrinsically difficult. The difficulty and preciseness of the objective make it much better as a goal for a research community.
I started this essay last year, and procrastinated on completing it for a long time, until recently the GPT-3 announcement gave me the motivation to finish it up.
If you are familiar with my book, you will notice some of the same ideas, expressed with different emphasis. I congratulate myself a bit on predicting some of the key aspects of the GPT-3 breakthrough (data annotation doesn’t scale; instead learn highly complex interior models from raw data).
I would appreciate constructive feedback and signal-boosting.
I would add two ideas:
Try to find a good role model—someone who is similar to you in relevant respects, is a couple of years ahead of you, who has done something you think is awesome, and who you can talk to and observe to some extent. Bill Gates is probably not a good role model.
Try to form a realistic assessment of how important college actually is; people often err in imagining it to be more or less important than it is in reality (these errors seem to be correlated with social class). I would estimate that the 4 years of college are only modestly more important than other years of your life. What you do right after college is important. What you do when you’re in your late 20s is important.
Holden is a smart guy, but he’s also operating under a severe set of political constraints, since his organization depends so strongly on its ability to raise funds. So we shouldn’t make too much of the fact that he thinks academia is pretty good—obviously he’s going to say that.
Interesting analysis. I hadn’t heard of Goodman before so I appreciate the reference.
In my view the problem of induction has been almost entirely solved by the ideas from the literature on statistical learning, such as VC theory, MDL, Solomonoff induction, and PAC learning. You might disagree, but you should probably talk about why those ideas prove insufficient in your view if you want to convince people (especially if your audience is up-to-date on ML).
One particularly glaring limitation with Goodman’s argument is that it depends on natural language predicates (“green”, “grue”, etc). Natural language is terribly ambiguous and imprecise, which makes it hard to evaluate philosophical statements about natural language predicates. You’d be better off casting the discussion in terms of computer programs, that take a given set of input observations and produce an output prediction.
Of course you could write “green” and “grue” as computer functions, but it would be immediately obvious how much more contrived the program using “grue” is than the program using “green”.