Bayes is optimal if you throw all your knowledge into the equation. That’s practically infeasible, so this program throws away most of the data first, and then applies a heuristic to the remaining data. There’s no guarantee that applying Bayes’ Rule directly to the remaining data will outperform another heuristic, just that both would be outperformed by running the ideal version of Bayes (and including everything we know about grammar, among other missing data).
I don’t follow. The heuristic isn’t using any of the thrown-away data; it’s just using the same data they used Bayes on. That is to say, someone who had only the information that Norvig actually used could apply Bayes using all their knowledge, or they could use this heuristic with all their knowledge, and the heuristic would come out better.
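For concreteness, here is a minimal sketch of the two decision rules being compared, assuming Norvig-style candidate generation and a crude error model in which P(w|c) decays geometrically with edit distance. The word counts and error probability are hypothetical, exaggerated only to show where the two rules can diverge:

```python
from collections import Counter

# Toy stand-in for Norvig's word-frequency counts; the numbers are
# hypothetical, chosen to make the two rules disagree.
WORD_COUNTS = Counter({"the": 800000, "they": 50000, "thaw": 3})
TOTAL = sum(WORD_COUNTS.values())

def P(word):
    """Prior P(c): relative frequency of candidate word c."""
    return WORD_COUNTS[word] / TOTAL

def edit_distance(a, b):
    """Levenshtein distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        prev = cur
    return prev[-1]

def bayes_correct(word, candidates, error_prob=0.01):
    # "Apply Bayes to the remaining data": argmax P(c) * P(w|c),
    # with the crude error model P(w|c) ~ error_prob ** edit_distance(w, c).
    return max(candidates,
               key=lambda c: P(c) * error_prob ** edit_distance(word, c))

def heuristic_correct(word, candidates):
    # The heuristic: smallest edit distance wins outright;
    # frequency only breaks ties among equally close candidates.
    return min(candidates, key=lambda c: (edit_distance(word, c), -P(c)))

candidates = ["thaw", "the", "they"]
print(bayes_correct("thaw", candidates))      # the huge prior swamps the exact match
print(heuristic_correct("thaw", candidates))  # distance 0 wins outright
```

Both rules see exactly the same frequency table; they differ only in how much weight the error model gives the prior. The heuristic effectively hard-codes the background knowledge that an exact dictionary match is almost never a typo, which the crude P(w|c) fails to capture.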
This could possibly be explained if the heuristic also embodied some background information that allows us to correct for overconfidence in P(f|e) that isn’t explicitly mentioned in the data, as suggested by DanielVarga, or if the heuristic was effectively performing an expected utility calculation, as suggested in my other comment.
This could possibly be explained if the heuristic also embodied some background information that allows us to correct for overconfidence in P(f|e) that isn’t explicitly mentioned in the data
Exactly. If there’s some structure to the full dataset that’s unrecoverable from the part that’s kept, you can code that structure into a heuristic which will outperform Naive Bayes on the remaining data, but an ideal Bayesian reasoner with access to the full dataset would have picked up that structure as well, and you wouldn’t be outperforming xer.
So the post is evidence of interesting structure in word frequencies, not a deficiency of Bayes’ Rule.