Meta point: this is one of those insights which is very likely to hit you over the head if you’re doing practical technical work with probabilistic models, but not if you’re just using them for small semi-intuitive problems (a use-case we often see on LW).
I remember the first time I wrote a mixture-of-Gaussians clustering model and saw it spitting out probabilities like 10^-5000; I thought it must be a bug. It wasn’t a bug. Probabilities naturally live on a log scale, and those sorts of numbers are normal once we move away from artificially-low-dimensional textbook problems and start working with more realistic high-dimensional systems. When your data channel has a capacity of kilobytes or megabytes per data point, even if 99% of that information is irrelevant, that’s still a lot of bits; the probabilities get exponentially small very quickly.
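A minimal sketch of why this happens (not the original model, just an illustrative standard multivariate normal): even at a perfectly typical point, the density of a 1000-dimensional Gaussian comes out around 10^-600, which is why the computation has to be done in log space.

```python
import math
import random

random.seed(0)

def log_density_std_normal(x):
    """Log-density of a standard multivariate normal at x.
    Working in log space avoids floating-point underflow."""
    d = len(x)
    return -0.5 * sum(xi * xi for xi in x) - 0.5 * d * math.log(2 * math.pi)

# A typical sample from a 1000-dimensional standard normal.
x = [random.gauss(0, 1) for _ in range(1000)]
logp = log_density_std_normal(x)

print(logp)                 # roughly -1400 nats
print(logp / math.log(10))  # i.e. a density on the order of 10^-600
```

Exponentiating that log-density directly would underflow to 0.0 in double precision (which bottoms out around 10^-308), so the "bug-looking" tiny numbers are just what high-dimensional densities are.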
Tying back to an example in the post: if we’re using 7-bit ASCII encoding, then the string “Mark Xu” takes up 49 bits. It’s quite compressible, but that still leaves more than enough room for 24 bits of evidence to be completely reasonable.
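The arithmetic here is quick to check, and it’s worth seeing how large a likelihood ratio 24 bits actually is (the 1-in-a-million prior below is just an illustrative number, not from the post):

```python
s = "Mark Xu"
bits_ascii = len(s) * 7   # 7 characters at 7 bits each in plain ASCII
print(bits_ascii)         # 49

# 24 bits of evidence means a likelihood ratio of 2^24.
ratio = 2 ** 24
print(ratio)              # 16777216

# Enough to take a hypothetical 1-in-a-million prior up past 90%:
prior_odds = 1 / 1_000_000
posterior_odds = prior_odds * ratio
posterior_prob = posterior_odds / (1 + posterior_odds)
print(posterior_prob)     # ~0.94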
This paper suggests that spoken language is consistently ~39 bits/second.
https://advances.sciencemag.org/content/5/9/eaaw2594
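Taking that 39 bits/second figure at face value, even a single minute of speech carries enough information that the prior probability of any one specific utterance is astronomically small:

```python
import math

bits_per_second = 39          # figure from the linked paper
seconds = 60
total_bits = bits_per_second * seconds
print(total_bits)             # 2340 bits in one minute of speech

# Prior probability of any one specific minute-long utterance,
# on a maximally naive uniform model over bit strings:
log10_p = -total_bits * math.log10(2)
print(log10_p)                # ~ -704, i.e. around 10^-704
```

So the 10^-5000-style numbers above aren’t exotic; a few minutes of ordinary conversation gets you there.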