bzip2 is known to be both slow and not too great at compression; what does lzma-2 (faster & smaller) get you on Wikipedia?
(Also, I would expect redirects to play in a compression algorithm’s favor compared to natural language. A redirect almost always takes the stereotypical form #REDIRECT[[foo]] or #redirect[[foo]]. It would have difficulty compressing the target, frequently a proper name, but the other 13 characters? Pure gravy.)
Here are the numbers for a pre-LZMA2 version of 7zip. It looks like LZMA is 2.0 bits per byte, while some other option is 1.7 bits per byte.
Yes, I would expect wikipedia to compress more than text, but it doesn’t seem to be so. This is just for the first 100MB. At a gig, all compression programs do dramatically better, even off-the-shelf ones that shouldn’t window that far. Maybe there is a lot of random vandalism early in the alphabet?
Well, early on there are many weirdly titled pages, and I could imagine that the first 100MB includes all the ’1958 in British Tennis’-style year articles. But intuitively that doesn’t feel like enough to cause bad results.
Nor have any of the articles or theses I’ve read on vandalism detection noted any unusual distributions of vandalism; further, obvious vandalism like gibberish/high-entropy-strings are the very least long-lived forms of vandalism—long-lived vandalism looks plausible & correct, and indistinguishable from normal English even to native speakers (much less a compression algorithm).
A window really does sound like the best explanation, until someone tries out 100MB chunks from other areas of Wikipedia and finds they compress comparably to 1GB.
bzip’s window is 900k, yet it compresses 100MB to 29% but 1GB to 25%. Increasing the memory on 7zip’s PPM makes a larger difference on 1GB than 100MB, so maybe it’s the window that’s relevant there, but it doesn’t seem very plausible to me. (18.5% → 17.8% vs 21.3% → 21.1%)
Sporting lists might compress badly, especially if they contain times, but this one seems to compress well.
bzip2 is known to be both slow and not too great at compression; what does lzma-2 (faster & smaller) get you on Wikipedia?
(Also, I would expect redirects to play in a compression algorithm’s favor compared to natural language. A redirect almost always takes the stereotypical form
#REDIRECT[[foo]]
or#redirect[[foo]]
. It would have difficulty compressing the target, frequently a proper name, but the other 13 characters? Pure gravy.)Here are the numbers for a pre-LZMA2 version of 7zip. It looks like LZMA is 2.0 bits per byte, while some other option is 1.7 bits per byte.
Yes, I would expect wikipedia to compress more than text, but it doesn’t seem to be so. This is just for the first 100MB. At a gig, all compression programs do dramatically better, even off-the-shelf ones that shouldn’t window that far. Maybe there is a lot of random vandalism early in the alphabet?
Well, early on there are many weirdly titled pages, and I could imagine that the first 100MB includes all the ’1958 in British Tennis’-style year articles. But intuitively that doesn’t feel like enough to cause bad results.
Nor have any of the articles or theses I’ve read on vandalism detection noted any unusual distributions of vandalism; further, obvious vandalism like gibberish/high-entropy-strings are the very least long-lived forms of vandalism—long-lived vandalism looks plausible & correct, and indistinguishable from normal English even to native speakers (much less a compression algorithm).
A window really does sound like the best explanation, until someone tries out 100MB chunks from other areas of Wikipedia and finds they compress comparably to 1GB.
bzip’s window is 900k, yet it compresses 100MB to 29% but 1GB to 25%. Increasing the memory on 7zip’s PPM makes a larger difference on 1GB than 100MB, so maybe it’s the window that’s relevant there, but it doesn’t seem very plausible to me. (18.5% → 17.8% vs 21.3% → 21.1%)
Sporting lists might compress badly, especially if they contain times, but this one seems to compress well.
That’s very odd. If you ever find out what is going on here, I’d appreciate knowing.