Scaling laws for neural language models show that cross-entropy loss follows predictable power-law trends with dataset size, model size, and training compute. Since lower cross-entropy corresponds to better next-token prediction, and prediction is closely linked to compression, we should expect improvements in language-model loss to translate into better lossless compression.
This raises a natural question:
Do LLMs exhibit clean power-law scaling in compression performance?
To test this, I use the Pythia model family, which provides models at multiple parameter scales and training checkpoints. Since all Pythia models are trained on the same dataset, The Pile, this gives a relatively controlled setting for comparing compression performance across scale and training progress.
Following Delétang et al. (2023), I use each model’s predicted next-token probabilities as the probability distribution for an arithmetic encoder, which is a near-optimal entropy coder. For the text experiments, I use the first 2048 chunks of enwik8, with each chunk containing 2048 bytes. I also apply the same pipeline to non-text modalities, using ImageNet-1k patches for images and LibriSpeech audio for speech.
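The link between prediction and compression can be made concrete with a minimal sketch. An arithmetic coder driven by a model spends roughly -log2(p) bits on a symbol to which the model assigned probability p, so the compressed size of a chunk is close to the sum of those costs. The function names, the 2-bit overhead allowance, and the toy probability lists below are assumptions for illustration, not the code used for the experiments.

```python
import math

CHUNK_SIZE = 2048  # bytes per chunk, as in the text experiments


def chunk_bytes(data: bytes, chunk_size: int = CHUNK_SIZE, max_chunks: int = 2048):
    """Split raw bytes into fixed-size chunks, dropping any short tail."""
    n_full = min(len(data) // chunk_size, max_chunks)
    return [data[i * chunk_size:(i + 1) * chunk_size] for i in range(n_full)]


def ideal_code_length_bits(symbol_probs):
    """Sum of -log2(p) over the probabilities the model gave to the
    symbols (tokens or bytes) that actually occurred."""
    return sum(-math.log2(p) for p in symbol_probs)


def compression_ratio(symbol_probs, raw_num_bytes, overhead_bits=2):
    """Compressed size divided by raw size; lower is better.

    overhead_bits is an assumed small constant cost for terminating
    the arithmetic code, not a measured value.
    """
    total_bits = ideal_code_length_bits(symbol_probs) + overhead_bits
    compressed_bytes = math.ceil(total_bits / 8)
    return compressed_bytes / raw_num_bytes


# Toy check 1: a uniform model over 256 byte values cannot compress.
uniform_probs = [1 / 256] * CHUNK_SIZE
ratio_uniform = compression_ratio(uniform_probs, CHUNK_SIZE)

# Toy check 2: a model that gives probability 0.5 to every observed byte
# spends one bit per byte.
confident_probs = [0.5] * CHUNK_SIZE
ratio_confident = compression_ratio(confident_probs, CHUNK_SIZE)
```

With a tokenized model such as Pythia, `symbol_probs` would hold one probability per token while `raw_num_bytes` stays the size of the chunk in bytes, so the ratio remains comparable across tokenizers and modalities.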
Below are the compression ratios obtained across model scales, training checkpoints, and data modalities. I fit a Kaplan-style power law of the form:
Text Compression
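| Model | 1k steps | 8k steps | 32k steps | 128k steps | 143k steps |
| --- | --- | --- | --- | --- | --- |
| pythia-70M | 0.223 | 0.176 | 0.170 | 0.173 | 0.175 |
| pythia-160M | 0.218 | 0.159 | 0.149 | 0.149 | 0.150 |
| pythia-410M | 0.223 | 0.148 | 0.136 | 0.129 | 0.128 |
| pythia-1B | 0.207 | 0.140 | 0.128 | 0.120 | 0.120 |
| pythia-1.4B | 0.207 | 0.137 | 0.124 | 0.115 | 0.115 |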
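As a rough illustration of what fitting a power law to these numbers involves, the sketch below fits the simplest possible form, ratio ≈ c · N^(-alpha), to the text compression ratios at the 143k checkpoint using a straight-line fit in log-log space. This form (with no irreducible offset and no dependence on training steps) is an assumption made for the example and may differ from the fit used in this post; the parameter counts are the nominal ones from the model names.

```python
import numpy as np

# Nominal parameter counts read off the model names (an assumption;
# exact counts differ slightly from these round figures).
params = np.array([70e6, 160e6, 410e6, 1.0e9, 1.4e9])

# Text compression ratios at the 143k checkpoint, from the values above.
ratio_143k = np.array([0.175, 0.150, 0.128, 0.120, 0.115])

# A pure power law ratio = c * N**(-alpha) is a straight line in log-log
# space: log(ratio) = log(c) - alpha * log(N).
slope, intercept = np.polyfit(np.log(params), np.log(ratio_143k), deg=1)
alpha = -slope
c = np.exp(intercept)


def predict_ratio(n_params):
    """Compression ratio predicted by the fitted pure power law."""
    return c * n_params ** (-alpha)


print(f"alpha = {alpha:.3f}, c = {c:.3f}")
for n, observed in zip(params, ratio_143k):
    print(f"N = {n:.1e}: observed {observed:.3f}, fitted {predict_ratio(n):.3f}")
```

A fit with an additive offset, or one that also models training steps, would need a nonlinear optimizer; the log-log line is only the quickest way to see whether the points are roughly consistent with a power law.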
Image Compression
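| Model | 1k steps | 8k steps | 32k steps | 128k steps | 143k steps |
| --- | --- | --- | --- | --- | --- |
| pythia-70M | 0.601 | 0.499 | 0.492 | 0.505 | 0.513 |
| pythia-160M | 0.615 | 0.483 | 0.471 | 0.482 | 0.492 |
| pythia-410M | 0.668 | 0.506 | 0.461 | 0.444 | 0.447 |
| pythia-1B | 0.601 | 0.470 | 0.456 | 0.436 | 0.440 |
| pythia-1.4B | 0.643 | 0.482 | 0.470 | 0.434 | 0.436 |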
Speech Results
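| Model | 1k steps | 8k steps | 32k steps | 128k steps | 143k steps |
| --- | --- | --- | --- | --- | --- |
| pythia-70M | 0.695 | 0.460 | 0.439 | 0.475 | 0.466 |
| pythia-160M | 0.678 | 0.440 | 0.430 | 0.433 | 0.456 |
| pythia-410M | 0.770 | 0.505 | 0.404 | 0.383 | 0.391 |
| pythia-1B | 0.677 | 0.424 | 0.444 | 0.376 | 0.384 |
| pythia-1.4B | 0.752 | 0.469 | 0.443 | 0.378 | 0.385 |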
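One simple way to compare how much each modality benefits from scale is the relative reduction in compression ratio between the smallest and largest model at the final checkpoint. The sketch below computes it from the 143k values reported above; the metric and the helper name `relative_reduction` are chosen here for illustration and are not part of the original analysis.

```python
# Compression ratios at the 143k checkpoint for the smallest and largest
# models, copied from the results above.
final_ratio = {
    "text": {"pythia-70M": 0.175, "pythia-1.4B": 0.115},
    "image": {"pythia-70M": 0.513, "pythia-1.4B": 0.436},
    "speech": {"pythia-70M": 0.466, "pythia-1.4B": 0.385},
}


def relative_reduction(small_model_ratio, large_model_ratio):
    """Fraction by which the compressed size shrinks going from the
    small model to the large one (0.0 means no gain)."""
    return (small_model_ratio - large_model_ratio) / small_model_ratio


reductions = {
    modality: relative_reduction(r["pythia-70M"], r["pythia-1.4B"])
    for modality, r in final_ratio.items()
}

for modality, value in reductions.items():
    print(f"{modality}: {value:.1%} smaller output with pythia-1.4B than with pythia-70M")
```

On these numbers the gain from scale is largest for text, with speech and image showing smaller but still positive gains.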
Combined Scaling Curves
The plot below shows the scaling curves across all three modalities. While the scaling trend is also present in the non-text modalities, compression there is not as strong as in text.
One possible explanation is that this behavior arises from two complementary mechanisms. First, in-context learning may allow the model to rapidly adapt to local structure within a sequence, such as repeating pixel patterns in images or periodic signals in audio. Second, pretraining may induce broad statistical priors over natural data, including Zipfian distributions, heavy tails, and long-range correlations that occur across many modalities. A useful direction for future work would be to disentangle these effects and quantify how their relative contributions evolve during pretraining.
Scaling Laws for LLM Based Data Compression