Scaling Laws for LLM-Based Data Compression

Scaling laws for neural language models show that cross-entropy loss follows predictable power-law trends with dataset size, model size, and training compute. Since lower cross-entropy corresponds to better next-token prediction, and prediction is closely linked to compression, we should expect improvements in language-model loss to translate into better lossless compression.
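
The link between loss and compressed size can be written out explicitly. The following is a sketch under stated assumptions: an ideal arithmetic coder driven by a model q, a sequence of T tokens that spans B bytes, and an average cross-entropy loss \mathcal{L} measured in nats per token. The symbols q, T, B, and \mathcal{L} are introduced here for illustration only.

```latex
\ell(x_{1:T}) \;\approx\; -\sum_{t=1}^{T} \log_2 q(x_t \mid x_{<t}) \ \text{bits}
\;=\; \frac{T\,\mathcal{L}}{\ln 2},
\qquad
\text{compression ratio} \;\approx\; \frac{\ell(x_{1:T})}{8B}
\;=\; \frac{T\,\mathcal{L}}{8\,B \ln 2}
```

Under these assumptions, any reduction in loss lowers the compressed size proportionally, up to the small constant overhead of the arithmetic coder.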

This raises a natural question:

Do LLMs exhibit clean power-law scaling in compression performance?

To test this, I use the Pythia model family, which provides models at multiple parameter scales and training checkpoints. Since all Pythia models are trained on the same dataset, The Pile, this gives a relatively controlled setting for comparing compression performance across scale and training progress.

Following Delétang et al. (2023), I use each model’s predicted next-token probabilities as the probability distribution for an arithmetic encoder, which is a near-optimal entropy coder. For the text experiments, I use the first 2048 chunks of enwik8, with each chunk containing 2048 bytes. I also apply the same pipeline to non-text modalities, using ImageNet-1k patches for images and LibriSpeech audio for speech.
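
The measurement described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline actually used: the helper names (`chunk_bytes`, `ideal_code_length_bits`, `compression_ratio`) are hypothetical, the model call is left out, and the ratio is assumed to be compressed size over raw size. The ideal code length stands in for the arithmetic encoder's output, which it closely approximates.

```python
import math
from typing import Iterable, List


def chunk_bytes(data: bytes, chunk_size: int = 2048, num_chunks: int = 2048) -> List[bytes]:
    """Return up to `num_chunks` full chunks of `chunk_size` bytes from the start of `data`."""
    chunks = []
    for start in range(0, chunk_size * num_chunks, chunk_size):
        chunk = data[start:start + chunk_size]
        if len(chunk) < chunk_size:  # stop at a trailing partial chunk
            break
        chunks.append(chunk)
    return chunks


def ideal_code_length_bits(token_probs: Iterable[float]) -> float:
    """Sum of -log2(p) over the probabilities assigned to the true next tokens."""
    return sum(-math.log2(p) for p in token_probs)


def compression_ratio(total_bits: float, raw_bytes: int) -> float:
    """Compressed size divided by raw size (assumed convention; lower is better)."""
    return total_bits / (8 * raw_bytes)
```

With real model outputs, `token_probs` would hold the probability the model assigned to each true next token in a chunk, and the per-chunk bit counts would be summed before dividing by the total number of raw bytes.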

Below are the compression ratios (compressed size divided by raw size, so lower is better) across model scales, training checkpoints (given in training steps), and data modalities. I also fit a Kaplan-style power law to these results.

Text Compression

Model          1k      8k      32k     128k    143k
pythia-70M     0.223   0.176   0.17    0.173   0.175
pythia-160M    0.218   0.159   0.149   0.149   0.150
pythia-410M    0.223   0.148   0.136   0.129   0.128
pythia-1B      0.207   0.140   0.128   0.120   0.120
pythia-1.4B    0.207   0.137   0.124   0.115   0.115
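
To illustrate what a Kaplan-style fit involves, the sketch below fits the simplest single-variable form, CR(N) = a * N^(-alpha), to the final-checkpoint (143k) text values from the table above. The functional form is an assumption made for illustration and may differ from the one actually fitted; the parameter counts are the nominal sizes in the model names, not exact counts; and `fit_power_law` is a hypothetical helper.

```python
import numpy as np

# Nominal parameter counts read off the model names (an assumption; the
# exact counts differ somewhat from the names).
params = np.array([70e6, 160e6, 410e6, 1e9, 1.4e9])
# Text compression ratios at the 143k-step checkpoint, from the table above.
ratios = np.array([0.175, 0.150, 0.128, 0.120, 0.115])


def fit_power_law(n, cr):
    """Least-squares fit of cr = a * n**(-alpha), done as a line in log-log space."""
    slope, intercept = np.polyfit(np.log(n), np.log(cr), deg=1)
    return float(np.exp(intercept)), float(-slope)


a, alpha = fit_power_law(params, ratios)
predicted = a * params ** (-alpha)
print(f"a = {a:.3f}, alpha = {alpha:.3f}")
```

Fitting in log-log space keeps the example short, but it weights relative rather than absolute errors. A form with an irreducible offset or a joint dependence on training steps would need a nonlinear fit instead.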

Image Compression

Model          1k      8k      32k     128k    143k
pythia-70M     0.601   0.499   0.492   0.505   0.513
pythia-160M    0.615   0.483   0.471   0.482   0.492
pythia-410M    0.668   0.506   0.461   0.444   0.447
pythia-1B      0.601   0.470   0.456   0.436   0.440
pythia-1.4B    0.643   0.482   0.470   0.434   0.436

Speech Compression

Model          1k      8k      32k     128k    143k
pythia-70M     0.695   0.460   0.439   0.475   0.466
pythia-160M    0.678   0.440   0.430   0.433   0.456
pythia-410M    0.770   0.505   0.404   0.383   0.391
pythia-1B      0.677   0.424   0.444   0.376   0.384
pythia-1.4B    0.752   0.469   0.443   0.378   0.385

Combined Scaling Curves

The plot below shows the scaling curves across all three modalities. While the scaling trend is also present in the non-text modalities, compression there is not as strong as it is for text.

Figure: Scaling curves for text, image, and speech.

One possible explanation for the scaling trend on non-text data is that it arises from two complementary mechanisms. First, in-context learning may allow the model to rapidly adapt to local structure within a sequence, such as repeating pixel patterns in images or periodic signals in audio. Second, pretraining may induce broad statistical priors over natural data, including Zipfian distributions, heavy tails, and long-range correlations that occur across many modalities. A useful direction for future work would be to disentangle these effects and quantify how their relative contributions evolve during pretraining.