Nice summary! I agree, this is an interesting paper :)
But learning to predict such random future states seems to fall subject to exactly the same problem as learning to predict future observations: you have no guarantee that EfficientZero is learning relevant information, which means it could be wasting network capacity on irrelevant information. There’s a just-so story where adding this extra predictive loss results in worse end-to-end behavior because of the wasted capacity, just as there’s a just-so story where it results in better end-to-end behavior because of faster training. I’m not sure why one turned out to be true rather than the other.
This mostly depends on the size of your dataset. For very small datasets (100k frames here), the network is overparameterized and can easily overfit; adding the consistency loss provides regularisation that helps prevent this.
For larger datasets (e.g. the standard 200-million-frame setting in Atari) you’ll see less overfitting, and I would expect the impact of the consistency loss to be much smaller, possibly negative. The paper doesn’t include ablations for this, but I might test it if I have time.
To phrase it differently: the less data you have for your real objective, the more you can benefit from auxiliary losses and regularisation.
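For concreteness, the consistency loss in question is a SimSiam-style objective: the latent state the dynamics network predicts for step t+1 should match the encoder's embedding of the actual observation at t+1. Here's a minimal numpy sketch of just the core term (the function name is mine; the real implementation adds projection/prediction heads and a stop-gradient on the target branch, which I'm only noting in comments):

```python
import numpy as np

def cosine_consistency_loss(predicted_latent, target_latent):
    """Negative cosine similarity between the dynamics model's predicted
    next latent state and the encoder's embedding of the real next
    observation. In the actual training loop the target branch is under
    a stop-gradient, so only the prediction side receives gradients."""
    p = predicted_latent / np.linalg.norm(predicted_latent)
    t = target_latent / np.linalg.norm(target_latent)
    # Loss is minimised (-1) when the two latents point the same way.
    return -float(np.dot(p, t))
```

Perfectly aligned latents give a loss of -1, orthogonal ones give 0, so minimising this pushes the learned dynamics to stay consistent with what the encoder actually sees — which is where the regularisation effect comes from.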
An important distinction here is that the number of tokens a model was trained for should not be confused with the number of tokens in a dataset: if each token is seen exactly once during training then it has been trained for one “epoch”.
In my experience, scaling continues for quite a few epochs over the same dataset; overfitting only kicks in and scaling breaks down once the model has more parameters than the dataset has tokens and you train for >10 epochs.
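To spell out the tokens-vs-epochs arithmetic above (the numbers here are made up for illustration, not from any real training run):

```python
# One pass over every token in the dataset is one "epoch", so:
#   epochs = tokens_trained / dataset_tokens
dataset_tokens = 300_000_000_000  # hypothetical 300B-token dataset
tokens_trained = 900_000_000_000  # hypothetical training budget
epochs = tokens_trained / dataset_tokens
print(epochs)  # 3.0 -> each token was seen three times
```

So "trained on 900B tokens" can mean a 900B-token dataset seen once, or a 300B-token dataset seen three times — and per the comment above, those behave differently only once you push well past ~10 epochs on an overparameterized model.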