If you have only tiny model capacity and abundant reward feedback, purely supervised learning wins—as in the first successes of deep learning like AlexNet and DeepMind's early agents. This is expected: when each connection/parameter is precious, you can't 'waste' any capacity by investing it in modeling bits of the world that don't have immediate payoff.
But in the real world sensory information vastly dwarfs reward information, so with increasing model capacity unsupervised learning (UL) wins—as in the more modern success of transformers trained with UL. The brain is very far along in that direction—by comparison, it has essentially unlimited model capacity.
1/2: The issue is that the system can't easily predict which aspects of its input will turn out to matter much later for downstream tasks.
All that being said, I somewhat agree in the sense that perplexity isn't necessarily the best measure (the best measure being whichever one best predicts performance on all the downstream tasks).
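For concreteness, perplexity is just the exponentiated average negative log-likelihood the model assigns to held-out tokens. A minimal sketch (the token probabilities here are made up for illustration):

```python
import math

# Hypothetical per-token probabilities a model assigned to a held-out sequence.
token_probs = [0.5, 0.25, 0.125, 0.125]

# Average negative log-likelihood (cross-entropy in nats).
avg_nll = sum(-math.log(p) for p in token_probs) / len(token_probs)

# Perplexity: the effective branching factor the model is "confused" among.
perplexity = math.exp(avg_nll)
print(perplexity)  # ≈ 4.76
```

Lower perplexity means the model predicts the stream better on average, but nothing in the formula guarantees the capacity went into the aspects of the world that downstream tasks will care about.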