Why on earth would we expect a distilled model to have continuity of experience with the model it was distilled from? Even if you subscribe to computationalism, the distilled model is not the same algorithm.
continuity is a function of memory. although model distillation uses the term knowledge, it's the same concept. it might not apply to current models, but i suspect at some point future models will essentially be 'training' 24/7, the way the human mind uses new experiences to update its neural connections instead of simply updating working memory.
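For context on what "transferring knowledge" means mechanically: in standard knowledge distillation the student is trained to match the teacher's temperature-softened output distribution, not to copy its weights. A minimal sketch (plain NumPy, hypothetical logits, not any particular lab's method):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a 1-D array of logits."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2
    as in Hinton et al.'s distillation formulation."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return float(np.sum(p * (np.log(p) - np.log(q)))) * temperature**2

# A student that matches the teacher exactly incurs zero loss;
# any mismatch yields a positive loss to minimize.
print(distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))  # 0.0
print(distillation_loss([2.0, 1.0, 0.1], [0.5, 1.5, 0.3]) > 0)  # True
```

The point relevant to the thread: what carries over is the input-to-output behavior the loss enforces, which is a looser kind of "memory" than weight-level identity.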
There are different types of distillation, and related compression techniques like pruning. This is a frontier model, too; who knows what technique they used.