Yeah, until recently I thought the same thing, based on my belief that distilling a teacher model that has been trained by RL into a student model preserves not just the distribution over outputs but also, for the most part, the mechanisms behind those outputs. Which, as far as I can tell, was an incorrect belief.