You’re describing a data-augmentation variant of teacher-student knowledge distillation: the large teacher model labels extra (possibly augmented) inputs, and the small student is trained on those soft labels. It can work well.
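A minimal sketch of the core distillation objective, in NumPy. The function names and the temperature value are my own choices, not from a specific framework; the idea is simply to train the student against the teacher's temperature-softened output distribution:

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax: higher T gives softer targets."""
    z = np.asarray(logits, dtype=np.float64) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy of the student's softened predictions against the
    teacher's softened output distribution (the teacher-student signal)."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return -np.sum(p_teacher * np.log(p_student + 1e-12))
```

In practice this term is usually mixed with the ordinary hard-label loss; the loss is minimized when the student's distribution matches the teacher's.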
16 bits/parameter is the most commonly supported precision, but 8-bit quantization can also be used.
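A minimal sketch of symmetric 8-bit post-training quantization in NumPy, to show what "8 bits/parameter" means concretely. The function names are illustrative, not a particular library's API: each float weight is mapped to an int8 code via a single per-tensor scale, and dequantization multiplies the codes back by that scale:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric 8-bit quantization: one scale maps floats to int8."""
    w = np.asarray(w, dtype=np.float32)
    scale = np.abs(w).max() / 127.0        # largest magnitude maps to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 codes."""
    return q.astype(np.float32) * scale
```

The round-trip error per weight is at most half the scale, which is why 8-bit storage typically costs little accuracy while halving memory relative to 16-bit weights.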
Performance depends not only on the number of parameters but also on the architecture.
High-end smartphones commonly include special-purpose processors for neural networks, so their performance is not bad.