Yup, here is such a plot, made after training the “switcher” architecture on 350k examples. As I remember, it looked similar for the longer training run: the few longest task lengths struggle, but the rest are near 100%.