Super work! It must have required a crazy amount of technical and manual work.
The fact that you manage to reduce the number of failures by 3 orders of magnitude is quite impressive. Did you separate train and test set at the beginning of the project?
In conclusion, using a 300M parameter model to supervise a 10 times larger model seems to work well, it gives a lot of hope.
Super work! It must have required a crazy amount of technical and manual work.
The fact that you manage to reduce the number of failures by 3 orders of magnitude is quite impressive. Did you separate train and test set at the beginning of the project?
In conclusion, using a 300M parameter model to supervise a 10 times larger model seems to work well, it gives a lot of hope.