I hadn’t seen that, but yes, it’s very similar. Good to know I’m thinking along the right lines; a pity I didn’t publish a few days earlier, when I’d have looked a lot more prescient :D.
> we somehow supply the model with the “knowledge” required
Yes, I think this is a powerful research direction. It’s particularly plausible for distillation—the teacher can supply the knowledge as a suffix to the context. Then in production, you run the teacher model to produce knowledge, and the student model for all traces beyond that.
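To make that concrete, here is a minimal sketch of the production split I have in mind, assuming two Hugging Face causal LMs. The model names, prompt format, and generation settings are placeholders, not a tested recipe.

```python
# Sketch: teacher supplies the "knowledge" suffix, student produces the trace.
# Model names, prompt format, and generation lengths are placeholder assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

TEACHER_NAME = "placeholder/teacher-model"   # large model that supplies knowledge
STUDENT_NAME = "placeholder/student-model"   # smaller distilled model that reasons

teacher_tok = AutoTokenizer.from_pretrained(TEACHER_NAME)
teacher = AutoModelForCausalLM.from_pretrained(TEACHER_NAME)
student_tok = AutoTokenizer.from_pretrained(STUDENT_NAME)
student = AutoModelForCausalLM.from_pretrained(STUDENT_NAME)

def answer(question: str) -> str:
    # 1. Teacher generates the knowledge suffix for the context.
    knowledge_prompt = f"Question: {question}\nRelevant facts:"
    ids = teacher_tok(knowledge_prompt, return_tensors="pt").input_ids
    out = teacher.generate(ids, max_new_tokens=128)
    knowledge = teacher_tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

    # 2. Student sees question + knowledge and produces everything beyond that.
    student_prompt = f"Question: {question}\nRelevant facts:{knowledge}\nAnswer:"
    ids = student_tok(student_prompt, return_tensors="pt").input_ids
    out = student.generate(ids, max_new_tokens=256)
    return student_tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
```

During distillation you would train the student on traces whose contexts already carry the teacher-written knowledge suffix, so the production prompt above matches what the student saw in training.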