I’m not really sure how this would interact with RL since loss isn’t calculated per-token and you’re not trying to predict an exact output. I need to get some RL experience so I might try this at some point (although I’d also be happy if someone else got to it first).