One fine-tuning run across multiple rollouts of the model. Though it might be worth doing at least one more training run to see how consistent the results are.
makiba
Karma: 85
One fine-tuning run across multiple rollouts of the model. Though it might be worth doing at least one more training run to see how consistent the results are.
Agree that 0-100 granularity is unnecessary. I also think the engagement signal is part of why responses read like an autobiography. Mistral’s responses to opinionated questions would almost always foreground details about itself before answering.
That said, I’m not sure whether your exact solution would work. It’s worth mentioning that before introducing an engagement signal, models would converge to flat denials and evasive responses which was not very substantive. I think some granularity is useful (something like 0-10) for easier RL, but if engagement is still in the reward as a shaping signal, I’m not sure how much the granularity on the other signals will actually matter.