I think you want to reward output rather than output that would not have otherwise happened.
This is similar to the fact that if you want to train calibration, you have to optimize you log score and just observe your lack of calibration as an opportunity to increase your log score.
I think you want to reward output rather than output that would not have otherwise happened.
This is similar to the fact that if you want to train calibration, you have to optimize you log score and just observe your lack of calibration as an opportunity to increase your log score.