The paper does include a few empirical experiments showing that they arrive at different solutions, specifically the KL-reward plot. Would you need more settings to be convinced here?