Thanks, that’s a useful alternative framing of CaSc!
FWIW, I think this adversarial version of CaSc would avoid the main examples in our post where CaSc fails to reject a false hypothesis. The common feature of our examples is “cancellation”, which comes from looking at the average CaSc loss. If you only look at the loss of the worst experiment (the maximum CaSc loss rather than the average), you don’t get these kinds of cancellation problems.
Plausibly you’d run into different failure modes, though. In particular, I guess the maximum is less smooth and gives you less information about “how wrong” your hypothesis is.
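To make the cancellation point concrete, here is a toy sketch. The numbers are made up for illustration and are not from our post, and `average_gap` and `worst_case_gap` are hypothetical helper names, not part of any CaSc implementation. It assumes each experiment yields one scrubbed loss that we compare against the unscrubbed baseline loss.

```python
# Toy illustration only: made-up losses, hypothetical helper names.

def average_gap(scrubbed_losses, baseline_loss):
    """Mean scrubbed loss minus baseline: increases and decreases can cancel."""
    return sum(scrubbed_losses) / len(scrubbed_losses) - baseline_loss


def worst_case_gap(scrubbed_losses, baseline_loss):
    """Loss of the single worst experiment minus baseline: no cancellation."""
    return max(scrubbed_losses) - baseline_loss


baseline_loss = 1.0
# Half of the experiments lower the loss and half raise it by the same amount.
scrubbed_losses = [0.5, 1.5, 0.5, 1.5]

avg = average_gap(scrubbed_losses, baseline_loss)
worst = worst_case_gap(scrubbed_losses, baseline_loss)
print(f"average gap: {avg}, worst-case gap: {worst}")
```

In this toy case the average gap is zero, so the hypothesis looks fine on average, while the worst-case gap is clearly positive. It also shows the downside: the maximum is determined by a single experiment, so it says little about how the rest are distributed.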
The agents are rewarded at every timestep and we want them to perform the task throughout the whole episode, so falling over is definitely not what we want. But this has more to do with the policy optimization failing than with the reward model. In other words, a policy that doesn’t fall over would achieve a higher reward than the policies we actually learn. For example, if we plot the CLIP reward over one episode, it typically drops at the end of the episode if the agent falls down.
We tried some tricks to improve the training, such as providing a curriculum starting from short episodes to longer ones. This worked decently well and made the agents fall over less, but we ended up not using it in the final experiments because we primarily wanted to show that it works well with off-the-shelf RL algorithms.
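For anyone who wants to try the curriculum idea, here is a minimal sketch of one way to schedule episode lengths. This is not the exact schedule we used: the linear ramp, the function name `curriculum_episode_length`, and all default values are assumptions chosen for illustration.

```python
# Hypothetical schedule: the linear ramp and all defaults are illustrative assumptions.

def curriculum_episode_length(step, total_steps, min_len=50, max_len=1000,
                              ramp_frac=0.5):
    """Return the episode length to use at a given training step.

    The length grows linearly from min_len to max_len over the first
    ramp_frac of training, then stays at max_len.
    """
    ramp_steps = max(1, int(total_steps * ramp_frac))
    progress = min(1.0, max(0.0, step / ramp_steps))
    return int(round(min_len + progress * (max_len - min_len)))


if __name__ == "__main__":
    total = 1_000_000
    for step in (0, 250_000, 500_000, 1_000_000):
        print(step, curriculum_episode_length(step, total))
```

The returned value would then be used as the episode time limit when collecting rollouts at that point in training. How to do that depends on the RL library and environment wrapper you use.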