I think it’s hard to get a useful model, for reasons related to the blatant reward hacking: RL on long-horizon tasks is difficult without a well-defined reward signal.