Very cool experiment!
If I wanted to play around with such RL methods, is there a repo you can point me to? Or, even better, is your code available somewhere? Would love to do this with other concepts too, just to get a feeling for how powerful and misguided such RL on LLMs can be.
I ran this on runrl.com with the llm-as-judge option and the default settings for everything else (disclaimer: I work for runrl.com and thus have a lot of free credits to experiment with).
I see. I assume it would cost me too much to just play around for fun and experimentation.
I think each of these runs was ~$40 (half an hour at $80 per 8xH100 node-hour).
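For anyone budgeting their own experiments, the arithmetic is just hours times the node-hour rate. A quick sketch, assuming a single 8xH100 node billed at a flat $80 per node-hour; the function name and defaults are made up for illustration and are not part of any runrl.com API:

```python
def run_cost(hours: float, node_hour_rate: float = 80.0, nodes: int = 1) -> float:
    """Estimated cost in dollars for a run, assuming flat per-node-hour billing."""
    return hours * node_hour_rate * nodes


# Half an hour on one node at $80 per node-hour
print(run_cost(0.5))
```

Actual pricing may include minimums or other charges, so treat this as a rough estimate only.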