We test on GPT-4.1, which I think was frontier-ish, at least as of 4 months ago.
I agree with the principle of testing more models. I’m most interested in RL environments!