Have you heard of https://openai.com/index/paperbench/ and https://github.com/METR/RE-Bench ? They seem to have some genuine multi-hour agentic coding tasks. I’m curious whether you agree.
I’ll take a look. Thanks for sharing.