Curated. This feels like an obvious idea (at least in retrospect), and I haven’t seen anyone else discuss it. The fact that you ran experiments and got interesting results puts this above my bar for curation.
I also appreciated the replies comparing it to ELK and debate paradigms. I’d love to see more discussion in the comments about how it relates to ELK.
I’m not very optimistic about this scaling to smarter models in domains where solutions are harder to verify, but I’m not confident in that take, and I hope I’m wrong. Either way, it likely helps with current models in easier-to-verify domains, and the implementation seems close to ready, which is pretty cool.
Thank you! GPT-5 is a pretty smart model, and we have used it in domains where the grader did not have ground truth, but I agree we need to study it more!