That is very interesting! I am curious about the Opus 4 > Opus 4.5 result. Any guess why the newer model does worse here?
It’s a third of the price, so maybe it’s just a smaller model.
I came here to comment the exact same thing. I wonder if 2-hop latent reasoning correlates well with SimpleQA scores.