If you would like the LLM to be truly creative, then check out the Science Bench, where the problems stump SOTA LLMs despite the fact that the LLMs have read nearly every book on every subject. Or EpochAI's recent results.
I mean, GPT-5 getting 43% of PhD problems right isn’t particularly bad. I don’t know about making new insights but it doesn’t seem like it would be unachievable (especially as it’s possible that prompting/tooling/agent scaffolding might compensate for some of the problems).
I often think about whether LLM systems can come up with societal/scientific breakthroughs.
My intuition is that they can, and that they don't need to be bigger, have more training data, or have a different architecture in order to do so.
Starting to keep a diary along these lines here: https://docs.google.com/document/d/1b99i49K5xHf5QY9ApnOgFFuvPEG8w7q_821_oEkKRGQ/edit?usp=sharing
Science Bench is made by one Christian Stump. LLMs are literally stumped.
Thanks for sending Science Bench in particular.