Full quote on Mathstodon for others’ interest:

In https://chatgpt.com/share/94152e76-7511-4943-9d99-1118267f4b2b I gave the new model a challenging complex analysis problem (which I had previously asked GPT4 to assist in writing up a proof of in https://chatgpt.com/share/63c5774a-d58a-47c2-9149-362b05e268b4 ). Here the results were better than previous models, but still slightly disappointing: the new model could work its way to a correct (and well-written) solution *if* provided a lot of hints and prodding, but did not generate the key conceptual ideas on its own, and did make some non-trivial mistakes. The experience seemed roughly on par with trying to advise a mediocre, but not completely incompetent, (static simulation of a) graduate student. However, this was an improvement over previous models, whose capability was closer to an actually incompetent (static simulation of a) graduate student. It may only take one or two further iterations of improved capability (and integration with other tools, such as computer algebra packages and proof assistants) until the level of “(static simulation of a) competent graduate student” is reached, at which point I could see this tool being of significant use in research level tasks. (2/3)

This o1 vs MathOverflow experts comparison was also interesting:

I posed the identical question to my version of #o1 and it returned a perfect answer: https://chatgpt.com/share/66e7153c-b7b8-800e-bf7a-1689147ed21e . Admittedly, the above MathOverflow post could conceivably have been included in the training data of the model, so this may not necessarily be an accurate evaluation of its semantic search capabilities (in contrast with the first example I shared, which I had mentioned once previously on Mastodon but without fully revealing the answer). Nevertheless it demonstrates that this tool is on par with question and answer sites with respect to high quality answers for at least some semantic search queries. (1/2)
(I believe the version he tested was what later became o1-preview.)