Is this also because GPT-5 is much more like “black box software” and a lot less like a “model”? Do the evals assume they are running against a “model” (or something close enough to one), rather than “black box software” that could be doing absolutely anything behind the scenes (including various web searches, adding hidden context, filtering, or even potentially human Mechanical Turk workers answering everything)?
Even if you override the date, if it's doing hidden web searches in the background, those will be based on today's date and today's internet, and they will affect its results. Overriding the date may not solve your problem if that is the case.
I would imagine future “models” will increasingly move in that direction, toward a hybrid approach and away from a true foundation model that anyone can build anything on top of, for functionality, safety, and business-model reasons (e.g. Google may not allow their models to remove ads or reverse engineer their own software).
This is a great point. I admit I need to better understand what each model provider does behind the scenes in the API. Sad if the days of access to the model itself are gone.