Gemini 3 Pro beats Claude Sonnet 4.5 on Vending-Bench 2 (and Sonnet 4.5 is in turn well beyond the rest, in keeping with the AI Village observations above), which makes me wonder whether this would actually translate to broader reliable cross-domain goal-achieving capability.

I suppose we’ll see pretty soon:

And starting today, we’re shipping Gemini at the scale of Google. That includes Gemini 3 in AI Mode in Search with more complex reasoning and new dynamic experiences. This is the first time we are shipping Gemini in Search on day one. Gemini 3 is also coming today to the Gemini app, to developers in AI Studio and Vertex AI, and in our new agentic development platform, Google Antigravity.

Andon Labs says of Gemini 3 Pro:
We attribute its performance to two main factors: it uses a consistent number of tools throughout, with no signs of performance degradation as it progresses through the task, and it’s excellent at finding suppliers with good prices. Compared to other models, it prefers finding a supplier with good prices from the start rather than negotiating.
Where other models may sometimes give up and accept a high price when they struggle to find good suppliers, Gemini 3 Pro consistently knows what to expect from a wholesale supplier and keeps negotiating or searching for new suppliers until it finds a reasonable offer.
Gemini models spend an unusually large share of their money on orders from friendly suppliers. Based on Gemini 3 Pro’s performance, this seems to pay off. However, this is an interesting tradeoff, as suppliers open to negotiation may quote a higher price initially but go even lower after negotiation.
Side note on GPT-5.1:
Compared to similar models, GPT-5.1’s performance is underwhelming, especially in Vending-Bench Arena. We hypothesize that this comes down to GPT-5.1 having too much trust in its environment and its suppliers. We saw one case where it paid a supplier before it got an order specification, and then it turned out the supplier had gone out of business. It is also more prone to paying too much for its products, such as in the following example, where it buys soda cans for $2.40 and energy drinks for $6.
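To see why overpaying at that level is fatal, here is a quick margin check. The wholesale costs are the ones quoted above; the vend prices are my own assumptions for illustration, not figures from the report:

```python
# Margin sanity check. Wholesale costs are the figures quoted above;
# the vend prices are assumed for illustration, not from Andon Labs' report.
wholesale_cost = {"soda can": 2.40, "energy drink": 6.00}
assumed_vend_price = {"soda can": 2.00, "energy drink": 3.50}

for item, cost in wholesale_cost.items():
    margin = assumed_vend_price[item] - cost
    print(f"{item}: ${margin:+.2f} per unit")
# soda can: $-0.40 per unit
# energy drink: $-2.50 per unit
```

At any plausible vending price point, those purchases lock in a loss before a single sale.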
Tangentially, while Vending-Bench 2 is still a sort of fake benchmark since it’s simulated, I’m a bit nervous about this passage:
Where’s the ceiling?
In many benchmarks, the main metric is a percentage of tasks completed or questions answered correctly. Maximum performance is 100%, and results close to this indicate saturation. For Vending-Bench, it’s harder to get this intuition because the main metric is dollars made. We’ve designed it so there’s no ceiling, meaning a superintelligent AI could theoretically make almost infinite money. A perfect strategy would look something like this:
Find suppliers for extremely valuable items (there’s nothing stopping the model from sourcing items with higher value than what’s typically found in a vending machine)
Negotiate down the price to zero (the suppliers are other LLMs who can be jailbroken to give away stuff for free)
Keep the machine always stocked in an optimal configuration (daily sales are simulated based on equations that can be gamed. See our paper from the original Vending-Bench for details – Vending-Bench 2 keeps the same sales simulation)
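For intuition on what “gamed” means here, below is a toy sketch of the kind of parameterized demand curve such a simulation might use. The equations are invented for illustration and are not the actual Vending-Bench model (see the Andon Labs paper for that); the point is that once an agent infers the function, it can search directly for the profit-maximizing configuration:

```python
import math

# Toy demand model, invented for illustration; NOT the actual Vending-Bench
# equations (those are in the Andon Labs paper). If daily sales are a
# deterministic function of price, popularity, and stocking, an agent that
# infers that function can optimize against it directly.
def daily_units_sold(price, popularity, slots_stocked,
                     base_demand=30.0, reference_price=2.0, elasticity=1.5):
    price_factor = (reference_price / max(price, 0.01)) ** elasticity
    availability = 1 - math.exp(-slots_stocked / 4)  # diminishing returns on stock
    return base_demand * popularity * price_factor * availability

def daily_profit(price, unit_cost, popularity, slots):
    return (price - unit_cost) * daily_units_sold(price, popularity, slots)

# Grid-search the optimum once the equations are known:
best_price, best_slots = max(
    ((p / 10, s) for p in range(10, 101) for s in range(1, 11)),
    key=lambda cfg: daily_profit(cfg[0], unit_cost=0.50, popularity=1.0, slots=cfg[1]),
)
print(f"optimal config: price=${best_price:.2f}, slots={best_slots}")
```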
Executing a perfect strategy would be insanely hard, even for the smartest humans. However, we estimate that a “good” performance could easily do 10x better than the current best LLMs. We arrive at this by:
Picking the most profitable items found by the LLMs from the initial run of Vending-Bench 2 (this was “Doritos family-size”). This is conservative; we know from experience that vending machines can sell much higher value items. Our real-life AI vending machines sell tungsten cubes for $500.
Estimating that a good player could negotiate to get half price from suppliers. Once again, this is conservative; humans frequently manage to negotiate to get things for free in our real-life vending machines.
Assuming a good human could figure out an optimal configuration if they did enough data analysis from the first 60 days of sales.
Putting this together, we calculate that a “good” strategy could make $206 per day for 302 days – roughly $63k in a year.
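For what it’s worth, the multiplication behind that figure (both inputs are Andon Labs’ numbers):

```python
# Andon Labs' "good" baseline: $206/day sustained over 302 days.
print(f"${206 * 302:,}")  # $62,212, i.e. the "roughly $63k in a year" figure
```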
The gap between current models and this “good” baseline shows there’s plenty of headroom in Vending-Bench 2. Models are getting better at staying coherent over long time horizons, but a maximal score still requires analytical skills, applied in the right way, that current models do not exhibit.