Thanks! Vending-Bench v2 is going to be fire. Would love to include gpt5 <3
Lukas Petersson
This is a great point. I admit I need to better understand what each model provider does behind the scenes in the API. It would be sad if the days of access to the model are gone.
We thought about that, but then it’s not reproducible if we want to run it for new models later
Thanks, that would be great!
You can’t eval GPT5 anymore
AI misbehaviour in the wild from Andon Labs’ Safety Report
Thanks for highlighting our work!
The Same Heaven
Linguistic Imperialism in AI: Enforcing Human-Readable Chain-of-Thought
AI Safety as a YC Startup
This is probably not the first barrier to getting into evals, but I have an AI safety startup that designs evals. However, we don’t have the capacity to also do good elicitation. I think we lose a lot of signal from our evals because our agent is too weak to explore properly. We’re currently using Inspect’s basic_agent. METR’s modular_public is better, but we prefer inspect over vivaria otherwise. I think open-sourcing a better agent would be positive for the evals community without contributing to capabilities.
We set it to some date in the future