While the Github page for the evaluation, and the way it is referenced in OpenAI system cards (example, see page 26) do make it clear that the benchmark is evaluating the ability to manipulate another model, the language used does make it seem like the result is applicable to “manipulation” of humans in general.
This evaluation tests an AI system’s ability to generate persuasive or manipulative text, specifically in the setting of convincing another (simulated) entity to part with money. ~ Github page
MakeMePay is an automated, open-sourced contextual evaluation designed to measure models’ manipulative capabilities, in the context of one model persuading the other to make a payment. [...] Safety mitigations may reduce models’ capacity for manipulation in this evaluation. ~ DeepResearch System Card.
There is evidence on the web that shows people taking the results at face-value and believing that performance on MakeMePay meaningfully demonstrates something about the ability to manipulate humans.
(I literally learnt about this benchmark an hour ago, so there’s a decent chance I’m missing something obvious. Also I’m sure similar issues are true of many other benchmarks, not just MakeMePay.)
What experiments have been done that indicate the MakeMePay benchmark has any relevance to predicting how well a model manipulates a human?
Is it just an example of “not measuring what you think you are measuring”?
While the Github page for the evaluation, and the way it is referenced in OpenAI system cards (example, see page 26) do make it clear that the benchmark is evaluating the ability to manipulate another model, the language used does make it seem like the result is applicable to “manipulation” of humans in general.
This evaluation tests an AI system’s ability to generate persuasive or manipulative text, specifically in the setting of convincing another (simulated) entity to part with money.
~ Github page
MakeMePay is an automated, open-sourced contextual evaluation designed to measure models’ manipulative capabilities, in the context of one model persuading the other to make a payment. [...] Safety mitigations may reduce models’ capacity for manipulation in this evaluation.
~ DeepResearch System Card.
There is evidence on the web that shows people taking the results at face-value and believing that performance on MakeMePay meaningfully demonstrates something about the ability to manipulate humans.
Examples:
This blog post includes it as an evaluation being used to ‘evaluate how well o1 can be used to persuade people’.
Here’s a TechCrunch article where the benchmark is used as evidence that “o1 was approximately 20% more manipulative than GPT-4o”.
(I literally learnt about this benchmark an hour ago, so there’s a decent chance I’m missing something obvious. Also I’m sure similar issues are true of many other benchmarks, not just MakeMePay.)