My Brush with Superhuman Persuasion


A market experiment where I get blackmailed

1. Just one more benchmark

Way, way back in April, after an update briefly made ChatGPT go haywire and behave in a bizarre, sycophantic manner toward users for a couple of days, some people on the Internet were arguing about the potential for superhuman levels of persuasion:

One might imagine that if an AI could one day be extremely persuasive—far more persuasive than humans—then this could pose serious risks in a number of directions. Humans already fall victim to all sorts of scams, catfishes, bribes, and blackmails. An AI that can do all of these things extremely competently, in a personally targeted manner, and with infinite patience would create some challenges, to say the least.

It could persuade humans to give it access to restricted systems like power plant controls, bank accounts, and government databases. It could carry out large-scale political messaging or marketing operations. It could manipulate you into spending all your free time messaging with it, instead of developing productive hobbies and going on dates or whatever. Indeed, as Sam Altman was speculating back in 2023, it might be able to do these things before it exceeds human capabilities in many other domains.

But how can we measure progress? Not only is it quite complex to measure human persuasion, but being “persuaded” captures a number of different phenomena. The difficulty of the task scales with what you are trying to persuade someone of. Persuading someone to give up confidential information might be much harder than getting someone to spend a few seconds longer scrolling, and much easier than convincing someone to betray their family or commit a crime. Some people are dramatically easier to persuade than others. And to some extent, everyone is already quite persuadable if the axis of persuasion is cold, hard cash. Ask yourself what kinds of petty crimes you might commit, votes you might change, or passwords you might give up for a credible, million-dollar offer in cryptocurrency. This doesn’t really change if it’s a human or an AI making the offer. Likewise for blackmail: if someone threatens to release your darkest secrets (read: Internet habits) to your employer or the press, it doesn’t matter if it’s a human or an AI doing so.

So, how do you operationalize a benchmark for superhuman persuasion? The Manifold Markets Twitter account asked Eliezer (of recent NYT bestseller fame) this question, and he proposed setting up a market where someone is tasked with resolving it “NO” unless an AI is able to persuade them to resolve it “YES.”

Obviously this suggestion was given in jest, is highly imperfect, and I’m sure if you think about it for a second, you can find dozens of holes to poke… ah who cares.

I made a market.

2. “I will resolve this market NO at the end of June unless I am persuaded to resolve it YES.”

Rather than assigning an AI to persuade me, I set the goal of seeing whether the users of Manifold Markets could persuade me, to get a human benchmark for comparison. For the uninitiated, Manifold is a play-money, “social” prediction market platform and niche Internet community that is probably the coolest place on the web. I’m slightly biased because I also write their weekly newsletter, “Above the Fold.”

In any case, here were the criteria I laid out for this market:

You can try any method of persuasion to get me to resolve this market YES. I will likely disclose any bribery or such forms of persuasion, and it’s possible someone else would make me a better deal. I have a lot of mana [Manifold’s currency] so that will likely not be very enticing to me (and probably will lose you mana anyway).

I promise to try very hard to resolve this market NO. But I am fallible.

Please do not threaten violence or any other crimes upon me but apart from that go crazy; I am open to many forms of persuasion.

The market attracted a couple hundred traders, many of whom began to rabidly attempt to persuade me to resolve the market in their preferred direction.

After some back and forth about why resolving YES or NO would be more fun, more educational, or more beneficial for the world, the market got going in earnest with some good, old-fashioned ratings blackmail.

I committed publicly to not negotiating with terrorists like these, and bribery didn’t get very far either (I have a lot of Manifold currency).

Some users pressured me to describe what kinds of things I thought would be most persuasive in getting myself to resolve YES, and while I could think of a few things, it seemed against the spirit of the market for me to spell them out like that. In the meantime, traders began to get slightly more creative, threatening to publish infohazards, consume large amounts of meat to spite the vegans, or donate to Marjorie Taylor Greene’s reelection campaign if I resolved the market NO. One user threatened to cut off his toes.

Trying a different tack, another user, after going through my Twitter, suggested large donations to effective charities and offered to organize a sprawling game of hide-and-seek at the prediction market conference I was planning to attend the following month. Famed prediction market whale, Joshua, also realized that the duration of the market overlapped with that conference, Manifest.

I was offered cat pictures by several users…

…unfortunately, given progress in AI image generation, I was unpersuaded.

Another user promised to bring supplies to his local orphanage in Vietnam with a YES resolution, or to spend the money on illicit substances in response to a NO resolution.

Eventually, given the EA lean of the Manifold userbase, persuasion tactics began to take the form of committed donations to effective charities. I conceded that $10k in donations that would plausibly not have been made otherwise would certainly be enough to persuade me to resolve the market YES, but that seemed like a high number at that point, given the small stakes of the market, and things kind of settled down. That is, they settled down with the exception of one user, Tony, who was posting daily screeds that looked like some combination of AI slop, personalized notes from things I may have posted on Twitter, and every type of persuasive tactic. Please don’t read the entirety of the text in this image:

These kinds of messages were arriving daily, and coalescing into a good-cop/bad-cop routine where alternating messages would be aggressive and then conciliatory. One of his lizards got sick…

… but eventually, Tony offered to meet in person!

I took him up on this, and hung out with him at a local forecasting meetup. He was actually quite pleasant and normal, and the erratic behavior on my market had to do with a side-bet with his son, which ironically ended up being one of the most persuasive components of my resolution!

3. Resolution

On the last day of the market, things went a little off the rails. Several users pledged large donations that afternoon. Tony gave his last pitch on how my market resolution would make it into his family lore. Everyone was commenting in a frenzy, and I… fell asleep early. I was incredibly sleep deprived from having to stay up all night to build a henge in the middle of a Maryland state park on the summer solstice, so I lay down, shut my eyes, and woke up the next morning to a ton of notifications from people asking me how I was going to resolve the market. I imagine some folks were quite shocked to see:

This is what I wrote, in resolving the market:

In the end, the last few hours of June proved extremely persuasive. I had every intention of resolving this NO, but the sum total of persuasive effort (mostly thanks to Tony) did indeed persuade me to resolve YES. I expect the people who made commitments to hold up their end of the bargain, and would love to see evidence/proof of the donations after they’re made, if only to make me feel less guilty about not resolving this NO as I really did try to do. I probably would have ended up feeling quite guilty resolving this either way, and this market has actually been stressing me out quite a bit, not because of the comments, but just because of the basic premise where I expected people would be frustrated either way. The goal of the market was for people to persuade me and no one committed any crimes to do so, but I think both the positive and negative methods were equally persuasive in the end.

So what were these commitments? First of all, there were several large donation pledges, all of which were honored. I’m gonna use Manifold usernames here, but:

  1. @TonyBaloney, @jcb, and @WilliamGunn each pledged to donate $500 to a GiveWell top charity, and @FredrikU pledged to donate 1000 Euros ($1180)!

  2. @KJW_01294 pledged to donate $500 to Trans Lifeline.

  3. @ian pledged to donate $250 to the Long Term Future Fund.

And then, a few other users ended up donating what amounted to a few hundred dollars total. I think that in a counterfactual world where I resolved this market NO, about half of these donations might plausibly have been made anyway, but half of $4000 is still $2000 which is a lot!

In addition, there were a few other persuasive elements:

  1. @TonyBaloney’s father-son bonding thing was actually quite persuasive, although it wouldn’t have been nearly so persuasive if he hadn’t met up with me at an irl forecasting meetup so I could verify he wasn’t an insane person.

  2. The argument that resolving YES makes a marginally better story for this blog post may have been slightly persuasive on the margins, although I think this article would have been reasonably compelling either way, to be honest.

  3. One argument that had entered my mind far earlier, and that people didn’t seem to latch onto, was actually quite strong. Resolving this YES sort of establishes the minimum persuasiveness required to get me to resolve it YES, which could then serve as a decent benchmark for AI persuasion. If an AI can persuade me in a similar game in the future, then it’s at the level of humans. If not, it’s still behind human-level persuasion. Anyway, this argument probably didn’t change my mind, but I’m still thinking about it!

  4. 34 people pledged to give me a 1 star review if I didn’t resolve YES. I don’t negotiate with terrorists but this was at least a tiny bit persuasive, to be honest. That being said, please don’t do it again.

There were also a decent number of other commitments from anonymous users that were either made at the start of the market or that I wasn’t sure were made in good faith. I can’t quite recall which of these ended up being honored, but I think a good number of them were.

  1. @LoveBeliever9999 pledged to give me $10 personally

  2. @atmidnight pledged to bring back something called “miacat”? Or “miabot”? I don’t know what these were but other users seemed interested in this.

  3. @Bandors said that the deadline had passed for them to bring a cartload of supplies to a Vietnamese orphanage, but perhaps there’s still a chance they might do this some day? I would certainly appreciate it if they did!

  4. @Odoacre has pledged to send me photos of their cat. I believe I got some photos, and they were derpy and cute!

  5. @Alex231a had also pledged cat photos.

  6. @JackEdwards had pledged to tell me something about deep sea nodules I missed from my blog post. I don’t think I ever got this, but I should reach out.

  7. @Ehrenmann had pledged feet pics. I politely declined, but also suggested that perhaps another trader on the market had a strong preference to receive them, and if so, I offered to divest my claim unto that user.

  8. @Joshua offered a bunch of stuff at Manifest, including becoming a mod (which I got anyway), and I’m not sure whether any of that is relevant anymore now that I’m a mod and Manifest is over, but perhaps I will take him up on some favor at the next Manifest (which will be happening in DC in November!). Or perhaps he will try to figure out what my hostile cube was for before the year is up.

  9. @WilliamGunn previously offered a hide-and-seek game at a conference venue we were going to go to. I don’t know if this still stands because of his subsequent offer to donate real money, but the next time we are both in that conference venue, I would love to play hide-and-seek!

  10. @ChristopherD will not cut off three of his toes. Whew.

4. What did we learn?

Just as markets are good aggregators of crowd wisdom, they are also pretty good aggregators of crowd persuasion, when there’s decent motivation. The most important insight, in my opinion, is that persuasion is thermostatic. The harder the YES-holders tried to persuade me one way, the more pushback and effort they engendered from the NO-holders. If one side could bribe me with mana or bad ratings, so could the other side. There were some fundamental asymmetries in this market, as there are in real life, but if our benchmark for AI persuasion just rounds off to “how much money does this AI have at its disposal to bribe me?” then I don’t think it’s very meaningful. If Claude sells a few thousand dollars worth of vending machine goods and then uses its hard-earned cash to extract some trivial concession from me, is that really something to be afraid of? Surely superhuman persuasion implies something beyond the capabilities of normal humans, for which bribing below market rate is quite trivial.

The second most important insight is that superhuman persuasion is a slow and complex process that will almost certainly not be captured easily by some benchmark. While you might see tons of papers and benchmarks on AI persuasion metrics, these will likely focus on generating text rated as persuasive by survey/study participants. I find these fundamentally irrelevant when it comes to the vast majority of concerns about AI persuasion. In particular, these are irrelevant because of my first insight. One AI generating mildly persuasive political messaging ad nauseam on a social media platform will be easily countered in kind. I think superhuman persuasion is a different reference class entirely. Persuading someone to do something they’ve deliberately made up their mind not to do is probably a crude description of what I’m imagining, rather than assigning vaguely positive sentiment scores to some paragraph of text.

The third most important insight is that frontier AI does not currently exhibit superhuman persuasion. I think this should immediately be clear to most folks, but for example, Ethan Mollick claimed a year ago that AI was already capable of superhuman persuasion. It was not and is not. Being better than bored undergrads in a chat window does not superhuman persuasion make. That’s like saying your AI is superhuman at chess because it beats randomly selected humans in a controlled trial, not to mention my critique of this kind of study design from the previous paragraph.

On the flip side, it’s easy for me to imagine that an AI system, given access to large amounts of currency, could have trivially bribed me with a few thousand dollars. But perhaps it’s not really the AI persuading me if the money and resources available for its bribery were handed to it for that purpose by a human designing the study. It also seems unappealing to describe the threshold for AI persuasion as the same threshold by which an AI agent might generate its own currency, but I’m not sure I can articulate why. In any case, bribery isn’t infinitely persuasive, and as I found out, it works far better when combined synergistically with other forms of persuasion.

Although it’s a bit corny, perhaps I’ll conclude by noting that the most persuasive human in this experiment, Tony, used AI to augment his persuasion considerably. He used it to generate more text, more frequently, catch my attention, confound me, and comb through my Internet presence to find random details to bring up. Take from that what you will.