$1000 bounty for OpenAI to show whether GPT3 was “deliberately” pretending to be stupider than it is

Link post

Twitter thread by Eliezer Yudkowsky, with the bounty in bold:

So I don’t want to sound alarms prematurely, here, but we could possibly be looking at the first case of an AI pretending to be stupider than it is. In this example, GPT-3 apparently fails to learn/​understand how to detect balanced sets of parentheses.

Now, it’s possibly that GPT-3 “legitimately” did not understand this concept, even though GPT-3 can, in other contexts, seemingly write code or multiply 5-digit numbers. But it’s also possible that GPT-3, playing the role of John, predicted that *John* wouldn’t learn it.

It’s tempting to anthropomorphize GPT-3 as trying its hardest to make John smart. That’s what we want GPT-3 to do, right? But what GPT-3 actually does is predict text continuations. If *you* saw John say all that—would you *predict* the next lines would show John succeeding?

So it could be that GPT-3 straight-up can’t recognize balanced parentheses. Or it could be that GPT-3 could recognize them given a different prompt. Or it could be that the cognition inside GPT-3 does see the pattern, but play-acts the part of ‘John’ getting it wrong.

The scariest feature of this whole incident? We have no idea if that happened. Nobody has any idea what GPT-3 is ‘thinking’. We have no idea whether this run of GPT-3 contained a more intelligent cognition that faked a less intelligent cognition.

Now, I *could* be wrong about that last part!

@openAI could be storing a record of all inputs and randseeds used in GPT-3 instances, so that they can reconstruct any interesting runs. And though it seems less likely,

@openAI could somehow have any idea what a GPT-3 is thinking.

So I hereby offer a $1000 bounty—which I expect to go unclaimed—if @openAI has any means to tell us definitively whether GPT-3 was ‘deliberately’ sandbagging its attempt to recognize balanced parentheses, in that particular run of the AI Dungeon. With an exception for…

...answering merely by showing that, despite a lot of other attempts at prompting under more flexible circumstances, GPT-3 could not learn to balance parentheses as complicated as those tried by Breitman. (Which does answer the question, but in a less interesting way.)

If @openAI can’t claim that bounty, I encourage them to develop tools for recording inputs, recording randseeds, and making sure all runs of GPTs are exactly reproducible; and much more importantly and difficultly, getting greater internal transparency into future AI processes.

Regardless, I unironically congratulate @openAI on demonstrating something that could plausibly be an alignment failure of this extremely-important-in-general type, thereby sharply highlighting the also-important fact that now we have no idea whether that really happened. (END.)

As stated, this bounty would only be paid out to OpenAI.

I’m still posting it under the “Bounties” tag, for two reasons:

1) I don’t find it implausible that someone could at least make progress on Eliezer’s question with clever prompting of the API, in a way that would be of interest to him and others, even if it didn’t result in any bounty.

2) I like to collect instances of public bounties in a single place, for future reference. I think they are a very interesting, and underused, strategy for navigating the world. The LessWrong “Bounties (active)” and “Bounties (closed)” tags work well for that.