$1000 bounty for OpenAI to show whether GPT-3 was “deliberately” pretending to be stupider than it is

Link post

Twitter thread by Eliezer Yudkowsky, with the bounty in bold:

So I don’t want to sound alarms prematurely, here, but we could possibly be looking at the first case of an AI pretending to be stupider than it is. In this example, GPT-3 apparently fails to learn/understand how to detect balanced sets of parentheses.

Now, it’s possible that GPT-3 “legitimately” did not understand this concept, even though GPT-3 can, in other contexts, seemingly write code or multiply 5-digit numbers. But it’s also possible that GPT-3, playing the role of John, predicted that *John* wouldn’t learn it.

It’s tempting to anthropomorphize GPT-3 as trying its hardest to make John smart. That’s what we want GPT-3 to do, right? But what GPT-3 actually does is predict text continuations. If *you* saw John say all that—would you *predict* the next lines would show John succeeding?

So it could be that GPT-3 straight-up can’t recognize balanced parentheses. Or it could be that GPT-3 could recognize them given a different prompt. Or it could be that the cognition inside GPT-3 does see the pattern, but play-acts the part of ‘John’ getting it wrong.
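For reference, the balanced-parentheses task at issue is mechanically simple: a single counter scan over the string suffices. A minimal Python sketch of the check (illustrative only; this is not the prompt used in the AI Dungeon run):

```python
def is_balanced(s: str) -> bool:
    """Return True if every '(' in s is matched by a later ')'."""
    depth = 0
    for ch in s:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:      # a ')' with no preceding unmatched '('
                return False
    return depth == 0          # no unclosed '(' remaining

# The kind of strings "John" was asked to classify:
print(is_balanced("(()())"))   # True
print(is_balanced("(()"))      # False
print(is_balanced("())("))     # False
```

The point of the thread is not that the task is hard, but that a text predictor may model a *character* who fails at it.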

The scariest feature of this whole incident? We have no idea if that happened. Nobody has any idea what GPT-3 is ‘thinking’. We have no idea whether this run of GPT-3 contained a more intelligent cognition that faked a less intelligent cognition.

Now, I *could* be wrong about that last part!

@openAI could be storing a record of all inputs and randseeds used in GPT-3 instances, so that they can reconstruct any interesting runs. And though it seems less likely,

@openAI could somehow have some idea what a GPT-3 is thinking.

So I hereby offer a $1000 bounty—which I expect to go unclaimed—**if @openAI has any means to tell us definitively whether GPT-3 was ‘deliberately’ sandbagging its attempt to recognize balanced parentheses, in that particular run of the AI Dungeon.** With an exception for…

...answering merely by showing that, despite a lot of other attempts at prompting under more flexible circumstances, GPT-3 could not learn to balance parentheses as complicated as those tried by Breitman. (Which does answer the question, but in a less interesting way.)

If @openAI can’t claim that bounty, I encourage them to develop tools for recording inputs, recording randseeds, and making sure all runs of GPTs are exactly reproducible; and much more importantly and difficultly, getting greater internal transparency into future AI processes.
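The record-and-replay tooling suggested here comes down to logging every (prompt, seed) pair so that any interesting run can be regenerated bit-for-bit later. A minimal Python sketch of the idea, where the hypothetical `sample` function stands in for a model's stochastic sampler:

```python
import random

def sample(prompt: str, seed: int) -> str:
    """Stand-in for a model's sampler: the same (prompt, seed) pair
    must always produce the same output."""
    rng = random.Random(seed)  # isolated RNG, fully determined by seed
    return "".join(rng.choice("abcdefgh") for _ in range(8))

log = []  # the record of all runs

def logged_sample(prompt: str, seed: int) -> str:
    """Run the sampler, but first record everything needed to replay it."""
    log.append((prompt, seed))
    return sample(prompt, seed)

out = logged_sample("Is '(()' balanced?", seed=42)

# Any logged run can later be reconstructed exactly:
prompt, seed = log[-1]
assert sample(prompt, seed) == out
```

Real inference stacks add complications (batching, nondeterministic GPU kernels, model-version drift), so exact replay also requires pinning those; the sketch only shows the logging discipline itself.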

Regardless, I unironically congratulate @openAI on demonstrating something that could plausibly be an alignment failure of this extremely-important-in-general type, thereby sharply highlighting the also-important fact that now we have no idea whether that really happened. (END.)

As stated, this bounty would only be paid out to OpenAI.

I’m still posting it under the “Bounties” tag, for two reasons:

1) I don’t find it implausible that someone could at least make progress on Eliezer’s question with clever prompting of the API, in a way that would be of interest to him and others, even if it didn’t result in any bounty.

2) I like to collect instances of public bounties in a single place, for future reference. I think they are a very interesting and underused strategy for navigating the world. The LessWrong “Bounties (active)” and “Bounties (closed)” tags work well for that.