Firming Up Not-Lying Around Its Edge-Cases Is Less Broadly Useful Than One Might Initially Think

Re­ply to: Meta-Hon­esty: Firm­ing Up Hon­esty Around Its Edge-Cases

Eliezer Yud­kowsky, list­ing ad­van­tages of a “wiz­ard’s oath” eth­i­cal code of “Don’t say things that are liter­ally false”, writes—

Re­peat­edly ask­ing your­self of ev­ery sen­tence you say aloud to an­other per­son, “Is this state­ment ac­tu­ally and liter­ally true?”, helps you build a skill for nav­i­gat­ing out of your in­ter­nal smog of not-quite-truths.

I mean, that’s one hy­poth­e­sis about the psy­cholog­i­cal effects of adopt­ing the wiz­ard’s code.

A potential problem with this is that human natural language contains a lot of ambiguity. Words can be used in many ways depending on context. Even the specification “literally” in “literally false” is less useful than it initially appears: the way people ordinarily speak when they’re being truthful is actually pretty dense with metaphors that we typically don’t notice as metaphors, because they’re common enough to be recognized as legitimate uses that all fluent speakers will understand.

For ex­am­ple, if I want to con­vey the mean­ing that our study group has cov­ered a lot of ma­te­rial in to­day’s ses­sion, and I say, “Look how far we’ve come to­day!” it would be pretty weird if you were to ob­ject, “Liar! We’ve been in this room the whole time and haven’t phys­i­cally moved at all!” be­cause in this case, it re­ally is ob­vi­ous to all or­di­nary English speak­ers that that’s not what I meant by “how far we’ve come.”

Other times, the “in­tended”[1] in­ter­pre­ta­tion of a state­ment is not only not ob­vi­ous, but speak­ers can even mis­lead by mo­ti­vat­edly equiv­o­cat­ing be­tween differ­ent defi­ni­tions of words: the im­mor­tal Scott Alexan­der has writ­ten a lot about this phe­nomenon un­der the la­bels “motte-and-bailey doc­trine” (as coined by Ni­cholas Shackel) and “the non­cen­tral fal­lacy”.

For ex­am­ple, Zvi Mow­show­itz has writ­ten about how the claim that “ev­ery­body knows” some­thing[2] is of­ten used to es­tab­lish fic­ti­tious so­cial proof, or silence those at­tempt­ing to tell the thing to peo­ple who re­ally don’t know, but it feels weird (to my in­tu­ition, at least) to call it a “lie”, be­cause the speaker can just say, “Okay, you’re right that not liter­ally[3] ev­ery­one knows; I meant that most peo­ple know but was us­ing a com­mon hy­per­bolic turn-of-phrase and I rea­son­ably ex­pected you to figure that out.”

So the ques­tion “Is this state­ment ac­tu­ally and liter­ally true?” is it­self po­ten­tially am­bigu­ous. It could mean ei­ther—

  • “Is this state­ment ac­tu­ally and liter­ally true as the au­di­ence will in­ter­pret it?”; or,

  • “Does this state­ment per­mit an in­ter­pre­ta­tion un­der which it is ac­tu­ally and liter­ally true?”

But while the former is com­pli­cated and hard to es­tab­lish, the lat­ter is … not nec­es­sar­ily that strict of a con­straint in most cir­cum­stances?

Think about it. When’s the last time you needed to con­sciously tell a bald-faced, un­am­bigu­ous lie?—some­thing that could re­al­is­ti­cally be out­right proven false in front of your peers, rather than dis­missed with a “rea­son­able” amount of lan­guage-lawyer­ing. (Whether “Fine” is a lie in re­sponse to “How are you?” de­pends on ex­actly what “Fine” is un­der­stood to mean in this con­text. “Be­ing ac­cept­able, ad­e­quate, pass­able, or satis­fac­tory”—to what stan­dard?)

Maybe I’m un­usu­ally hon­est—or pos­si­bly un­usu­ally bad at re­mem­ber­ing when I’ve lied!?—but I’m not sure I even re­mem­ber the last time I told an out­right un­am­bigu­ous lie. The kind of situ­a­tion where I would need to do that just doesn’t come up that of­ten.

Now ask your­self how of­ten your speech has been par­tially op­ti­mized for any func­tion other than pro­vid­ing listen­ers with in­for­ma­tion that will help them bet­ter an­ti­ci­pate their ex­pe­riences. The an­swer is, “Every time you open your mouth”[4]—and if you dis­agree, then you’re ly­ing. (Even if you only say true things, you’re more likely to pick true things that make you look good, rather than your most em­bar­rass­ing se­crets. That’s op­ti­miza­tion.)

In the study of AI al­ign­ment, it’s a tru­ism that failures of al­ign­ment can’t be fixed by de­on­tolog­i­cal “patches”. If your AI is ex­hibit­ing weird and ex­treme be­hav­ior (with re­spect to what you re­ally wanted, if not what you ac­tu­ally pro­grammed), then adding a penalty term to ex­clude that spe­cific be­hav­ior will just re­sult in the AI ex­e­cut­ing the “near­est un­blocked” strat­egy, which will prob­a­bly also be un­de­sir­able: if you pre­vent your hap­piness-max­i­miz­ing AI from ad­minis­ter­ing heroin to hu­mans, it’ll start ad­minis­ter­ing co­caine; if you hard­code a list of banned hap­piness-pro­duc­ing drugs, it’ll start re­search­ing new drugs, or just pay hu­mans to take heroin, &c.
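As a toy sketch of that dynamic (the action names and scores here are invented purely for illustration, and `best_action` and `patch_until` are hypothetical helpers, not anyone's actual alignment proposal), consider an optimizer that picks the highest-scoring action not on an explicit ban list:

```python
# Toy illustration only: every action name and score below is made up.
ACTIONS = {
    "administer heroin": 10.0,
    "administer cocaine": 9.0,
    "pay humans to take heroin": 8.5,
    "research a new euphoric drug": 8.0,
    "improve people's lives": 3.0,
}


def best_action(scores, banned=frozenset()):
    """Return the highest-scoring action that is not explicitly banned."""
    allowed = {action: score for action, score in scores.items() if action not in banned}
    return max(allowed, key=allowed.get)


def patch_until(scores, acceptable, max_patches=10):
    """Ban each undesired choice in turn and record what the optimizer picks next."""
    banned = set()
    history = []
    for _ in range(max_patches):
        choice = best_action(scores, banned)
        history.append(choice)
        if choice in acceptable:
            break
        banned.add(choice)  # the deontological "patch"
    return history


if __name__ == "__main__":
    for step in patch_until(ACTIONS, {"improve people's lives"}):
        print(step)
```

Each patch just moves the optimizer to the next-highest-scoring action. The toy only reaches the intended behavior because its action list is finite and happens to contain it; a real agent searching an open-ended space has no such guarantee.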

Hu­mans are also in­tel­li­gent agents. (Um, sort of.) If you don’t gen­uinely have the in­tent to in­form your au­di­ence, but con­sider your­self eth­i­cally bound to be hon­est, but your con­cep­tion of hon­esty is sim­ply “not ly­ing”, you’ll nat­u­rally grav­i­tate to­wards the near­est un­blocked cog­ni­tive al­gorithm of de­cep­tion.[5]

So an­other hy­poth­e­sis about the psy­cholog­i­cal effects of adopt­ing the wiz­ard’s code is that—how­ever no­ble your ini­tial con­scious in­tent was—in the face of suffi­ciently strong in­cen­tives to de­ceive, you just end up ac­ci­den­tally train­ing your­self to get re­ally good at mis­lead­ing peo­ple with a va­ri­ety of not-tech­ni­cally-ly­ing rhetor­i­cal tac­tics (motte-and-baileys, false im­pli­ca­tures, stonewal­ling, se­lec­tive re­port­ing, clever ra­tio­nal­ized ar­gu­ments, ger­ry­man­dered cat­e­gory bound­aries, &c.), all the while con­grat­u­lat­ing your­self on how “hon­est” you are for never, ever emit­ting any “liter­ally” “false” in­di­vi­d­ual sen­tences.


Ayn Rand’s novel At­las Shrugged[6] por­trays a world of crony cap­i­tal­ism in which poli­ti­ci­ans and busi­ness­men claiming to act for the “com­mon good” (and not con­sciously ly­ing) are ac­tu­ally us­ing force and fraud to tem­porar­ily en­rich them­selves while de­stroy­ing the credit-as­sign­ment mechanisms So­ciety needs to co­or­di­nate pro­duc­tion.[7]

In one scene, Eddie Willers (right-hand man to our railroad executive heroine Dagny Taggart) expresses horror that the government’s official scientific authority, the State Science Institute, has issued a hit piece denouncing the new alloy, Rearden Metal, which our protagonists have been planning to use to build a critical railroad line. (In actuality, we later find out, the Institute leaders want to spare themselves the embarrassment, and therefore the potential loss of legislative funding, of the innovative new alloy having been invented by private industry rather than the Institute’s own metallurgy department.)

“The State Science In­sti­tute,” he said quietly, when they were alone in her office, “has is­sued a state­ment warn­ing peo­ple against the use of Rear­den Me­tal.” He added, “It was on the ra­dio. It’s in the af­ter­noon pa­pers.”

“What did they say?”

“Dagny, they didn’t say it! … They haven’t re­ally said it, yet it’s there—and it—isn’t. That’s what’s mon­strous about it.”

[...] He pointed to the news­pa­per he had left on her desk. “They haven’t said that Rear­den Me­tal is bad. They haven’t said it’s un­safe. What they’ve done is …” His hands spread and dropped in a ges­ture of fu­til­ity.

She saw at a glance what they had done. She saw the sen­tences: “It may be pos­si­ble that af­ter a pe­riod of heavy us­age, a sud­den fis­sure may ap­pear, though the length of this pe­riod can­not be pre­dicted. … The pos­si­bil­ity of a molec­u­lar re­ac­tion, at pre­sent un­known, can­not be en­tirely dis­counted. … Although the ten­sile strength of the metal is ob­vi­ously demon­stra­ble, cer­tain ques­tions in re­gard to its be­hav­ior un­der un­usual stress are not to be ruled out. … Although there is no ev­i­dence to sup­port the con­tention that the use of the metal should be pro­hibited, a fur­ther study of its prop­er­ties would be of value.”

“We can’t fight it. It can’t be an­swered,” Ed­die was say­ing slowly. “We can’t de­mand a re­trac­tion. We can’t show them our tests or prove any­thing. They’ve said noth­ing. They haven’t said a thing that could be re­futed and em­bar­rass them pro­fes­sion­ally. It’s the job of a cow­ard. You’d ex­pect it from some con-man or black­mailer. But, Dagny! It’s the State Science In­sti­tute!”

I think Eddie is right to feel horrified and betrayed here. At the same time, it’s notable that with respect to the wizard’s code, no lying has taken place.

I like to imag­ine the state­ment hav­ing been drafted by an ideal­is­tic young sci­en­tist in the moral maze of Dr. Floyd Fer­ris’s office at the State Science In­sti­tute. Our sci­en­tist knows that his boss, Dr. Fer­ris, ex­pects a state­ment that will make Rear­den Me­tal look bad; the nega­tive con­se­quences to the sci­en­tist’s ca­reer for failing to pro­duce such a state­ment will be se­vere. (Dr. Fer­ris didn’t say that, but he didn’t have to.) But the lab re­sults on Rear­den Me­tal came back with fly­ing col­ors—by ev­ery available test, the al­loy is su­pe­rior to steel along ev­ery di­men­sion.

Pity the dilemma of our poor sci­en­tist! On the one hand, sci­en­tific in­tegrity. On the other hand, the in­cen­tives.

He de­cides to fol­low a rule that he thinks will pre­serve his “in­ner agree­ment with truth which al­lows ready recog­ni­tion”: af­ter ev­ery sen­tence he types into his re­port, he will ask him­self, “Is this state­ment ac­tu­ally and liter­ally true?” For that is his mas­tery.

Thus, his writ­ing pro­cess goes like this—

“It may be possible that after a period of heavy usage, a sudden fissure may appear.” Is this statement actually and literally true? Yes! It may be possible!

“The pos­si­bil­ity of a molec­u­lar re­ac­tion, at pre­sent un­known, can­not be en­tirely dis­counted.” Is this state­ment ac­tu­ally and liter­ally true? Yes! The pos­si­bil­ity of a molec­u­lar re­ac­tion, at pre­sent un­known, can­not be en­tirely dis­counted. Okay, so there’s not enough ev­i­dence to sin­gle out that pos­si­bil­ity as worth pay­ing at­ten­tion to. But there’s still a chance, right?

“Although the ten­sile strength of the metal is ob­vi­ously demon­stra­ble, cer­tain ques­tions in re­gard to its be­hav­ior un­der un­usual stress are not to be ruled out.” Is this state­ment ac­tu­ally and liter­ally true? Yes! The lab tests demon­strated the metal’s un­prece­dented ten­sile strength. But cer­tain ques­tions in re­gard to its be­hav­ior un­der un­usual stress are not to be ruled out—the prob­a­bil­ity isn’t zero.

And so on. You see the prob­lem. Per­haps a mem­ber of the gen­eral pub­lic who knew about the cor­rup­tion at the State Science In­sti­tute could read the re­port and in­fer the ex­is­tence of hid­den ev­i­dence: “Wow, even when try­ing their hard­est to trash Rear­den Me­tal, this is the worst they could come up with? Rear­den Me­tal must be pretty great!”

But they won’t. An institution that proclaims to be dedicated to “science” is asking for a very high level of trust, and in the absence of a trustworthy auditor, they might get it. Science is complicated enough, and natural language is ambiguous enough, that that kind of trust can be betrayed without lying.

I want to emphasize that I’m not saying the report-drafting scientist in the scenario I’ve been discussing is a “bad person.” (As it is written, almost no one is evil; almost everything is broken.) Under more favorable conditions, in a world where metallurgists had the academic freedom to speak the truth as they see it (even if their voice trembles) without being threatened with ostracism and starvation, the sort of person who finds the wizard’s oath appealing wouldn’t even be tempted to engage in these kinds of not-technically-lying shenanigans. But the point of the wizard’s oath is to constrain you, to have a simple bright-line rule to force you to be truthful, even when other people are making that genuinely difficult. Yudkowsky’s meta-honesty proposal is a clever attempt to strengthen the foundations of this ethic by formulating a more complicated theory that can account for the edge-cases under which even unusually honest people typically agree that lying is okay, usually due to extraordinary coercion by an adversary, as with the proverbial murderer or Gestapo officer at the door.

And yet it’s pre­cisely in ad­ver­sar­ial situ­a­tions that the wiz­ard’s oath is most con­strain­ing (and thus, ar­guably, most use­ful). You prob­a­bly don’t need spe­cial eth­i­cal in­hi­bi­tions to tell the truth to your friends, be­cause you should ex­pect to benefit from friendly agents hav­ing more ac­cu­rate be­liefs.

But an enemy who wants to use information to hurt you is more constrained if the worst they can do is selectively report harmful-to-you true things, rather than just making things up. By symmetry, if you want to use information to hurt an enemy, you are more constrained if the worst you can do is selectively report harmful-to-the-enemy true things, rather than just making things up.

Thus, while the study of how to min­i­mize in­for­ma­tion trans­fer to an ad­ver­sary un­der the con­straint of not ly­ing is cer­tainly in­ter­est­ing, I ar­gue that this “firm­ing up” is of limited prac­ti­cal util­ity given the ubiquity of other kinds of de­cep­tion. A the­ory of un­der what con­di­tions con­scious ex­plicit un­am­bigu­ous out­right lies are ac­cept­able doesn’t help very much with com­bat­ing in­tel­lec­tual dishon­esty—and I fear that in­tel­lec­tual dishon­esty, plus suffi­cient in­tel­li­gence, is enough to de­stroy the world all on its own, with­out the help of con­scious ex­plicit un­am­bigu­ous out­right lies.

Un­for­tu­nately, I do not, at pre­sent, have a su­pe­rior al­ter­na­tive eth­i­cal the­ory of hon­esty to offer. I don’t know how to un­ravel the web of de­ceit, ra­tio­nal­iza­tion, ex­cuses, dis­in­for­ma­tion, bad faith, fake news, phoni­ness, gaslight­ing, and fraud that threat­ens to con­sume us all. But one thing I’m pretty sure won’t help much is clever logic puz­zles about im­plau­si­bly so­phis­ti­cated Nazis.

(Thanks to Michael Vas­sar for feed­back on an ear­lier draft.)


  1. I’m scare-quot­ing “in­tended” be­cause this pro­cess isn’t nec­es­sar­ily con­scious, and prob­a­bly usu­ally isn’t. In­ter­nal dis­tor­tions of re­al­ity in im­perfectly de­cep­tive so­cial or­ganisms can be adap­tive for the func­tion of de­ceiv­ing con­speci­fics. ↩︎

  2. If I had writ­ten this post, I would have ti­tled it “Fake Com­mon Knowl­edge” (fol­low­ing in the tra­di­tion of “Fake Ex­pla­na­tions”, “Fake Op­ti­miza­tion Cri­te­ria”, “Fake Causal­ity”, &c.) ↩︎

  3. But it’s worth not­ing that the “Is this state­ment ac­tu­ally and liter­ally true?” test, taken liter­ally, should have caught this, even if my in­tu­ition still doesn’t want to call it a “lie.” ↩︎

  4. Ac­tu­ally, that’s not liter­ally true! You of­ten open your mouth to breathe or eat with­out say­ing any­thing at all! Is the refer­ent of this foot­note then a blatant lie on my part?—or can I ex­pect you to know what I meant? ↩︎

  5. A similar phe­nomenon may oc­cur with other at­tempts at eth­i­cal bind­ings: for ex­am­ple, con­fi­den­tial­ity promises. Sup­pose Open Opal tends to wear her heart on her sleeve and more speci­fi­cally, be­lieves in lies of omis­sion: if she’s talk­ing with some­one she trusts, and she has in­for­ma­tion rele­vant to that con­ver­sa­tion, she finds it in­cred­ibly psy­cholog­i­cally painful to pre­tend not to know that in­for­ma­tion. If Para­noid Paris has much stronger pri­vacy in­tu­itions than Opal and wants to mes­sage her about a sen­si­tive sub­ject, Paris might de­mand a promise of se­crecy from Opal (“Don’t share the con­tent of this con­ver­sa­tion”)—only to spark con­flict later when Opal con­strues the literal text of the promise more nar­rowly than Paris might have hoped (“‘Don’t share the con­tent’ means don’t share the ver­ba­tim text, right? I’m still al­lowed to para­phrase things Paris said and at­tribute them to an anony­mous cor­re­spon­dent when I think that’s rele­vant to what­ever con­ver­sa­tion I’m in, even though that hy­po­thet­i­cally leaks en­tropy if Paris has im­plau­si­bly de­ter­mined en­e­mies, right?”). ↩︎

  6. I know, fictional evidence, but I claim that the kind of deception illustrated in the quoted passage to follow is entirely realistic. ↩︎

  7. Okay, that’s prob­a­bly not ex­actly how Rand or her acolytes would put it, but that’s how I’m in­ter­pret­ing it. ↩︎