The Japanese Quiz: a Thought Experiment of Statistical Epistemology

This post is an excerpt from my book, “Notes on a New Philosophy of Empirical Science”. I’ve found it hard to publish small pieces from the book, because each concept is inextricably linked in my mind to all the other concepts. But I think this section and the related discussion can stand on its own. It starts with a story about a physicist named Sophie, who encounters an odd group of language activists while on holiday.

The Society for the Promotion of the Japanese Language

After spending many months developing her theory, building the software to test the theory, and writing up the results for publication, Sophie decided to take a long vacation on the West Coast. She visited the cafes and bookstores of Vancouver and Seattle, and took long hikes in the beautiful parks of the Pacific Northwest. Then she took the train down through Oregon and Northern California to San Francisco, where she planned to enjoy the mild weather and visit some old friends who worked in the tech sector.

During her stay in San Francisco, Sophie decided to visit the city’s Japantown, because she had a special appreciation for artisanal Japanese kitchenware, which she found to be useful not just for cooking but sometimes also for ad hoc repair work in the lab. As she was window shopping in the neighborhood, a strange sight caught her attention. It was a booth, manned by well-dressed, older Asian people, who seemed to be distributing pamphlets. At first she thought it was a social activist group, but then she noticed a poster with the words “Society for the Promotion of the Japanese Language” at the top. Sophie had always been interested in Japanese, so she approached the table.

After a comically polite interaction with one of the fashionable men, she came to learn that the organization was working to encourage people across the globe to learn Japanese. However, they did not provide courses or educational material themselves. Instead, their strategy was to create an incentive for people to learn the language, by offering a prize of ten thousand dollars to anyone who reached a certain level of proficiency. They measured this proficiency by proctoring a short quiz.

Sophie did not know any Japanese, but she had long believed that she could pass any standardized test simply by exploiting hints that were implicit in the phrasing of the questions (it’s usually “all of the above”). So she paid an application fee, was led to a nearby testing facility, and was given a seat at a desk in a small room, which contained a couple of other glum test-takers. The test proctor gave her a copy of the exam, and she filled out her name and the date on the front page. The second page had some instructions in Japanese, which Sophie did not understand. In addition, it had two columns of 10 Japanese sentences each, marked A and B. Beneath the columns were 10 additional sentences, with a blank spot next to each one. The sentences were as follows [1]:

Category A

バスケのアルビ、運営会社長がパワハラ　無期限謹慎に
酒井と植田、フル出場　サッカー、フランス部
ＮＨＬ、パンサーズが６連勝　第１２週、中地区で首位
松山は３０位、スピースが優勝　米男子ゴルフ最終日
教科書に載った「ゼッケン67」あの選手の孫が日本で
マスターズ、８日夜から　２９歳松山英樹、練習に熱
大谷の記録を三塁打に訂正２日の試合、打率３割に
レッドソックス沢村、１回無失点
明徳義塾高の馬淵史郎監督、野球U18代表監督に就任
五輪の聖火、愛知の豊田へ三重に引き継ぐ

Category B

インドネシアで土砂崩れと大洪水、69人死亡　被害拡大
衝突トラック、列車の通過前から線路上に　台湾脱線事故
ローマ教皇「最貧国にワクチンを」　復活祭ミサで訴え
韓国LG電子、スマホ事業から撤退　6期連続の営業赤字
沖縄で155人感染、過去2番目　知事「強い対策も」
「含まれない」一転、泡消火剤に有害物質　空自那覇基地
無人トラックが20m走り列車に…50人死亡脱線事故
北朝鮮のサイバー攻撃「1日158万回」　資金狙いか
タイのゾウ、重さ1.7キロの膀胱結石　尿が出ず…手術
衝突1.9秒前に急ブレーキ　台湾脱線事故、映像を公開

Unlabelled

CNNの取材を受けた市民拘束　ミャンマーで現地報道
パラ幅跳び日本記録を体感　体育館にシート登場　埼玉
26歳スイマーが初の五輪切符　引き留めた恩師の言葉
あの竹刀、外国選手対策だった　全空連強化委員長が辞任
青山、柴原組が今季３勝目
ミスコン出場の大学生「ミャンマーを助けて」　涙の訴え
豪州人のスーチー氏顧問も拘束、訴追　豪政府「解放を」
ナイジェリアで刑務所襲撃、銃撃戦　囚人1800人脱走
五輪決めたスイマーが抱き寄せた　敗れた28歳との絆
ヴィトンのロゴも…粗悪な偽マスク、1400点輸入拒む

Based on this format, Sophie deduced that the challenge was to infer the rule that was used to separate the phrases into the two categories, and then apply the rule to mark the unlabeled sentences. Unfortunately, she was dismayed to find that her normal tricks for handling standardized tests did not seem to work. She observed that the blocky, simple characters (katakana) were more common in category A, but only by a small margin. She also noticed that category B had longer sequences of the complex characters (kanji). Based on these observations, she filled out the blanks and handed in the paper. She was not surprised to learn that her score was ⁶⁄₁₀ - not much better than random. The proctor smiled apologetically and encouraged Sophie to work hard and do better next year.

A Failure of Statistical Epistemology?

Let’s look at the challenge posed by the Japanese quiz through a statistical lens. It’s quite easy to formulate the problem in the framework of classical supervised learning: the inputs are the Japanese phrases, the labels are the A/B categories, and the challenge is to predict the labels of the holdout group. One could easily train an ML classifier on this data using the Scikit-Learn library and a few lines of Python code.
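To make the "few lines of Python" concrete, here is a minimal sketch of the supervised formulation using the quiz data. The choice of features (TF-IDF over character 1- and 2-grams) and of model (logistic regression) is my own assumption for illustration; any standard classifier could be substituted. With only 20 labelled examples, the cross-validated score is a very noisy estimate and should not be read as evidence that the approach works.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

category_a = [
    "バスケのアルビ、運営会社長がパワハラ　無期限謹慎に",
    "酒井と植田、フル出場　サッカー、フランス部",
    "ＮＨＬ、パンサーズが６連勝　第１２週、中地区で首位",
    "松山は３０位、スピースが優勝　米男子ゴルフ最終日",
    "教科書に載った「ゼッケン67」あの選手の孫が日本で",
    "マスターズ、８日夜から　２９歳松山英樹、練習に熱",
    "大谷の記録を三塁打に訂正２日の試合、打率３割に",
    "レッドソックス沢村、１回無失点",
    "明徳義塾高の馬淵史郎監督、野球U18代表監督に就任",
    "五輪の聖火、愛知の豊田へ三重に引き継ぐ",
]
category_b = [
    "インドネシアで土砂崩れと大洪水、69人死亡　被害拡大",
    "衝突トラック、列車の通過前から線路上に　台湾脱線事故",
    "ローマ教皇「最貧国にワクチンを」　復活祭ミサで訴え",
    "韓国LG電子、スマホ事業から撤退　6期連続の営業赤字",
    "沖縄で155人感染、過去2番目　知事「強い対策も」",
    "「含まれない」一転、泡消火剤に有害物質　空自那覇基地",
    "無人トラックが20m走り列車に…50人死亡脱線事故",
    "北朝鮮のサイバー攻撃「1日158万回」　資金狙いか",
    "タイのゾウ、重さ1.7キロの膀胱結石　尿が出ず…手術",
    "衝突1.9秒前に急ブレーキ　台湾脱線事故、映像を公開",
]
unlabelled = [
    "CNNの取材を受けた市民拘束　ミャンマーで現地報道",
    "パラ幅跳び日本記録を体感　体育館にシート登場　埼玉",
    "26歳スイマーが初の五輪切符　引き留めた恩師の言葉",
    "あの竹刀、外国選手対策だった　全空連強化委員長が辞任",
    "青山、柴原組が今季３勝目",
    "ミスコン出場の大学生「ミャンマーを助けて」　涙の訴え",
    "豪州人のスーチー氏顧問も拘束、訴追　豪政府「解放を」",
    "ナイジェリアで刑務所襲撃、銃撃戦　囚人1800人脱走",
    "五輪決めたスイマーが抱き寄せた　敗れた28歳との絆",
    "ヴィトンのロゴも…粗悪な偽マスク、1400点輸入拒む",
]

texts = category_a + category_b
labels = ["A"] * len(category_a) + ["B"] * len(category_b)

# Character n-grams avoid the need for a Japanese word tokenizer
# (an illustrative choice, not the only reasonable one).
model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)

# Stratified 5-fold cross-validation on 20 examples: 4 held out per fold.
cv_scores = cross_val_score(model, texts, labels, cv=5)
print("cross-validated accuracy:", cv_scores.mean())

# Fit on all labelled sentences and fill in the blanks of the quiz.
model.fit(texts, labels)
predictions = model.predict(unlabelled)
for headline, label in zip(unlabelled, predictions):
    print(label, headline)
```

The formulation really is this short. Whether the resulting predictions are any good is a separate question.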

With a little bit more work, you could also approach the quiz as a problem of Bayesian inference. You start by writing down a list of hypotheses about the relationship between the characters and the categories - perhaps sentences that contain more katakana characters are more likely to belong to category A. Next, you write down a prior distribution: a reasonable option would be to give each of the T hypotheses a probability of 1/T. Then you apply Bayes’s rule to update the probability of each hypothesis based on the extent to which it agrees with the data. Finally you make predictions for the holdout group by combining all the individual predictions, weighted by the posterior probability of each hypothesis.
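The four steps in this recipe can be sketched in plain Python. Everything specific here is a hypothetical choice of mine for illustration: the four hypotheses, their thresholds, the 0.8/0.2 noise levels, and the use of only a small subset of the quiz sentences. Each hypothesis is represented as a function that returns the probability that a sentence belongs to category A.

```python
from math import prod

def katakana_fraction(text):
    """Fraction of characters in the Unicode katakana block (U+30A0-U+30FF)."""
    return sum(1 for ch in text if "\u30a0" <= ch <= "\u30ff") / len(text)

def longest_kanji_run(text):
    """Length of the longest run of CJK ideographs (U+4E00-U+9FFF)."""
    best = run = 0
    for ch in text:
        run = run + 1 if "\u4e00" <= ch <= "\u9fff" else 0
        best = max(best, run)
    return best

# Step 1: hypotheses, each mapping a sentence to P(category A | sentence).
# The rules and numbers are illustrative assumptions, not fitted values.
HYPOTHESES = {
    "katakana-heavy means A": lambda s: 0.8 if katakana_fraction(s) > 0.25 else 0.2,
    "long kanji runs mean B": lambda s: 0.2 if longest_kanji_run(s) >= 4 else 0.8,
    "short sentences are A": lambda s: 0.8 if len(s) < 22 else 0.2,
    "no pattern at all": lambda s: 0.5,
}

def likelihood(h, data):
    """P(observed labels | hypothesis), treating sentences as independent."""
    return prod(h(s) if label == "A" else 1.0 - h(s) for s, label in data)

def posterior(hypotheses, data):
    """Steps 2 and 3: uniform prior 1/T, then Bayes's rule."""
    T = len(hypotheses)
    unnormalized = {name: likelihood(h, data) / T for name, h in hypotheses.items()}
    evidence = sum(unnormalized.values())
    return {name: w / evidence for name, w in unnormalized.items()}

def predict_prob_a(hypotheses, post, sentence):
    """Step 4: average the hypotheses' predictions, weighted by the posterior."""
    return sum(post[name] * h(sentence) for name, h in hypotheses.items())

# A small subset of the labelled quiz sentences.
data = [
    ("レッドソックス沢村、１回無失点", "A"),
    ("ＮＨＬ、パンサーズが６連勝　第１２週、中地区で首位", "A"),
    ("松山は３０位、スピースが優勝　米男子ゴルフ最終日", "A"),
    ("ローマ教皇「最貧国にワクチンを」　復活祭ミサで訴え", "B"),
    ("衝突トラック、列車の通過前から線路上に　台湾脱線事故", "B"),
    ("北朝鮮のサイバー攻撃「1日158万回」　資金狙いか", "B"),
]

post = posterior(HYPOTHESES, data)
for name, p in post.items():
    print(f"{p:.3f}  {name}")

p_a = predict_prob_a(HYPOTHESES, post, "青山、柴原組が今季３勝目")
print("P(category A) for the unlabelled sentence:", round(p_a, 3))
```

Note that the machinery is indifferent to where the hypotheses come from. Nothing in it tells you to write down "this is a sports headline" as a candidate rule.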

While these approaches are easy to formulate, a convenient formulation does not guarantee success. Neither the classic ML idea nor the Bayesian version will have much chance of success at this problem. The puzzle is simply too complex, compared to the amount of data that is available. In ML terms, the model will have to be very simple to avoid overfitting, but a simple model will not capture enough knowledge of Japanese text to solve the problem well. In Bayesian terms, unless your prior is very good, you will need to use an extremely large number of hypotheses, corresponding to a very large T. But since the data set size N is small, the posterior probability will still be spread out among many hypotheses, leaving no clear winner.
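A back-of-the-envelope calculation shows how quickly the posterior gets diluted. The sketch below rests on simplifying assumptions of my own: every hypothesis is a deterministic rule, and a rule unrelated to the true one matches each binary label independently with probability 1/2. Under a uniform prior, every rule that fits all N labels then receives the same posterior mass, so the number of lucky survivors is what matters.

```python
import random

def expected_survivors(T, N):
    """Expected number of unrelated rules, out of T, that fit all N labels."""
    return T * 0.5 ** N

def simulate_survivors(T, N, seed=0):
    """Monte Carlo check: count random rules that happen to fit every label."""
    rng = random.Random(seed)
    survivors = 0
    for _ in range(T):
        # Each label is matched by chance with probability 1/2.
        if all(rng.random() < 0.5 for _ in range(N)):
            survivors += 1
    return survivors

N = 20  # labelled sentences in the quiz
for T in (10**6, 10**9, 10**12):
    k = expected_survivors(T, N)
    # With a uniform prior, each surviving rule gets posterior mass 1/k.
    print(f"T = {T:.0e}: about {k:.3g} rules fit all {N} labels by luck")

# A smaller case that is cheap to simulate directly.
simulated = simulate_survivors(T=100_000, N=10)
print("simulated survivors (T=100000, N=10):", simulated)
print("expected survivors  (T=100000, N=10):", expected_survivors(100_000, 10))
```

Since 2^20 is roughly a million, a hypothesis space of a billion rules leaves on the order of a thousand that fit the quiz data perfectly. The true rule, even if it is in the list, is indistinguishable from them.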

Now, the fact that classic ML and Bayes can’t do very well in this scenario is not a terrible black mark against them, since most humans can’t solve this puzzle either. If humans, with our big pattern-recognizing brains that are well-adapted to linguistic analysis, cannot solve it, then perhaps the puzzle is simply intractable in some fundamental statistical sense.

But there’s a subtlety here. While most humans cannot solve the problem, a special category of humans, those with the ability to read Japanese, will solve the problem easily. Japanese readers possess some special background knowledge that can be applied to puzzles of this type. In fact, this knowledge is decisively significant: the problem is either trivial or impossible, depending on whether the person can read Japanese. Furthermore, it is obvious to humans how to obtain this background knowledge. A human who knows that, in the future, she will be faced with many challenges like the Japanese quiz has no problem figuring out what to do. She should go study Japanese, and once she has a certain level of skill, the challenges will become easy.

This brings us to the point that is a bit embarrassing for Bayes and classic ML. If the right strategy to prepare for these challenges is obvious from the perspective of human folk epistemology, then it should be obvious for a good statistical epistemology as well. But that’s not the case. Neither Bayes nor classic ML tells us that, for certain categories of problem, the first step towards a solution is to build up an extensive database of background knowledge.

Transfer Learning to the Rescue: GPT-3 as Prior

In the discussion above I was careful to refer to the epistemology of classic ML. Unless you’ve been living under an enormous rock, you know that the field of ML has been moving and changing at a remarkable pace in the last decade. Modern ML has developed a new technique called transfer learning that can address problems like the Japanese quiz.

In the subfield of NLP, most modern applied work follows a standard pattern based on transfer learning. First you define your problem and collect some labelled data - this is often done using a crowdsourced data annotation service such as Amazon Mechanical Turk. Perhaps, simply as an exploratory step, you train some standard ML models directly on the new data. Generally speaking, your performance won’t be very good. To get good performance, you start by downloading a pre-trained model such as BERT. You feed your text into BERT and use the network activations of one of the upper layers as the input to a classic ML algorithm (which might just be a fully connected NN layer). Because the pre-trained model has a sophisticated understanding of English, it is able to transform the low-level text patterns into high-level, abstract concepts about what the text means. These high-level concepts are often exactly what you need to solve your specific problem, so the final ML algorithm, which is trained on your small data set, can get very good performance [2].
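The pattern described above can be sketched as follows, assuming the Hugging Face `transformers` library, PyTorch, and scikit-learn are installed. The specific choices are mine: the `bert-base-uncased` checkpoint, taking the final-layer vector at the [CLS] position as the sentence feature (one of several common options), and logistic regression as the small task-specific model. The pre-trained network is kept frozen and only the head is trained on your data.

```python
from sklearn.linear_model import LogisticRegression

def embed_texts(texts, model_name="bert-base-uncased"):
    """Return one feature vector per text from a frozen pre-trained encoder."""
    # Imported lazily: these are heavy dependencies, and loading the
    # checkpoint downloads the model weights on first use.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    batch = tokenizer(list(texts), padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():  # no gradients: the encoder is not fine-tuned here
        hidden = model(**batch).last_hidden_state  # shape: (batch, tokens, hidden)
    # Use the final-layer activation at the first ([CLS]) position.
    return hidden[:, 0, :].numpy()

def train_head(features, labels):
    """Fit the small task-specific classifier on top of the frozen features."""
    return LogisticRegression(max_iter=1000).fit(features, labels)

def classify(texts, train_texts, train_labels, model_name="bert-base-uncased"):
    """End to end: embed the labelled data, train the head, label new texts."""
    head = train_head(embed_texts(train_texts, model_name), train_labels)
    return head.predict(embed_texts(texts, model_name))
```

All of the knowledge about language lives in `embed_texts`; the part that sees your small data set is a single linear model. For text in another language, `model_name` would need to point to a checkpoint pre-trained on that language.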

This strategy has been taken to an even higher level by GPT-3, OpenAI’s astonishingly powerful language model. GPT-3 is such a powerful pre-trained model that for many applied NLP tasks, you don’t even really need to do the data collection and last-mile ML training. Instead, the trick with GPT-3 is prompt engineering: you need to figure out how to compose an initial set of sentences, such that GPT-3’s continuation of the sentences will contain the information you want. The power to solve such tasks without an additional, task-specific training procedure is called zero-shot learning (or few-shot learning, when the prompt itself includes a handful of worked examples).
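As a rough illustration of what prompt engineering means for a task like the quiz, the sketch below assembles a prompt from a few labelled sentences and leaves the final label blank for the model to continue. The wording of the prompt and the helper name `build_prompt` are hypothetical, and the call that actually sends the prompt to a language model is deliberately left out, since it depends on the API being used. Because the prompt contains labelled examples, this is the few-shot variant of the idea; no model weights are updated.

```python
def build_prompt(labelled_examples, query):
    """Compose a prompt whose natural continuation is the query's category."""
    lines = ["Each headline below belongs to category A or category B.", ""]
    for text, label in labelled_examples:
        lines += [f"Headline: {text}", f"Category: {label}", ""]
    # The final label is left open for the language model to fill in.
    lines += [f"Headline: {query}", "Category:"]
    return "\n".join(lines)

examples = [
    ("レッドソックス沢村、１回無失点", "A"),
    ("松山は３０位、スピースが優勝　米男子ゴルフ最終日", "A"),
    ("ローマ教皇「最貧国にワクチンを」　復活祭ミサで訴え", "B"),
    ("韓国LG電子、スマホ事業から撤退　6期連続の営業赤字", "B"),
]

prompt = build_prompt(examples, "青山、柴原組が今季３勝目")
print(prompt)
```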

The power of the pre-trained model is an exact parallel to the power of language ability in the thought experiment about the Japanese quiz. In both cases, a problem that was impossible without strong prior knowledge suddenly becomes trivial. This parallel suggests that the idea of transfer learning can fill the gap in statistical epistemology that we observed above.

Translating between ML terms and Bayesian concepts, we can identify pre-trained models such as BERT and GPT-3 with the Bayesian prior. The models contain background knowledge that enables us to formulate a good prior by using concepts and abstractions that are very well-suited to the problem domain. So, phrased in Bayesian terms, the methodology of transfer learning is something like “First study some related task to obtain an exceptionally precise prior, then use that prior to solve the problem of interest”.

While the idea of transfer learning makes intuitive sense, it is philosophically murky and there are many remaining questions. The transfer learning strategy works for NLP and computer vision, but does it work in general? Are we satisfied that these knowledge bases - which may in time come to undergird a huge portion of our technological civilization - should be expressed in the opaque language of neural network layers and weights? Can we objectively verify the quality of a knowledge base? Perhaps most important is the problem of trust - how can we convince skeptical outsiders that a knowledge base is scientifically valid?

The Problem of Trust

For most tasks in NLP and computer vision, there is usually not much need for public trust. A company develops a system for its own use, and if the system performs poorly, the company suffers the consequences. But for many real-world applications of statistics, particularly in the realm of medicine, public trust is absolutely essential. Consider the following stylized dialogue between an FDA official and a researcher from a pharmaceutical company (RPC) who wishes to obtain approval for a new life extension drug [3].

RPC: Hello, thanks for taking the time to review our application.

FDA: Sure.

RPC: We have developed a life-extension drug with amazing powers! We conducted an RCT, and the members of the treatment group benefitted by an entire year of additional life compared to the control! Here is the data.

FDA: (studying) Hmmm. This table seems to indicate that the actual life spans were the same between the two groups.

RPC: That’s what a superficial analysis shows. In reality, due to statistical variations, the treatment group happened to be less healthy overall. So the fact that the two groups had about the same lifespan actually indicates the efficacy of the drug. You can see the health scores for each individual in Appendix B.

FDA: Okay, but how did you calculate these health scores?

RPC: (produces a 32-GB USB memory stick) We conducted an extensive biometric analysis of each individual, and calculated the health score using a sophisticated model of human biology. All the data is on this USB.

FDA: What do you mean, “model of human biology”? Where did you get this model?

RPC: (nervous) We… uhhm… developed it on our own, using sophisticated deep learning methods.

FDA: (exasperated) This is nonsense. How are we supposed to verify that your model is correct? You probably just generated a bunch of fake numbers. Come back when you have a real analysis to show me.

RPC: But… this is a wonder drug! It will save billions of QALYs! Every year you delay equates to millions of lost lives!

FDA: (shuffles the researcher out of the office) Next!

While many rationalists are sympathetic to the longevity researcher, I think most of us can see the FDA official’s perspective. The RPC’s argument depends on a sophisticated database of background knowledge - the “health model” that is able to assign health scores to each experimental subject. The FDA official cannot simply take the researcher’s word that this prior knowledge is valid.

Toward a New Epistemology of Statistics

I believe that the issues I’ve outlined above point toward a dramatic reconception of what it means to do statistics. In the classical approaches, the problem of obtaining good prior knowledge is swept under the rug. The frequentists suggest that prior knowledge is scientifically inadmissible, because it is not objective. The Bayesians claim (correctly) that the prior is essential. But the focus of Bayesian analysis is on what to do after you write down the prior and observe the data - not much time is spent on figuring out how to obtain the prior in the first place [4].

In the new perspective, obtaining good prior knowledge is the whole game. What procedures can we utilize to refine our prior? Given that these knowledge databases may be enormously complex (GPT-3 has billions of parameters), what is the best form in which to represent them? To what extent can knowledge obtained in the pursuit of some task A be appropriately transferred to some related task B? Given two priors, how do we choose which one is better, before observing the data for a particular problem? How do we convince third parties—like the FDA—that our prior knowledge is scientifically legitimate?

I believe scientists of the future will look back with bemused disdain at the statistics-based research that is so common today in fields like medicine, nutrition, economics, and psychology. How could any reasonably intelligent person believe that an N=200 statistical analysis could yield any meaningful conclusions about a system as complex as the human body or the global economy, without an enormously sophisticated assemblage of background knowledge? Of course you need to start by constructing the background knowledge, and of course that is the main part of the challenge.

Many rationalists subscribe to the Bayesian paradigm of statistics. And Bayes is correct—as far as it goes. But Bayes will allow its adherents to bang their heads against the wall of problems like the Japanese quiz, wasting research dollars and years of life. A fully-realized statistical paradigm should shout at us to postpone work on problems like the quiz and focus instead on obtaining highly refined prior knowledge. A modern, updated version of Bayesian philosophy could open the door to a vast new continent of unexplored scientific territory.

Notes

[1] The phrases are headlines from the Asahi Shimbun; you can view the data here, including Google translations of the headlines. Category A contains sports headlines, and Category B contains general international news. If you can read katakana, you might notice that several of the A sentences contain transliterated English words for sports, like サッカー (soccer) and ゴルフ (golf).

[2] I haven’t tried it, but I’m quite confident that this kind of approach would work for the Japanese data I assembled—using a Japanese version of BERT, of course.

[3] I like to present problems of statistical philosophy in terms of dialogues and thought experiments for two reasons. First, it is a way of paying homage to two of my strongest influences: Nassim Taleb and Douglas Hofstadter. Second, the Wason Selection Task experiment shows us that humans are better at understanding the flow of logical reasoning when the logic is embedded within a socially tense situation.

[4] This orientation may be inherited from classical statistics, where most analysis is oriented towards asymptotic (large N) conditions. In this regime, the influence of the prior dwindles to insignificance. But in many real-world applications, N remains small.