This essay explains the Direct Approach proposed by
[@barnettScalingTransformativeAutoregressive2023].[1] I encourage
you to play with the Direct Approach Interactive
Model to
explore an interactive simulation using the approach.
The Direct Approach framework bounds the compute requirements for
transformative AI by extrapolating neural scaling laws. We combine
those estimates with simple models of future progress in algorithms,
investment, and compute costs to produce a user-adjustable forecast
over the date at which TAI will be achieved.
[@barnettDirectApproach2023]
From the POV of the judge, a Turing test is a sequential test for two
statistical hypotheses—“is human” and “is machine”. Under reasonable
assumptions, halving the (reducible part of) log-perplexity loss of the
language model would double the time it can survive in a Turing test.
We can think of the peer-review of scientific papers as a Turing test,
and say that AGI has arrived when we have AI scientists that can pass
the papers peer-review. This allows us to calculate the log-perplexity
loss of the first AGI. If we assume it is just a scaled-up GPT, then
assuming the Chinchilla scaling law, it would cost about 200 years of
global GDP. This makes it virtually certain that
the first AGI will not be a scaled-up GPT.
Turing test as statistical hypothesis test
Turing test
In the Turing test, there
are three players: one judge and two players. The first player is a
human, and the second is a machine. The judge asks each player text
questions and receives text answers. The judge must decide who is the
human.
We consider a simplified Turing test. In this test, the judge does not
ask, and simply receives one stream of text X1:∞. The judge
must decide whether the stream is produced by the human or the machine,
and do so quickly.
Cast in the language of statistical hypothesis testing, we have two
hypotheses:
H0: “the stream is produced by the human”;
H1: “the stream is produced by the machine”.
The judge would read from the stream X1:∞, o-n-e- -t-o-k-e-n
at a time, and at each token, decide whether to take another one, or
announce its judgment: H0 or H1.
As the organizers of the Turing test, we would start the test by
flipping a fair coin to decide whether to use the human or the machine.
Therefore, Pr(H0)=Pr(H1), and by Bayes, the posterior
log-probability ratio is
lnPr(X1:n|H0)Pr(X1:n|H1)=lnPr(H0|X1:n)Pr(H1|X1:n)
This allows us to use the sequential probability ratio
test
(SPRT). The judge would decide on two decision boundaries, and calculate
lnPr(X1:n|H0)Pr(X1:n|H1) at each token. It would
stop and announce the decision as soon as the quantity exceeds one of
the boundaries.
For example, suppose the judge wants to decide when the odds ratio is 10
to 1, then it would set the decision boundaries to be
[−ln10,+ln10]. If lnPr(X1:n|H0)Pr(X1:n|H1)
goes above +ln10 when n=60, then the judge would announce
“H0” at that point.
The ln10 is a good rule of thumb, which we will use for the
remainder of the essay.
Sequential hypothesis testing
Consider the following simple equation:
1nEX∼Pr(⋅|H0)[lnPr(X1:n|H0)Pr(X1:n|H1)]1nDKL(Pr(⋅|H0)∥Pr(⋅|H1))=1nEX∼Pr(⋅|H0)[ln1Pr(X1:n|H1)]negative log-likelihood loss per token−1nEX∼Pr(⋅|H0)[1lnPr(X1:n|H0)]entropy rate of the human itself{#eq-sprt}
The first term is the KL-divergence per token between the machine and
the human. Roughly speaking, it is how different they are, per token
emitted. It is an information-theoretic quantity.
The second term is negative log-likelihood loss per token. This is what
language models are trained to minimize. We write it as L.
The third term is the entropy rate of the human. It is how random the
human is. We write it as L∞, because it is the theoretical
minimal loss that the language model can reach.
If the machine is a perfect replica of the human, then the second term
is zero, and the first term equals the third term.
Assuming that the human is an ergodic speakers of English,[2] we can
sample an infinite stream X1:∞ from the human, then call up
the Shannon—McMillan—Breiman theorem and find that
1nlnPr(X1:n|H0)Pr(X1:n|H1)→L−L∞
On the other hand, if the machine is also an ergodic speaker of English,
then we can sample an infinite stream X1:∞ from the machine,
then call up the SMB theorem and find that
1nlnPr(X1:n|H1)Pr(X1:n|H0)→L′−L′∞
where unfortunately, we have the odd L′ and L′∞, defined by
We can interpret them as the loss of the human at imitating the machine,
and the entropy rate of the machine itself. When the machine is close
enough to the human, we can take the approximation
L′≈L,L′∞≈L∞.
Now, define the log-ratio at step n to be
rn:=Pr(X1:n|H0)Pr(X1:n|H1). During a Turing test,
the judge calculates
So, imagine that such a perfect judge is going through a Turing test,
upon receiving “my cat is technically”, and we are listening on its
thoughts:
“If it were a human, then it would start with ‘my’ with probability
0.01. If it were a machine, then 0.05. Therefore, the odds ratio
is 2 to 1.”
“If it were a human, then it would follow ‘my’ with ‘cat’ with
probability 0.01. If it were a machine, then 0.033. Therefore,
the odds ratio is 3 to 1.”
“If it were a human, then it would follow ‘is’ with ‘my cat’ with
probability… I do not know. However, I do know that the odds
ratio is 2 to 1. Now that the total odds ratio is 12 to 1, I can
decide: H0.”
We see that the judge does not have to know the probabilities
Pr(X1:n|H0) and Pr(X1:n|H1), only their ratio. This might
be a minor point, but this idea of likelihood ratio is quite important.
It is like “I don’t know how often you say ‘cat’ but I know that you say
it twice as often than I do!”.
Let T∗ be the time it takes for the judge to decide.
T∗≈ln10L−L∞
Intuitively, each token on average moves the log-probability-ratio
away from 0 by another (L−L∞). Decision is triggered when it
finally moves out of the interval [−ln10,+ln10].
We are not able to simply look at a few tokens, draw a straight line,
and call it a day, because the trajectory of log-probability-ratio is
much closer to a random walk with drift. Subjectively, if you were a
judge and watching the log-probability-ratio moving, you’d see ups and
downs, keeping you in suspense, until it finally crosses the decision
boundaries.
Slowdown factor
To perform the SPRT as described, the judge must know intimately the
difference between a human and a machine. Can the judge do that? Can
anyone know, with certainty, that I would start my speech with “Forty
cats …” with a probability that is exactly 32.42 times that of
GPT-3?
As a crude approximation, we can model real-world judges as slowed-down
version of the perfect judge. We can imagine that at each step, instead
of updating the log-ratio by
lnrn+1←lnrn+lnPr(Xn+1|H0,X1:n)Pr(Xn+1|H1,X1:n)
we update it by
lnrn+1←lnrn+1slnPr(Xn+1|H0,X1:n)Pr(Xn+1|H1,X1:n)
where s>1 is the slowdown factor. This implies that if it takes
∼T tokens for the perfect judge to reach a likelihood ratio of
r, it would take ∼sT tokens for a human judge.
Measuring the slowdown factor
The slowdown factor s is unknown.
Informed by an internal poll, we enforce a lognormal distribution with
a median of 53.1, a 15th percentile estimate of 9.84, and an 85th
percentile of 290. [@atkinsonDirectApproachInteractive2023]
The original paper [@barnettScalingTransformativeAutoregressive2023]
contains no estimate of s. They did propose to measure it
experimentally by running the Turing test with a human judge and two
language models. One model H0 “perfectly imitates humans” by simply
sampling a random text segment from a corpus, and the other model H1
is a trained language model, finetuned to imitate the same corpus. They
claimed that for any piece of text X1:n, they can calculate the
log-ratio lnPr(X1:n|H0)Pr(X1:n|H1), but I found it
difficult: Suppose X1:n= technically fork, which is
unlikely but possible, yet the phrase never appears in the corpus, what
should be Pr(X1:n|H0)? We can use one of the many smoothing tricks
[@jurafskySpeechLanguageProcessing2023, chapter 3], but this gets
complicated.
What I think would work well is if both H0and H1 are language
models, perhaps even the same model with different sampling
temperatures, then the human judge only has to distinguish the two
models.
Perhaps we can make this into a gambling game. The human subject would
be presented with two long outputs from two hidden Markov models. Then
the subject becomes the judge of a Turing test: “Are you seeing the
output from machine 0 or machine 1?”. At each step, the subject can
either pay a few cents of fake money to see another character, or stop
and make a bet with the entire bankroll: “I bet 70% of my bankroll on
machine 0 and the rest on machine 1!”. Both bets have payoff odds of
2:1. I believe that if the cost of seeing another character is just
right, the subject would be nudged to make a decision at exactly 10:1
posterior odds ratio on the two hypotheses “machine 0” and “machine 1″.
There was one large-scale attempt at the Turing test in early 2023, in a
game called “Human or Not?” [@jannaiHumanNotGamified2023]. Human
participants took 2-minute conversations, and at the end, had to decide
whether they were talking to a human or a bot.[3]
The conversations have a “ping-pong” structure that prevents players
from sending two consecutive messages without a response, in order to
ensure a balanced and dynamic exchange. Each message, limited to a
maximum of 100 characters, has to be composed and sent within a
20-second window, and the chat ends after 2 minutes, usually
consisting of 4-5 messages from each side. This ensures that players
don’t have to wait for too long, so they can remain engaged with the
game and a constant suspense is kept. Once the conversation is over,
players are prompted to guess whether their conversational partner was
a fellow human or an AI bot. [@jannaiHumanNotGamified2023]
I counted that during a typical message, each side sends [20,40]
English words in total, or [30,50] tokens. In [60%,70%] of
trials, the human participant judged correctly. This suggests that the
log-ratio achieved after [30,50] tokens is around the range of
[±ln6/4,±ln7/3]. In other words, the average log-ratio per
token is
[ln6/4,ln7/3][30,50]∈[0.01,0.03]nat/token
They used several different AI, ranging between Jurassic-2, GPT-4, and
Cohere. None of them have published their training compute or loss
curves. The only good estimate is for GPT-4, which has training cost
C=2×1025FLOP.
Assuming that Chinchilla scaling holds,
average log-ratio per token that an ideal judge should achieve is
L−L∞=1070C0.154=0.14nat/token.
Therefore,
s∈[5,14]
I did not expect the estimate to be nearly symmetric around 10.
Entropy of natural languages
In @eq-sprt, we argued that L∞should be interpreted as the
entropy rate of the source, usually human-generated English.
Unfortunately, unlike that of coin flips or Markov chains, the entropy
rate of English cannot be calculated, only estimated. Fortunately, it
can be estimated in several ways, and we can check their agreement.
Since tokenizers are temporary, but English is permanent, we convert all
units to bit/character for easy comparison.
Chinchilla scaling
In the Chinchilla scaling law paper, the authors trained many language
models with various sizes from a single architecture family, and fitted
a statistical law to the data, giving
L∞=1.69nat/token (without error bars, unfortunately)
[@hoffmannTrainingComputeOptimalLarge2022, page 25].
To find the effective bit/character for the Chinchilla scaling
law, we need to convert nat to bit, and token to
character. The first is easy:
1bit=ln(2)nat. The second can be estimated
by running a tokenizer over a large natural English corpus. I have
estimated this by running the GPT-2 tokenizer on the
WikiText-2 corpus,
and found that on average,
1token=4.5character=0.85word
Thus,
L∞≈1.694.5×ln2=0.54bit/character.
Guessing game
The earliest attempt to measure the entropy rate of English is by
Shannon himself [@shannonPredictionEntropyPrinted1951]:
[0.6,1.3]bit/character. He obtained the estimate by
presenting human subjects n−1 characters from a text, and ask them to
guess the next character repeatedly, until they got it right. In this
case, the optimal strategy is to construct the n-gram table, and pick
the argmax character for the given (n−1)-gram, then the arg-next-max,
and so on.
Let N be the total number of characters allowed—Shannon’s
experiment used N=27, with 26 lowercase letters and one white space.
Let pk be the frequency that the subject makes exactly k guesses --
including the correct guess, so that ∑Nk=1pk=1. By
convention, pN+1:=0. Shannon derived both an upper and a lower
bound for the entropy per character:
N∑k=1k(pk−pk+1)lnk≤H≤−N∑k=1pklnpk
The upper bound is proved by Shannon’s source coding
theorem.
Taking a human subject, copy it, then they can be used as an
encoder-decoder pair.[4] The lower bound is not only tricky to prove,
but also wrong in general. It is only correct when the human subject
is the optimalN-gram predictor.[5] Because of this, I do not
recommend using this lower bound, but will quote it anyway.
Over the years, others devised other methods to estimate this entropy.
For example, [@coverConvergentGamblingEstimate1978] used a gambling
game estimation, in the style of the Kelly
criterion. Subjects were
required to divide their entire bankroll into 27 differently-sized bets
over 27 possibilities (26 letters and 1 whitespace). The right bet pays
back 27-fold, and the other bets are lost. Let Sn be the size of
bankroll after n rounds of betting, then
H≤ln27−limsupn1nlnSn
They found that H≤1.3bit/character.
The guesser does not have to be a human. It can very well be a language
model. [@brownEstimateUpperBound1992] made a simple trigram model
over the Brown corpus (600 million words), and found that it gives
H≤1.75bit/character.
[@behrjrEstimatingComparingEntropy2002] used a model that combines
multiple n-gram models, giving H≤1.46bit/character.
Lossless compression
Another way to estimate is by lossless compression of a large corpus,
since the entropy rate is the lower bound on compression rate. In more
detail, if you have a source of information emitting symbols, and its
symbol stream has an entropy rate of xbit/symbol, then it
takes at least ∼xl bits to encode a long segment with l symbols.
Furthermore, this lower bound is approachable using the entropy
encoding.
The Hutter prize is a competition for
compressing a 109-byte corpus from the English Wikipedia (enwik9).
For the size of the finished product, both the algorithm and the
compressed data must be counted. In particular, if a neural network is
used, then the size of the neural network weights must be counted as
well.
The enwik9 dataset is in XML format, and thus contains a lot of
non-English content like <timestamp>2005-12-27T18:46:47Z</timestamp>.
It has 109 bytes. It is tricky to decide how to clean it up to remove
all the XML formatting. As a simple estimate, we counted its words and
characters directly with Linux command wc without any preprocessing,
which gives us
The standard zip algorithm can compress it down to about 300 Mb in size,
a compression ratio of ∼3×. Over the years, the progress has
been slow but somewhat steady. The current winning entry (Saurabh Kumar,
2023) has a compression ratio of 8.76×. If we extrapolate the
prize-winning entries over the years, it seems that the best possible
compression ratio is ∼10×.
Similar to the Hutter prize, the Large Text Compression
Benchmark also asks for
compressing the enwik9 dataset. However, there is no limit to the
algorithm runtime or size, so the compression ratio for this benchmark
is always higher. Currently (2024-01-19), the maximal compression rate
reached is 9.35× with nncp v3.2,
which uses a small Transformer model.
[@grassbergerDataCompressionEntropy2002] used a substitutional
compression algorithm with increasingly large codebooks. When the
codebook had 6000 codes, the algorithm gave
h≤1.82bit/character. By extrapolating the {codebook
size}-{entropy rate} curve to an infinitely large codebook, they
estimated that English has entropy rate
0.7±0.2bit/character.
Notably, the above table has mostly upper bounds, and only one dubious
lower bound (by Shannon) from 1951. Perhaps lower bounds can be
established by using randomness
extractors on a
large corpus, and checking that the output from the extractor passes
pseudorandomness tests.
Forecasting AGI
According to the Chinchilla scaling
law
[@hoffmannTrainingComputeOptimalLarge2022], if we have a fixed amount
of computing budget C, by choosing the model and dataset size
correctly, the minimal reducible loss achievable is
Assuming a slowdown factor s, that the judge decides when the odds
ratio is r:1, and the Chinchilla scaling law, we have a direct method
to predict how long a language model can survive in a Turing test,
according to the cost of training compute C:
T∗∼slnr1070(C/FLOP)0.154token
This gives, as a rule of thumb, 100× compute means 2×
length of survival in a Turing test.
For example, assuming a slowdown factor of s=10, and that the judge
decides when the odds ratio is 10:1, for a language model to survive
for 1000 tokens, it needs
L−L∞≤10×ln10/1000=0.023nat/token
If GPT-4 costs 2×1025FLOP in compute, and
1word≈1.2token, then
T∗≈170 tokens≈150 words
meaning it has a good chance of passing the Turing test if limited to
only 150 words. For context, the Attention is All You Need paper has
an abstract that’s 200 tokens long.
A typical scientific paper is about 4000 words long, which is 27×
that of 150 words, so it would need
271/0.153=(2×109)× that of compute. Assuming that
GPT-4 cost 10 million USD to train, this hypothetical AI would cost
2×1016 USD, or 200 years of global GDP2023.
This implies that the first AGI will not be a scaled-up GPT --
autoregressive transformer generatively pretrained on a lightly filtered
text dataset. It has to include something else, perhaps multimodal data,
high-quality data, better architecture, etc. Even if we were to attempt
to merely scale it up, turning earth into a GPT-factory,[6] with even
50% of global GDP devoted,[7] and with 2% growth rate forever, it would
still take 110 years,[8] arriving at year 2133. Whole brain emulation
would likely take less time.[9]
Appendix: Ergodic theory
Since we used ergodic theory during the essay, we should quickly explain
what it is about. This section is foundational, but the full complexity
is not necessary.
Measure-theoretic POV
I know, you know too, nobody really likes measure theory any more than
pianists like practicing scales hundreds of times. Still, it is at the
right level of abstraction for many theories, including probability.
We omit all mentions of “almost-everywhere”, “except on a set of measure
zero”, and similar annoying phrases. As long as you never make a union
of uncountable many subsets, you will not be hurt by this omission.
A probability space
is a measurable space with a measure of 1. We write it as
(Ω,B,Pr), where B is the sigma-algebra of
measurable sets, and Pr is the probability measure. We also write
μ for the measure.[10]
We consider a single measurable function T:Ω→Ω, and
call it the shift map.
We demand that Tmustpreserve measure. That is,
∀S∈B, we have Pr(T−1(S))=Pr(S).
A subset is measurable iff it is an element of B. A
measurable set is also called an event.
A subset S∈B is T-invariant iff T−1(S)=S almost
everywhere.[11] Let I be the set of all T-invariant
subsets:
I:={S∈B:T−1(S)=S}
Now, obviously any set of measure zero or one are T-invariant. We say
that those are triviallyT-invariant. We say that T is ergodic
iff I has only such trivial subsets. In other words, T is
ergodic iff it cannot be factored into two nontrivial chunks:
S,S′ partitions Ω,such that T−1(S)=S,T−1(S′)=S′,Pr(S)>0,Pr(S′)>0
We usually ask T to also be ergodic, though sometimes we don’t need
that.
Ergodic maps have many very good properties. We will use the following
one. For the theorem, you can picture it as the real space
Rn with the gaussian probability distribution, but in fact,
it applies for just about everything we would care about, such as the
space of English texts, queuing
jobs, random
walks, etc.[12]
Theorem: If the state space is a topological space with a countable
basis, and any
nonempty open set has positive measure, then almost any X∈Ω has
a dense orbit.
Proof: Let U be a nonempty open set.
Ω−∪i≥0T−iU is T-invariant, and since it
excludes U, it does not have the full measure. Since T is ergodic,
the set actually has zero measure.
Now, ∪(Ω−∪T−iU) is a union of countably many
zero-measure sets, so it still has zero measure. By expanding the
definition, this is the set of all points with non-dense orbit.
Finally, there is a common theme in ergodic theory. There are rigorous
versions of it, but instead of going for rigor, the spirit is more
important:
Theorem (ergodic decomposition): Any interesting map is a partition/sum/integral of ergodic maps.
For example, the shear map on the unit square [0,1]2 defined by
(x,y)↦(x,x+ymod1)
can be thought of as an integral over rotations: For each
x∈[0,1], we have Tx:y↦x+ymod1. For almost all
x∈[0,1], we have Tx an irrational
rotation, thus
ergodic.
Sequence POV
We must interpret the language of measure theory, which is dead like
chalk dust, back into the language of sequence predictions, which is
alive like reinforced concrete.
Each point in the state space X∈Ω is a text: a stream of
tokens infinite both forwards and backwards. The state space Ω is
the all possible texts (Xn)n. We assume that all tokens come from
the same finite-size alphabet, for example, the 128 ASCII symbols.
The shift map on the state space T:Ω→Ω is defined by
moving the origin to the right by one:
T(…,X−1,X0,X1,…):=(…,X0,X1,X2,…)
The shift map is measure-preserving, meaning that the process is
stationary: We could have started reading at any point, and we would
still expect to see the same kind of probability distribution. It would
not be like “Sorry, the word ‘cat’ appears with zero probability when
n≥1000.”. It would be like “No matter where we start reading, we
should expect to the first three tokens to be ‘cat’ with probability
10−4.”.
Repeatedly applying the shift map T is just reading through the
stream, one token at a time:
...Lorem ipsum ...↦...orem ipsum d...↦...rem ipsum do...↦⋯
A periodic point of T is a text that repeats itself like a broken
record. For example, X:=... and and and ... satisfies
T4X=X.
A T-invariant set S⊂Ω is a set of texts, such that if we
take any text X from S, and jump either forwards or backwards for an
arbitrary amount, we get another set in S. In other words, S is a
set of token streams where there is no origin: you can start reading
from any token.
A probability distribution over Ω describes the probability of
observing various kinds of text streams.
If we can partition Ω into two subsets P,Q, with probabilities
ϵ>0,1−ϵ>0, then it means that any text from P is
different from any text from Q, after any shift. It is as if there are
two languages, and each text can be exclusively written in one language
only.
We wish to consider only texts created by some imaginary “universal
English speaker”. In particular, we do not want it to get stuck in one
sub-language of English, then never escape from it. That is, we assume
the universal speaker is ergodic.
Now imagine that we randomly sample two pieces of text generated by the
universal speaker, and we shift the first text around to match it
against the second. By @thm-ergodic-dense-orbit, the orbit of the first
text is dense in the space of all possible English texts spoken by the
universal speaker. We can gamify this situation thus:
Prover: “I take one piece of text x, then another piece x′.”.
Challenger: “I challenge you to find a stretch of text from x that
matches the −1000:1000 stretch in x′.”.
Challenger verifies that
T49134819(x)−1000:1000=(x′)−1000:1000.
Shannon—McMillan—Breiman
If someone has created an infinite sequence of coin flips
X−∞:+∞, then revealed it to us one by one, then each
reveal would give us 1bit=ln2nat. The long-term
average obtained per reveal is still ln2nat, a rather boring
situation.
How do we measure the entropy of an English speaker? It speaks token by
token, and we have to measure the average information we obtain per
token. The problem is that there are two senses of “average”. It could
be the time-average: we listen to the speaker speak for a very long
time, and calculate the entropy in the speech. It could be the
ensemble-average: we listen to the speaker speak for a very long time,
then do it again, then again, etc, then average together the
time-averages.
If the speaker is ergodic, then the speaker essentially has just one
speech, and any two samples of its speech are just translations of each
other. Consequently, it is intuitively clear that with probability 1,
the time-average of the entropy of one speech equals the
ensemble-average of the entropy of all speeches. Intuitively, with
probability 1,
In textbooks and Wikipedia, the SMB theorem is stated rigorously, but
you have already understood the idea of SMB, and the rigorous versions
are simply paraphrases of the idea.
The thing is released in a scattered way, typical for an
internet-native publication. There is the report
[@barnettScalingTransformativeAutoregressive2023], in the form of
a paper—clearly meant to be cited, despite being hard to read.
There is the website [@barnettDirectApproach2023], in the form of
a blog post—clearly meant to be read, despite not being
upper-class enough to be cited in journal papers. Finally there is
the interactive
model
which looks like an optional add-on to the blog post.
In short, an ergodic speaker is someone who has only one speech.
If you hear it speak once for a very long time, then hear it speak
again for a very long time, then you can take the first and shift it
around, so that it looks like the second over a very long
sub-segment. Ergodic speakers allow you to take the average over a
single very long speech, and be assured that it is close to the
average over all possible speeches.
It still works even if the humans are pseudorandom. We just have
to whisper the same RNG seed into both humans’ ears, and then they
would behave in the same pseudorandom way.
The simplest counterexample: Suppose the source is binary, and
satisfies Xn+1=Xn+1mod2, so it has zero entropy.
Nevertheless, the human intentionally guesses wrong the first time.
Therefore, we have p2=1, and we have violated the lower bound
by 2ln2>0.
This source can be made ergodic by adding an ϵ amount of
coin-flip noise: Xn+1=Xn+1mod2 with probability
1−ϵ. This would still give us
2ln2+O(ϵ)>O(ϵlnϵ).
The possibilities of developing an atomic weapon and the
desirability of doing it secretly were discussed at a Princeton
University conference in which I participated in March 1939…
Bohr said this rare variety could not be separated from common
uranium except by turning the country into a gigantic factory.
Bohr was worried that this could be done and that an atomic bomb
could be developed—but he hoped that neither could be
accomplished. Years later, when Bohr came to Los Alamos, I was
prepared to say, “You see...” But before I could open my mouth, he
said: “You see, I told you it couldn’t be done without turning the
whole country into a factory. You have done just that.”
[@tellerLegacyHiroshima1975]
Only in a life-or-death situation does 50% of GDP get devoted to
one purpose. For example, that is about the level of GDP devoted to
war production during WWII in the major combatant countries. The USA
spent 4 trillion USD2011 over 6 years out of an annual GDP of 1.3
trillion USD2011.
Except pathological examples constructed by logicians who have
nothing better to do than to care about the continuum hypothesis,
large cardinals, and the arithmetic hierarchy. Those who desire the
rigor-mortis of logic, let them have it.
Predicting AGI by the Turing Test
Link post
Abstract
This essay explains the Direct Approach proposed by [@barnettScalingTransformativeAutoregressive2023].[1] I encourage you to play with the Direct Approach Interactive Model to explore an interactive simulation using the approach.
From the POV of the judge, a Turing test is a sequential test for two statistical hypotheses—“is human” and “is machine”. Under reasonable assumptions, halving the (reducible part of) log-perplexity loss of the language model would double the time it can survive in a Turing test.
We can think of the peer-review of scientific papers as a Turing test, and say that AGI has arrived when we have AI scientists that can pass the papers peer-review. This allows us to calculate the log-perplexity loss of the first AGI. If we assume it is just a scaled-up GPT, then assuming the Chinchilla scaling law, it would cost about 200 years of global GDP. This makes it virtually certain that the first AGI will not be a scaled-up GPT.
Turing test as statistical hypothesis test
Turing test
In the Turing test, there are three players: one judge and two players. The first player is a human, and the second is a machine. The judge asks each player text questions and receives text answers. The judge must decide who is the human.
We consider a simplified Turing test. In this test, the judge does not ask, and simply receives one stream of text X1:∞. The judge must decide whether the stream is produced by the human or the machine, and do so quickly.
Cast in the language of statistical hypothesis testing, we have two hypotheses:
H0: “the stream is produced by the human”;
H1: “the stream is produced by the machine”.
The judge would read from the stream X1:∞,
o-n-e- -t-o-k-e-n
at a time, and at each token, decide whether to take another one, or announce its judgment: H0 or H1.As the organizers of the Turing test, we would start the test by flipping a fair coin to decide whether to use the human or the machine. Therefore, Pr(H0)=Pr(H1), and by Bayes, the posterior log-probability ratio is
lnPr(X1:n|H0)Pr(X1:n|H1)=lnPr(H0|X1:n)Pr(H1|X1:n)
This allows us to use the sequential probability ratio test (SPRT). The judge would decide on two decision boundaries, and calculate lnPr(X1:n|H0)Pr(X1:n|H1) at each token. It would stop and announce the decision as soon as the quantity exceeds one of the boundaries.
For example, suppose the judge wants to decide when the odds ratio is 10 to 1, then it would set the decision boundaries to be [−ln10,+ln10]. If lnPr(X1:n|H0)Pr(X1:n|H1) goes above +ln10 when n=60, then the judge would announce “H0” at that point.
The ln10 is a good rule of thumb, which we will use for the remainder of the essay.
Sequential hypothesis testing
Consider the following simple equation:
1nEX∼Pr(⋅|H0)[lnPr(X1:n|H0)Pr(X1:n|H1)]1nDKL(Pr(⋅|H0)∥Pr(⋅|H1))=1nEX∼Pr(⋅|H0)[ln1Pr(X1:n|H1)]negative log-likelihood loss per token−1nEX∼Pr(⋅|H0)[1lnPr(X1:n|H0)]entropy rate of the human itself{#eq-sprt}
The first term is the KL-divergence per token between the machine and the human. Roughly speaking, it is how different they are, per token emitted. It is an information-theoretic quantity.
The second term is negative log-likelihood loss per token. This is what language models are trained to minimize. We write it as L.
The third term is the entropy rate of the human. It is how random the human is. We write it as L∞, because it is the theoretical minimal loss that the language model can reach.
If the machine is a perfect replica of the human, then the second term is zero, and the first term equals the third term.
Assuming that the human is an ergodic speakers of English,[2] we can sample an infinite stream X1:∞ from the human, then call up the Shannon—McMillan—Breiman theorem and find that
1nlnPr(X1:n|H0)Pr(X1:n|H1)→L−L∞
On the other hand, if the machine is also an ergodic speaker of English, then we can sample an infinite stream X1:∞ from the machine, then call up the SMB theorem and find that
1nlnPr(X1:n|H1)Pr(X1:n|H0)→L′−L′∞
where unfortunately, we have the odd L′ and L′∞, defined by
L′:=limn1nEX∼Pr(⋅|H1)[ln1Pr(X1:n|H0)],L′∞:=limn1nEX∼Pr(⋅|H1)[ln1Pr(X1:n|H1)]
We can interpret them as the loss of the human at imitating the machine, and the entropy rate of the machine itself. When the machine is close enough to the human, we can take the approximation L′≈L,L′∞≈L∞.
Now, define the log-ratio at step n to be rn:=Pr(X1:n|H0)Pr(X1:n|H1). During a Turing test, the judge calculates
r0=1r1=r0+Pr(X1:1|H0)Pr(X1:1|H1)r2=r1+Pr(X1:2|X1:1,H0)Pr(X1:2|X1:1,H1)⋯
So, imagine that such a perfect judge is going through a Turing test, upon receiving “my cat is technically”, and we are listening on its thoughts:
“If it were a human, then it would start with ‘my’ with probability 0.01. If it were a machine, then 0.05. Therefore, the odds ratio is 2 to 1.”
“If it were a human, then it would follow ‘my’ with ‘cat’ with probability 0.01. If it were a machine, then 0.033. Therefore, the odds ratio is 3 to 1.”
“If it were a human, then it would follow ‘is’ with ‘my cat’ with probability… I do not know. However, I do know that the odds ratio is 2 to 1. Now that the total odds ratio is 12 to 1, I can decide: H0.”
We see that the judge does not have to know the probabilities Pr(X1:n|H0) and Pr(X1:n|H1), only their ratio. This might be a minor point, but this idea of likelihood ratio is quite important. It is like “I don’t know how often you say ‘cat’ but I know that you say it twice as often than I do!”.
Let T∗ be the time it takes for the judge to decide.
T∗≈ln10L−L∞
Intuitively, each token on average moves the log-probability-ratio away from 0 by another (L−L∞). Decision is triggered when it finally moves out of the interval [−ln10,+ln10].
We are not able to simply look at a few tokens, draw a straight line, and call it a day, because the trajectory of log-probability-ratio is much closer to a random walk with drift. Subjectively, if you were a judge and watching the log-probability-ratio moving, you’d see ups and downs, keeping you in suspense, until it finally crosses the decision boundaries.
Slowdown factor
To perform the SPRT as described, the judge must know intimately the difference between a human and a machine. Can the judge do that? Can anyone know, with certainty, that I would start my speech with “Forty cats …” with a probability that is exactly 32.42 times that of GPT-3?
As a crude approximation, we can model real-world judges as slowed-down version of the perfect judge. We can imagine that at each step, instead of updating the log-ratio by
lnrn+1←lnrn+lnPr(Xn+1|H0,X1:n)Pr(Xn+1|H1,X1:n)
we update it by
lnrn+1←lnrn+1slnPr(Xn+1|H0,X1:n)Pr(Xn+1|H1,X1:n)
where s>1 is the slowdown factor. This implies that if it takes ∼T tokens for the perfect judge to reach a likelihood ratio of r, it would take ∼sT tokens for a human judge.
Measuring the slowdown factor
The slowdown factor s is unknown.
The original paper [@barnettScalingTransformativeAutoregressive2023] contains no estimate of s. They did propose to measure it experimentally by running the Turing test with a human judge and two language models. One model H0 “perfectly imitates humans” by simply sampling a random text segment from a corpus, and the other model H1 is a trained language model, finetuned to imitate the same corpus. They claimed that for any piece of text X1:n, they can calculate the log-ratio lnPr(X1:n|H0)Pr(X1:n|H1), but I found it difficult: Suppose X1:n= technically fork, which is unlikely but possible, yet the phrase never appears in the corpus, what should be Pr(X1:n|H0)? We can use one of the many smoothing tricks [@jurafskySpeechLanguageProcessing2023, chapter 3], but this gets complicated.
What I think would work well is if both H0and H1 are language models, perhaps even the same model with different sampling temperatures, then the human judge only has to distinguish the two models.
Perhaps we can make this into a gambling game. The human subject would be presented with two long outputs from two hidden Markov models. Then the subject becomes the judge of a Turing test: “Are you seeing the output from machine 0 or machine 1?”. At each step, the subject can either pay a few cents of fake money to see another character, or stop and make a bet with the entire bankroll: “I bet 70% of my bankroll on machine 0 and the rest on machine 1!”. Both bets have payoff odds of 2:1. I believe that if the cost of seeing another character is just right, the subject would be nudged to make a decision at exactly 10:1 posterior odds ratio on the two hypotheses “machine 0” and “machine 1″.
There was one large-scale attempt at the Turing test in early 2023, in a game called “Human or Not?” [@jannaiHumanNotGamified2023]. Human participants took 2-minute conversations, and at the end, had to decide whether they were talking to a human or a bot.[3]
I counted that during a typical message, each side sends [20,40] English words in total, or [30,50] tokens. In [60%,70%] of trials, the human participant judged correctly. This suggests that the log-ratio achieved after [30,50] tokens is around the range of [±ln6/4,±ln7/3]. In other words, the average log-ratio per token is
[ln6/4,ln7/3][30,50]∈[0.01,0.03]nat/token
They used several different AI, ranging between Jurassic-2, GPT-4, and Cohere. None of them have published their training compute or loss curves. The only good estimate is for GPT-4, which has training cost C=2×1025FLOP.
Assuming that Chinchilla scaling holds, average log-ratio per token that an ideal judge should achieve is L−L∞=1070C0.154=0.14nat/token. Therefore,
s∈[5,14]
I did not expect the estimate to be nearly symmetric around 10.
Entropy of natural languages
In @eq-sprt, we argued that L∞ should be interpreted as the entropy rate of the source, usually human-generated English. Unfortunately, unlike that of coin flips or Markov chains, the entropy rate of English cannot be calculated, only estimated. Fortunately, it can be estimated in several ways, and we can check their agreement.
Since tokenizers are temporary, but English is permanent, we convert all units to bit/character for easy comparison.
Chinchilla scaling
In the Chinchilla scaling law paper, the authors trained many language models with various sizes from a single architecture family, and fitted a statistical law to the data, giving L∞=1.69nat/token (without error bars, unfortunately) [@hoffmannTrainingComputeOptimalLarge2022, page 25].
To find the effective bit/character for the Chinchilla scaling law, we need to convert nat to bit, and token to character. The first is easy: 1bit=ln(2)nat. The second can be estimated by running a tokenizer over a large natural English corpus. I have estimated this by running the GPT-2 tokenizer on the
WikiText-2
corpus, and found that on average,1token=4.5character=0.85word
Thus, L∞≈1.694.5×ln2=0.54bit/character.
Guessing game
The earliest attempt to measure the entropy rate of English is by Shannon himself [@shannonPredictionEntropyPrinted1951]: [0.6,1.3]bit/character. He obtained the estimate by presenting human subjects n−1 characters from a text, and ask them to guess the next character repeatedly, until they got it right. In this case, the optimal strategy is to construct the n-gram table, and pick the argmax character for the given (n−1)-gram, then the arg-next-max, and so on.
Let N be the total number of characters allowed—Shannon’s experiment used N=27, with 26 lowercase letters and one white space. Let pk be the frequency that the subject makes exactly k guesses -- including the correct guess, so that ∑Nk=1pk=1. By convention, pN+1:=0. Shannon derived both an upper and a lower bound for the entropy per character:
N∑k=1k(pk−pk+1)lnk≤H≤−N∑k=1pklnpk
The upper bound is proved by Shannon’s source coding theorem. Taking a human subject, copy it, then they can be used as an encoder-decoder pair.[4] The lower bound is not only tricky to prove, but also wrong in general. It is only correct when the human subject is the optimal N-gram predictor.[5] Because of this, I do not recommend using this lower bound, but will quote it anyway.
Over the years, others devised other methods to estimate this entropy. For example, [@coverConvergentGamblingEstimate1978] used a gambling game estimation, in the style of the Kelly criterion. Subjects were required to divide their entire bankroll into 27 differently-sized bets over 27 possibilities (26 letters and 1 whitespace). The right bet pays back 27-fold, and the other bets are lost. Let Sn be the size of bankroll after n rounds of betting, then
H≤ln27−limsupn1nlnSn
They found that H≤1.3bit/character.
The guesser does not have to be a human. It can very well be a language model. [@brownEstimateUpperBound1992] made a simple trigram model over the Brown corpus (600 million words), and found that it gives H≤1.75bit/character. [@behrjrEstimatingComparingEntropy2002] used a model that combines multiple n-gram models, giving H≤1.46bit/character.
Lossless compression
Another way to estimate is by lossless compression of a large corpus, since the entropy rate is the lower bound on compression rate. In more detail, if you have a source of information emitting symbols, and its symbol stream has an entropy rate of xbit/symbol, then it takes at least ∼xl bits to encode a long segment with l symbols. Furthermore, this lower bound is approachable using the entropy encoding.
The Hutter prize is a competition for compressing a 109-byte corpus from the English Wikipedia (
enwik9
). For the size of the finished product, both the algorithm and the compressed data must be counted. In particular, if a neural network is used, then the size of the neural network weights must be counted as well.The
enwik9
dataset is inXML
format, and thus contains a lot of non-English content like<timestamp>2005-12-27T18:46:47Z</timestamp>
. It has 109 bytes. It is tricky to decide how to clean it up to remove all theXML
formatting. As a simple estimate, we counted its words and characters directly with Linux commandwc
without any preprocessing, which gives us13,147,025 words=129,347,857 characters=1,000,000,000 bytes
Therefore, the entropy rate is
8×108/13,147,025compression ratio=6.15compression ratiobit/character{#eq-hutter-prize-entropy-rate}
The standard zip algorithm can compress it down to about 300 Mb in size, a compression ratio of ∼3×. Over the years, the progress has been slow but somewhat steady. The current winning entry (Saurabh Kumar, 2023) has a compression ratio of 8.76×. If we extrapolate the prize-winning entries over the years, it seems that the best possible compression ratio is ∼10×.
Similar to the Hutter prize, the Large Text Compression Benchmark also asks for compressing the
enwik9
dataset. However, there is no limit to the algorithm runtime or size, so the compression ratio for this benchmark is always higher. Currently (2024-01-19), the maximal compression rate reached is 9.35× withnncp v3.2
, which uses a small Transformer model.[@grassbergerDataCompressionEntropy2002] used a substitutional compression algorithm with increasingly large codebooks. When the codebook had 6000 codes, the algorithm gave h≤1.82bit/character. By extrapolating the {codebook size}-{entropy rate} curve to an infinitely large codebook, they estimated that English has entropy rate 0.7±0.2bit/character.
Summary
nncp v3.2
, 2023)Notably, the above table has mostly upper bounds, and only one dubious lower bound (by Shannon) from 1951. Perhaps lower bounds can be established by using randomness extractors on a large corpus, and checking that the output from the extractor passes pseudorandomness tests.
Forecasting AGI
According to the Chinchilla scaling law [@hoffmannTrainingComputeOptimalLarge2022], if we have a fixed amount of computing budget C, by choosing the model and dataset size correctly, the minimal reducible loss achievable is
L−L∞=1070(C/FLOP)0.154nat/token{#eq-chinchilla-scaling}
Assuming a slowdown factor s, that the judge decides when the odds ratio is r:1, and the Chinchilla scaling law, we have a direct method to predict how long a language model can survive in a Turing test, according to the cost of training compute C:
T∗∼slnr1070(C/FLOP)0.154token
This gives, as a rule of thumb, 100× compute means 2× length of survival in a Turing test.
For example, assuming a slowdown factor of s=10, and that the judge decides when the odds ratio is 10:1, for a language model to survive for 1000 tokens, it needs
L−L∞≤10×ln10/1000=0.023nat/token
If GPT-4 costs 2×1025FLOP in compute, and 1word≈1.2token, then
T∗≈170 tokens≈150 words
meaning it has a good chance of passing the Turing test if limited to only 150 words. For context, the Attention is All You Need paper has an abstract that’s 200 tokens long.
A typical scientific paper is about 4000 words long, which is 27× that of 150 words, so it would need 271/0.153=(2×109)× that of compute. Assuming that GPT-4 cost 10 million USD to train, this hypothetical AI would cost 2×1016 USD, or 200 years of global GDP2023.
This implies that the first AGI will not be a scaled-up GPT -- autoregressive transformer generatively pretrained on a lightly filtered text dataset. It has to include something else, perhaps multimodal data, high-quality data, better architecture, etc. Even if we were to attempt to merely scale it up, turning earth into a GPT-factory,[6] with even 50% of global GDP devoted,[7] and with 2% growth rate forever, it would still take 110 years,[8] arriving at year 2133. Whole brain emulation would likely take less time.[9]
Appendix: Ergodic theory
Since we used ergodic theory during the essay, we should quickly explain what it is about. This section is foundational, but the full complexity is not necessary.
Measure-theoretic POV
I know, you know too, nobody really likes measure theory any more than pianists like practicing scales hundreds of times. Still, it is at the right level of abstraction for many theories, including probability.
We omit all mentions of “almost-everywhere”, “except on a set of measure zero”, and similar annoying phrases. As long as you never make a union of uncountable many subsets, you will not be hurt by this omission.
A probability space is a measurable space with a measure of 1. We write it as (Ω,B,Pr), where B is the sigma-algebra of measurable sets, and Pr is the probability measure. We also write μ for the measure.[10]
We consider a single measurable function T:Ω→Ω, and call it the shift map.
We demand that T must preserve measure. That is, ∀S∈B, we have Pr(T−1(S))=Pr(S).
A subset is measurable iff it is an element of B. A measurable set is also called an event.
A subset S∈B is T-invariant iff T−1(S)=S almost everywhere.[11] Let I be the set of all T-invariant subsets:
I:={S∈B:T−1(S)=S}
Now, obviously any set of measure zero or one are T-invariant. We say that those are trivially T-invariant. We say that T is ergodic iff I has only such trivial subsets. In other words, T is ergodic iff it cannot be factored into two nontrivial chunks:
S,S′ partitions Ω,such that T−1(S)=S,T−1(S′)=S′,Pr(S)>0,Pr(S′)>0
We usually ask T to also be ergodic, though sometimes we don’t need that.
Ergodic maps have many very good properties. We will use the following one. For the theorem, you can picture it as the real space Rn with the gaussian probability distribution, but in fact, it applies for just about everything we would care about, such as the space of English texts, queuing jobs, random walks, etc.[12]
Theorem: If the state space is a topological space with a countable basis, and any nonempty open set has positive measure, then almost any X∈Ω has a dense orbit.
Proof: Let U be a nonempty open set.
Ω−∪i≥0T−iU is T-invariant, and since it excludes U, it does not have the full measure. Since T is ergodic, the set actually has zero measure.
Now, ∪(Ω−∪T−iU) is a union of countably many zero-measure sets, so it still has zero measure. By expanding the definition, this is the set of all points with non-dense orbit.
Finally, there is a common theme in ergodic theory. There are rigorous versions of it, but instead of going for rigor, the spirit is more important:
Theorem (ergodic decomposition): Any interesting map is a partition/sum/integral of ergodic maps.
For example, the shear map on the unit square [0,1]2 defined by
(x,y)↦(x,x+ymod1)
can be thought of as an integral over rotations: For each x∈[0,1], we have Tx:y↦x+ymod1. For almost all x∈[0,1], we have Tx an irrational rotation, thus ergodic.
Sequence POV
We must interpret the language of measure theory, which is dead like chalk dust, back into the language of sequence predictions, which is alive like reinforced concrete.
Each point in the state space X∈Ω is a text: a stream of tokens infinite both forwards and backwards. The state space Ω is the all possible texts (Xn)n. We assume that all tokens come from the same finite-size alphabet, for example, the 128 ASCII symbols.
The shift map on the state space T:Ω→Ω is defined by moving the origin to the right by one:
T(…,X−1,X0,X1,…):=(…,X0,X1,X2,…)
The shift map is measure-preserving, meaning that the process is stationary: We could have started reading at any point, and we would still expect to see the same kind of probability distribution. It would not be like “Sorry, the word ‘cat’ appears with zero probability when n≥1000.”. It would be like “No matter where we start reading, we should expect to the first three tokens to be ‘cat’ with probability 10−4.”.
Repeatedly applying the shift map T is just reading through the stream, one token at a time:
...Lorem ipsum ...↦...orem ipsum d...↦...rem ipsum do...↦⋯
A periodic point of T is a text that repeats itself like a broken record. For example, X:=... and and and ... satisfies T4X=X.
A T-invariant set S⊂Ω is a set of texts, such that if we take any text X from S, and jump either forwards or backwards for an arbitrary amount, we get another set in S. In other words, S is a set of token streams where there is no origin: you can start reading from any token.
A probability distribution over Ω describes the probability of observing various kinds of text streams.
If we can partition Ω into two subsets P,Q, with probabilities ϵ>0,1−ϵ>0, then it means that any text from P is different from any text from Q, after any shift. It is as if there are two languages, and each text can be exclusively written in one language only.
We wish to consider only texts created by some imaginary “universal English speaker”. In particular, we do not want it to get stuck in one sub-language of English, then never escape from it. That is, we assume the universal speaker is ergodic.
Now imagine that we randomly sample two pieces of text generated by the universal speaker, and we shift the first text around to match it against the second. By @thm-ergodic-dense-orbit, the orbit of the first text is dense in the space of all possible English texts spoken by the universal speaker. We can gamify this situation thus:
Prover: “I take one piece of text x, then another piece x′.”.
Challenger: “I challenge you to find a stretch of text from x that matches the −1000:1000 stretch in x′.”.
Prover asks a team of immortal monkeys to do the task. A million years later: “At 49134819.”.
Challenger verifies that T49134819(x)−1000:1000=(x′)−1000:1000.
Shannon—McMillan—Breiman
If someone has created an infinite sequence of coin flips X−∞:+∞, then revealed it to us one by one, then each reveal would give us 1bit=ln2nat. The long-term average obtained per reveal is still ln2nat, a rather boring situation.
How do we measure the entropy of an English speaker? It speaks token by token, and we have to measure the average information we obtain per token. The problem is that there are two senses of “average”. It could be the time-average: we listen to the speaker speak for a very long time, and calculate the entropy in the speech. It could be the ensemble-average: we listen to the speaker speak for a very long time, then do it again, then again, etc, then average together the time-averages.
If the speaker is ergodic, then the speaker essentially has just one speech, and any two samples of its speech are just translations of each other. Consequently, it is intuitively clear that with probability 1, the time-average of the entropy of one speech equals the ensemble-average of the entropy of all speeches. Intuitively, with probability 1,
1nlnPr(X1:n)→E[1nlnPr(X1:n)]
For non-ergodic speakers. We simply decompose the speaker into an ensemble of ergodic speakers, then apply the SMB theorem to each one. It is like the strong law of large numbers. Intuitively, with probability 1,
1nlnPr(X1:n|X is type i)→E[1nlnPr(X1:n)|X is type i]
This is the Shannon—McMillan—Breiman theorem.
In textbooks and Wikipedia, the SMB theorem is stated rigorously, but you have already understood the idea of SMB, and the rigorous versions are simply paraphrases of the idea.
The thing is released in a scattered way, typical for an internet-native publication. There is the report [@barnettScalingTransformativeAutoregressive2023], in the form of a paper—clearly meant to be cited, despite being hard to read. There is the website [@barnettDirectApproach2023], in the form of a blog post—clearly meant to be read, despite not being upper-class enough to be cited in journal papers. Finally there is the interactive model which looks like an optional add-on to the blog post.
In short, an ergodic speaker is someone who has only one speech. If you hear it speak once for a very long time, then hear it speak again for a very long time, then you can take the first and shift it around, so that it looks like the second over a very long sub-segment. Ergodic speakers allow you to take the average over a single very long speech, and be assured that it is close to the average over all possible speeches.
In long, see the appendix on ergodic theory.
There was no mention of whether the bots had to decide the same question.
It still works even if the humans are pseudorandom. We just have to whisper the same RNG seed into both humans’ ears, and then they would behave in the same pseudorandom way.
The simplest counterexample: Suppose the source is binary, and satisfies Xn+1=Xn+1mod2, so it has zero entropy. Nevertheless, the human intentionally guesses wrong the first time. Therefore, we have p2=1, and we have violated the lower bound by 2ln2>0.
This source can be made ergodic by adding an ϵ amount of coin-flip noise: Xn+1=Xn+1mod2 with probability 1−ϵ. This would still give us 2ln2+O(ϵ)>O(ϵlnϵ).
Consider this anecdote from Edward Teller:
Only in a life-or-death situation does 50% of GDP get devoted to one purpose. For example, that is about the level of GDP devoted to war production during WWII in the major combatant countries. The USA spent 4 trillion USD2011 over 6 years out of an annual GDP of 1.3 trillion USD2011.
Solve for x in 200=∑xk=00.5×1.02k.
<iframe src=”https://www.metaculus.com/questions/question_embed/2813/?theme=dark″ style=”height:430px; width:100%; max-width:550px”></iframe>
Pronounced “mu”—it is a pun because both “mu” and “measure” starts with “m”.
That is, except on a subset of measure zero: Pr(T−1(S)−S)=0 and Pr(S−T−1(S))=0. This is the last time we will measure this.
Except pathological examples constructed by logicians who have nothing better to do than to care about the continuum hypothesis, large cardinals, and the arithmetic hierarchy. Those who desire the rigor-mortis of logic, let them have it.