The Marginal Returns of Intelligence
A lot of discussion of intelligence treats it as a scalar value measuring a general capability to solve a wide range of tasks. In this conception, intelligence is primarily a question of having a 'good Map'. This picture is simplistic because it misses the intrinsic limits that the Territory imposes on prediction. Not all tasks or domains have the same marginal returns to intelligence; these can vary wildly.
Let me tell you about a 'predictive efficiency' framework that I find compelling and deep, and that will hopefully put some mathematical flesh on these intuitions. I initially learned about these ideas in the context of Computational Mechanics, but I have since realized that the underlying ideas are much more general.
Let X be a predictor variable that we'd like to use to predict a target variable Y under a joint distribution p(x,y). For instance, X could be the context window and Y the next hundred tokens, or X could be the past market data and Y the future market data.
In any prediction task there are three fundamental, independently varying quantities that you need to think about:
H(Y∣X) is the irreducible uncertainty, the intrinsic noise that remains even when X is known.
E = I(X;Y) = H(Y) − H(Y∣X) quantifies the reducible uncertainty, the amount of predictable information about Y contained in X.
For the third quantity, let us introduce the notion of causal states, or minimal sufficient statistics. We define an equivalence relation on X by declaring
x ∼ x′ if and only if p(Y∣x) = p(Y∣x′).
The resulting equivalence classes, denoted c(X), yield a minimal sufficient statistic for predicting Y. This construction is "minimal" because it groups together all those x that lead to the same predictive distribution p(Y∣x), and it is "sufficient" because, given the equivalence class c(x), no further refinement of X can improve our prediction of Y.
From this, we define the forecasting complexity (or statistical complexity) as
C:=H(c(X)),
which measures the amount of information (the cost in bits) to specify the causal state of X. Finally, the predictive efficiency is defined by the ratio
η = E/C,
which tells us how much of the complexity actually contributes to reducing uncertainty in Y. In many real-world domains, even if substantial information is stored (high C), the gain in predictability (E) might be modest. This situation is often encountered in fields where, despite high skill ceilings (i.e. very high forecasting complexity), the net effect of additional expertise is limited because the predictive information is a small fraction of the complexity.
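To make these definitions concrete, here is a minimal Python sketch of how one might compute E, C, and η for a finite joint distribution given as a probability table. This is my own illustration rather than anything from the Computational Mechanics literature; the function name, the rounding used to group together x's with equal p(Y∣x), and the assumption that every x has positive probability are all mine.

```python
import numpy as np

def predictive_efficiency(p_xy):
    """Compute E = I(X;Y), C = H(c(X)) and eta = E/C for a finite joint
    distribution p_xy, given as a table with rows indexed by x and columns by y.
    Assumes every row has positive marginal probability."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1)            # marginal p(x)
    p_y = p_xy.sum(axis=0)            # marginal p(y)
    cond = p_xy / p_x[:, None]        # p(Y | x), one row per x

    def H(p):
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    # Causal states: group x's that share the same predictive distribution p(Y | x).
    states = {}
    for x, row in enumerate(cond):
        key = tuple(np.round(row, 12))        # rounding tolerance is an assumption
        states[key] = states.get(key, 0.0) + p_x[x]

    E = H(p_y) - sum(p_x[x] * H(cond[x]) for x in range(len(p_x)))  # I(X;Y)
    C = H(np.array(list(states.values())))                          # H(c(X))
    eta = E / C if C > 0 else float("nan")
    return E, C, eta

# Sanity check: Y copies one fair bit of a two-bit X, so E = C = 1 bit and eta = 1.
p = np.array([[0.25, 0.0], [0.25, 0.0], [0.0, 0.25], [0.0, 0.25]])
print(predictive_efficiency(p))
```

When Y is a deterministic function of X, the causal states coincide with the level sets of that function and η comes out as 1; the example below sits at the opposite extreme.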
Example of low efficiency.
Let X ∈ {0,1}^100 be the outcome of 100 independent fair coin flips, so that H(X) = 100 bits.
Define Y∈{0,1} as a single coin flip whose bias is determined by the proportion of heads in X. That is, if x has k heads then:
p(Y=1∣x) = k/100,  p(Y=0∣x) = 1 − k/100.
Total information in Y, H(Y):
When averaged over all possible X, the mean bias is 0.5 so that Y is marginally a fair coin. Hence,
H(Y)=1 bit
Conditional entropy, or irreducible uncertainty, H(Y∣X):
Given X, the outcome Y is drawn from a Bernoulli distribution whose entropy depends on the number of heads in X. For typical x (around 50 heads), H(Y∣x) ≈ 1 bit, and averaging over all X lowers this only slightly. Numerically, one finds:
H(Y∣X) ≈ 0.993 bits.
Predictable information, E = I(X;Y):
With the above numbers, the mutual information is
E = H(Y) − H(Y∣X) ≈ 1 − 0.993 ≈ 0.007 bits.
Forecasting complexity, C = H(c(X)):
The causal state construction groups together all sequences x with the same number k of heads. Since k∈{0,1,...,100}, there are 101 equivalence classes. The entropy of these classes is given by the entropy of the binomial distribution Bin(100,0.5). Using an approximation:
C ≈ (1/2) log2(2πe·(100/4)) = (1/2) log2(2πe·25) ≈ (1/2) log2(427) ≈ 4.37 bits.
Predictive Efficiency η:
η = E/C ≈ 0.007/4.37 ≈ 0.0016.
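These numbers are easy to check by direct enumeration over the 101 head-counts (the causal states), which is cheap even though enumerating all 2^100 sequences is not. Here is a short Python sketch of that calculation; the variable names and the enumeration approach are my own.

```python
import numpy as np
from math import comb, log2

n = 100
ks = np.arange(n + 1)
# Distribution over the causal states, i.e. the head-count k ~ Bin(100, 0.5).
p_k = np.array([comb(n, k) for k in ks], dtype=float) / 2**n

def Hb(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

H_Y = Hb(float((p_k * ks / n).sum()))                       # marginal bias is 0.5 -> 1 bit
H_Y_given_X = float(sum(p_k[k] * Hb(k / n) for k in ks))    # average Bernoulli entropy
E = H_Y - H_Y_given_X                                       # predictable information
C = -float((p_k * np.log2(p_k)).sum())                      # entropy of the causal state
eta = E / C

print(f"H(Y|X) = {H_Y_given_X:.4f} bits, E = {E:.4f} bits, C = {C:.3f} bits, eta = {eta:.5f}")
# With these settings this prints roughly H(Y|X) ≈ 0.993, E ≈ 0.007, C ≈ 4.37, eta ≈ 0.0016.
```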
In this example, a vast amount of internal structural information (the cost to specify the causal state) is required to extract just a tiny bit of predictability. In practical terms, this means that even if one possesses great expertise (analogous to having high forecasting complexity), the net benefit is modest because the inherent predictive efficiency η is low. Such scenarios are common in fields like archaeology or long-term political forecasting, where obtaining a single predictive bit of information may demand enormous expertise, data, and computational resources.
I cannot comment on the math, but intuitively this seems wrong.
Zagorsky (2007) found that while IQ correlates with income, the relationship becomes increasingly non-linear at higher IQs, suggesting exponential rather than logarithmic returns.
Sinatra et al. (2016) found that high-impact research is produced by a small fraction of exceptional scientists, whose output significantly exceeds that of their merely above-average peers.
Lubinski and Benbow in their Study of Mathematically Precocious Youth found that those in the top 0.01% of ability achieve disproportionately greater outcomes than those in (just) the top 1%.
My understanding is that empirical evidence points toward power law distributions in the relationship between intelligence and real-world impact, and that intelligence seems to broadly enable exponentially improving abilities to modify the world in your preferred image. I’m not sure why this is.
The most straightforward explanation would be that there are more underexploited niches for top-0.01%-intelligence people than there are top-1%-intelligence people.
I don’t dispute these facts.
Dunno if this is meant to be inspired by/a formalization of [my previous position against intelligence](https://www.lesswrong.com/posts/puv8fRDCH9jx5yhbX/johnswentworth-s-shortform?commentId=jZ2KRPoxEWexBoYSc). But if it is meant to be inspired by it, I just want to flag/highlight that this is the opposite of my position because I’d say intelligence does super well on this hypothetical task because it can just predict 50/50 and be nearly optimal. (Which doesn’t imply low marginal return to intelligence because then you could go apply the intelligence to other tasks.) I also think it is extremely intelligent [pejorative] of you to say that this sort of thing is common in archaeology and political forecasting.
People read more into this shortform than I intended. It is not a cryptic reaction, criticism, or reply to/of another post.
I don’t know what you mean by intelligent [pejorative] but it sounds sarcastic.
To be clear, the low predictive efficiency is not a dig at archaeology. It seems I have triggered something here.
Whether a question/domain has low or high (marginal) predictive efficiency is not a value judgement, just an observation.
Ah, fair enough! I just thought given the timing, it might be that you had seen my post and thought a bit about the limitations of intelligence.
The reason I call it intelligent is: Intelligence is the ability to make use of patterns. If one were to look for patterns in intelligent political forecasting and archaeology, or more generally patterns in the application of intelligence and in the discussion of the limitations of intelligence, then what you’ve written is a sort of convergent outcome.
It’s [pejorative] because it’s bad.
I mean I’m just highlighting it here because I thought it was probably a result of my comments elsewhere and if so I wanted to ping that it was the opposite of what I was talking about.
If it’s unrelated then… I don’t exactly want to say “carry on” because I still think it’s bad, but I’m not exactly sure where to begin or how you ended up with this line of inquiry, so I don’t exactly have much to comment on.
I am not sure what ‘it’ refers to in ‘it is bad’.