Here’s a simplified example for people who have never traded in the stock market. We have a biased coin with an 80% probability of heads. What’s the probability of tossing the coin 3 times and getting 3 heads? 51.2%. Given that the first toss was heads, what’s the probability that the other two tosses are also heads? 64%.
Each coin toss is analogous to whether the next model follows or does not follow scaling laws.
With a coin, the options are “heads” and “tails”, so “heads” always moves you in one direction.
With LLMs, the options are “worse than expected”, “just as expected”, “better than expected”, so “just as expected” does not have to move you in a specific direction.
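A minimal Python sketch of the arithmetic above, assuming nothing beyond the 80% heads probability from the example; both the direct computation and a quick simulation reproduce the 51.2% and 64% figures:

```python
import random

P_HEADS = 0.8  # the biased coin from the example

# Exact arithmetic: independent tosses, so probabilities just multiply.
p_three_heads = P_HEADS ** 3     # 0.8^3 = 0.512 -> 51.2%
p_two_more_heads = P_HEADS ** 2  # 0.8^2 = 0.64  -> 64%, given the first toss was heads

print(f"P(3 heads)               = {p_three_heads:.3f}")
print(f"P(3 heads | first heads) = {p_two_more_heads:.3f}")

# Monte Carlo check of the conditional probability.
random.seed(0)
trials = 100_000
first_heads = 0
all_three = 0
for _ in range(trials):
    flips = [random.random() < P_HEADS for _ in range(3)]
    if flips[0]:
        first_heads += 1
        if all(flips):
            all_three += 1
print(f"Simulated P(3 heads | first heads) ≈ {all_three / first_heads:.3f}")
```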
I made a reply. You’re referring to situation b.
I don’t think this analogy works, on multiple levels. As far as I know, there isn’t some known probability that scaling laws will continue to be followed as new models are released. While it is true that a new model continuing to follow scaling laws is additional evidence in favor of future models continuing to follow them, thus shortening timelines, it’s not really clear how much evidence it provides.
This is important because, unlike a coin flip, there are a lot of other details surrounding a new model release that could plausibly affect someone’s timelines. A model’s capabilities are complex, human reactions to them likely more so, and none of that is captured by a yes/no description of whether it’s better than the previous one or follows scaling laws.
Also, your analogy differs from the original comment, since it shifts the question to whether the new AI model follows scaling laws rather than just whether it is better than the previous one (it seems to me that a model could be better than the previous one yet still markedly underperform what scaling laws would predict).
If there’s any obvious mistakes I’m making here I’d love to know, I’m still pretty new to the space.
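One way to see why “how much evidence” is underdetermined: the size of the Bayesian update from an on-trend release depends entirely on likelihoods that are not pinned down anywhere. A minimal sketch in which every number is invented purely for illustration:

```python
# Two exhaustive hypotheses: H1 = "scaling laws keep holding", H2 = "they are breaking down".
# How much an on-trend release moves you depends on how likely that observation is under each.

def posterior(prior_h1: float, p_obs_given_h1: float, p_obs_given_h2: float) -> float:
    """Bayes' rule for P(H1 | on-trend release) with two exhaustive hypotheses."""
    prior_h2 = 1.0 - prior_h1
    evidence = prior_h1 * p_obs_given_h1 + prior_h2 * p_obs_given_h2
    return prior_h1 * p_obs_given_h1 / evidence

prior = 0.5  # hypothetical 50/50 prior on scaling laws continuing

# If an on-trend release is nearly as likely either way, the update is tiny...
print(posterior(prior, p_obs_given_h1=0.9, p_obs_given_h2=0.8))  # ~0.53
# ...but if it would be surprising under H2, the update is large.
print(posterior(prior, p_obs_given_h1=0.9, p_obs_given_h2=0.3))  # 0.75
```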
I’ve made a reply formalising this.
Update based on the replies:
I basically see this as a Markov process.
X(t+1) = P(x(t+1) | x(t), x(t-1), x(t-2), ...) = F(x(t))
where x(t) is a value sampled from the distribution X(t), for all t.
In plain English, given the last value you get a probability distribution for the next value.
In the AI example: Given x(2025), estimate probability distribution X(2030) where x is the AI capability level.
Possibilities
a) The value x(t+1) is determined by the value x(t). There is no randomness; no new information is learned from x(t).
b) The distribution X(t+1) is conditional on the value of x(t). Learning which value x(t) was sampled from the distribution X(t) gives you new information. However, you happened to sample a value such that
P(x(t+1) | x(t), x(t-1), x(t-2), ...) = P(x(t+1) | x(t-1), x(t-2), ...).
You got lucky, and the sampled value ensures the distribution remains the same.
c) You learned new information and the probability distribution also changed.
a is possible but seems to imply overconfidence to me.
b is possible but seems to imply extraordinary luck to me, especially if it’s happening multiple times.
c seems like the most likely situation to me.
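A small numerical sketch of these three possibilities, using an invented toy model (a Gaussian random walk over capability; the drift and noise values are assumptions made purely for illustration):

```python
import random

# Toy Markov model: x(t+1) ~ Normal(x(t) + drift, sigma).
# Compare the forecast of x(t+1) made from x(t-1) alone with the forecast after observing x(t).
drift, sigma = 1.0, 0.5
x_prev = 10.0  # x(t-1), e.g. some capability score

# Before seeing x(t), the forecast mean for x(t+1) is x(t-1) + 2*drift.
mean_before = x_prev + 2 * drift

# Case (a) would be sigma = 0: x(t) is exactly x(t-1) + drift and observing it teaches nothing.
# Case (b) is the coincidence x(t) == x(t-1) + drift, which leaves the point forecast unchanged
#   (though even then the spread around it narrows).
# Generically you land in case (c): the observed x(t) moves the forecast.
random.seed(1)
x_t = random.gauss(x_prev + drift, sigma)
mean_after = x_t + drift

print(f"observed x(t) = {x_t:.2f}")
print(f"forecast mean for x(t+1) before observing x(t): {mean_before:.2f}")
print(f"forecast mean for x(t+1) after observing x(t):  {mean_after:.2f}")
```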
Another way of operationalizing the objections to your argument is: what is the analogue of the event “flips heads”? If the predicate used is “conditional on AI models achieving power level X, what is the probability of event Y?” and the new model is below level X, then by construction we have gained 0 bits of information about it.
Obviously this example is a little contrived, but not that contrived, and trying to figure out which predicates would be fair to register will surface more objections to your original statement.
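For concreteness, one hedged way to cash out “gained 0 bits of information”: measure the bits gained about an event A from an observation B as log2(P(A | B) / P(A)). The choice of measure (pointwise mutual information) and the reuse of the 80% coin numbers are illustrative assumptions only:

```python
import math

def bits_gained(p_a_given_b: float, p_a: float) -> float:
    """Bits gained about event A from observing B: log2(P(A|B) / P(A))."""
    return math.log2(p_a_given_b / p_a)

p_three_heads = 0.8 ** 3        # prior: 0.512
p_given_first_heads = 0.8 ** 2  # 0.64, since "first flip heads" is relevant evidence

print(bits_gained(p_given_first_heads, p_three_heads))  # ~0.32 bits toward "three heads"

# If the observation is irrelevant to the predicate -- e.g. the predicate is conditional on
# reaching level X and the new model sits below X -- the conditional probability equals the
# prior and exactly 0 bits are gained.
print(bits_gained(p_three_heads, p_three_heads))  # 0.0
```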
I’ve made a reply formalising this.