Thanks!
How does this coefficient relate to the maximal slope (i.e. at the 50%-x)?
Very possible.
I plan to watch this a bit longer and also analyse how the trend changes with repo size.
The way METR time horizons tie into AI 2027 is very narrow: as a trend, not even necessarily on coding/software engineering skills in general, but on machine learning engineering specifically. I think that is hard to attack except by claiming that the trend will taper off. AI 2027 does not require unrealistic generalisation.
The reason why I think that time horizons are much more solid evidence of AI progress than earlier benchmarks is that the calculated time horizons explain the trends in AI-assisted coding over the last few years very well. For example, it’s not by chance that “vibe coding” became a thing when it became a thing.
I have computed time horizon trends for more general software engineering tasks (i.e. with a bigger context) and my preliminary results point towards a logistic trend, i.e. the exponential is already tapering off. However, I am still pretty uncertain about that.
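Roughly, the kind of comparison I mean looks like this (a minimal sketch with scipy; the dates and horizon values below are made-up placeholders, not my actual data): fit both an exponential (linear in log-space) and a logistic curve to the time-horizon estimates and compare the residuals.

```python
# Minimal sketch of the exponential-vs-logistic comparison (scipy).
# All numbers are placeholders, not real data.
import numpy as np
from scipy.optimize import curve_fit

t = np.array([0, 6, 12, 18, 24, 30, 36])                # months since some reference date
log_h = np.array([0.1, 0.3, 0.8, 1.4, 1.8, 2.0, 2.1])   # log2(time horizon in hours)

def exponential_fit(t, a, b):
    # exponential growth of the horizon = a straight line in log-space
    return a + b * t

def logistic_fit(t, c, L, k, t0):
    # logistic in log-space = growth that tapers off towards c + L
    return c + L / (1 + np.exp(-k * (t - t0)))

p_exp, _ = curve_fit(exponential_fit, t, log_h)
p_log, _ = curve_fit(logistic_fit, t, log_h, p0=[0.0, 2.0, 0.3, 15.0], maxfev=10000)

for name, f, p in [("exponential", exponential_fit, p_exp), ("logistic", logistic_fit, p_log)]:
    rss = np.sum((log_h - f(t, *p)) ** 2)
    print(f"{name}: residual sum of squares = {rss:.3f}")
```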
And why would anybody do that?
I think babysitting a baby is not very informative about whether you would enjoy having kids. Having a kid is first and foremost about having the deepest and most meaningful emotional connection of your life.
Take that away and you just don’t have a sensible test run. It’s like finding out whether you like hiking by going up and down the stairs of your apartment building all morning.
Having kids is like having parents, except the emotional connection is stronger in the other direction. Would you rather have grown up in an orphanage if that had meant more time for your hobbies and other goals?
I think the most important thing has not been mentioned yet:
How you dress and take care of yourself gives the very first and often the only impression of how much you have your shit together. Having your shit together—doing the things you need to do in time and doing them well—is the most important trait in a long-term partner.
If the one clearly fucked up receptor copy is sufficient for your “symptoms”, it seems pretty likely that one of your parents should have them too. I think there is no reason to expect a de novo mutation to be particularly likely in your case (unlike in cases that lead to severe dysfunction). And of course you can check for that by sequencing your parents.
So my money would be on the second copy also being sufficiently messed up that you have basically no fully functioning oxytocin receptors. If you have siblings and you are the only odd one in the family, you could make a pretty strong case for both copies being messed up, by showing that you are the only one with the combination of frameshift in one copy and particular SNPs in the other. (If you are not the only odd one you can make an even stronger case).
Seems a lot harder to write a post a day if one is not holed up in Lighthaven.
Heard that story many times from or about exchange students to the US.
What gives you the impression of low integrity?
There’s an interestingly pernicious version of a selection effect that occurs in epistemology, where people can be led into false claims because when people try to engage with arguments, people will drop out at random steps, and past a few steps or so, the people who believe in all the arguments will have a secure-feeling position that the arguments are right, and that people who object to the arguments are (insane/ridiculous/obviously trolling), no matter whether the claim is true:
I find this difficult to parse: people, people, people, people, people.
These seem to be at least three different kinds of people: the evangelists, the unconvinced (who drop out) and the believers (who don’t drop out). Not clearly distinguishing between these groups makes the whole post more confusing than necessary.
The function of the feedforward components in transformers is mostly to store knowledge and to enrich the token vectors with that knowledge. The wider you make the ff-network, the more knowledge you can store. The network is trained to put the relevant knowledge from the wide hidden layer into the output (i.e. into the token stream).
I fail to see the problem in the fact that the hidden activation is not accessible to future tokens. The ff-nn is just a component to store and inject knowledge. It is wide because it has to store a lot of knowledge, not because the hidden activation has to be wide. The full content of the hidden activation in isolation just is not that relevant.
Case in point: nowadays the ff-nns actually look different than in GPT-3. They have two hidden layers, with one acting as a gating mechanism: the design has changed to allow actively erasing part of the hidden state!
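For illustration, a minimal sketch of the two designs (PyTorch, made-up dimensions; the exact activation functions and sizes vary between models):

```python
# Classic GPT-3-style MLP vs. a gated (SwiGLU-style) MLP, as used in newer models.
# In the gated variant, near-zero gate values multiplicatively "erase" parts of
# the hidden activation before it is projected back into the token stream.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassicFFN(nn.Module):
    """GPT-3-style MLP: up-project, nonlinearity, down-project."""
    def __init__(self, d_model=768, d_hidden=3072):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        return self.down(F.gelu(self.up(x)))

class GatedFFN(nn.Module):
    """Gated MLP: a second up-projection acts as a gate on the hidden state."""
    def __init__(self, d_model=768, d_hidden=3072):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.gate = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        hidden = self.up(x)
        gate = F.silu(self.gate(x))   # gate values near zero suppress hidden units
        return self.down(hidden * gate)
```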
Also: this seems very different from what you are talking about in the post; it has nothing to do with “the next run”. The hidden layer activations aren’t even “accessible” in the same run! They are purely internal “gears” of a subcomponent.
It also seems to me like you have retreated from
with its intermediate states (“working memory”) completely wiped.
to “intermediate activations of ff-components are not accessible in subsequent layers, and because these are wider than the output, not all information contained therein can make it into the output”.
What I was pointing to was the fact that the feed forward networks for the new token don’t have access to the past feed-forward states of the other tokens [...] When curing cancer the second time, it didn’t have access to any of the processing from the first time. Only what previous layers outputted for previous tokens.
That is the misconception. I’ll try to explain it in my own words (because, frankly, despite knowing how a transformer works, I can’t understand Radford Neal’s explanation).
In the GPT architecture each token starts out as an embedding, which is then in each layer enriched with information from previous tokens and knowledge stored in the nn itself. So for each token you have a vector which is modified in each layer; let’s call the output of the $l$-th layer $v_l$.
The computation of $v_{l+1}$ accesses the $v_l$ of all previous tokens! So in your example, if in layer $l$ at some token the cure for cancer is discovered, all following tokens will have access to that information in layer $l+1$. The model cannot forget this information. It might never access it again, but the information will always be there for the taking.
This is in contrast to a recurrent neural network that might actually forget important information if it is unfortunate in editing its state.
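To make this concrete, here is a toy sketch of a single attention layer with a KV-cache (plain PyTorch, made-up sizes; it ignores multi-head attention, layer norm, etc.): every new token re-reads the cached keys/values derived from all previous tokens’ layer outputs, and nothing is ever overwritten.

```python
# Toy single-head attention with a KV-cache: the per-token states only accumulate.
import torch

d = 16                       # toy model width
Wq = torch.randn(d, d)       # query/key/value projections of one attention layer
Wk = torch.randn(d, d)
Wv = torch.randn(d, d)

k_cache, v_cache = [], []    # the KV-cache: grows, is never wiped

def attend(new_token_vec):
    """Process one new token; it attends to ALL cached previous tokens."""
    q = new_token_vec @ Wq
    k_cache.append(new_token_vec @ Wk)
    v_cache.append(new_token_vec @ Wv)
    K = torch.stack(k_cache)                 # (num_tokens_so_far, d)
    V = torch.stack(v_cache)
    scores = torch.softmax(K @ q / d**0.5, dim=0)
    return scores @ V                        # weighted read of every earlier token

for t in range(5):
    out = attend(torch.randn(d))
print(len(k_cache))  # 5: every previous per-token state is still available
```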
I think that even if AI 2027 is directionally correct (very fast AI progress), the concrete details are likely to be wrong, so I’m not sure how impressed one should be if your predictions turn out to be correct.
About “it’s all just vibes”: AI 2027 is strongly based on the METR time horizon analysis. I think it would be more fruitful to critique and analyse that. Stuff like the time from SC to SAI seems like epicycles. Though the biggest uncertainty in AI 2027 probably comes from the assumption of recursive improvement.
I am not sure how fruitful the “shallow vs deep thinking” terminology is. What you explain in more detail is what I call “knowledge integration” and “learning while problem solving”, both of which are about humans having more powerful representations that can be modified while mulling stuff over and improved by integrating data from other domains.
Your algorithmic explanation for LLM shortcomings seems to be wrong and based on a misunderstanding of how LLMs work:
As joseph_c already mentioned, the human brain (as an nn architecture) is much, much wider and shallower than a GPT. One of your examples, coming up with clever jokes, also doesn’t give humans enough time to engage in a lot of recursive thought.
Also, LLMs do actually keep the entire earlier state around, that’s what the KV-cache is. The computation of each new token does access the fine-grained vector representation of earlier tokens. There is no memory wiping going on.
I think the opposite is correct: LLMs are not nearly wide enough. As a consequence their representation of the “the problem” or “the situation” is impoverished.
I think this insight is really interesting! Especially the potential connection to LLMisms.
But I don’t really understand why you chose these experiments. It seems to me the things to check or prove are:
current tokenizers do actually tokenize typical training data so that short tokens are more common
current models do produce text that recapitulates this bias
how strongly top-k sampling exacerbates this bias, depending on k
how this changes some typical completions
You do significantly more work to show the effect in a toy setting that may or may not bear on the real case. And I think the outcome of your experiments is already clear before you do them because the effect of top-k sampling on tokens with low/high probability is not complicated (and well explained by you in the post).
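For what it’s worth, the mechanism itself fits in a few lines (plain NumPy, with made-up toy probabilities): top-k zeroes out everything outside the k most probable tokens and renormalises, so whatever tokens tend to sit in the tail of the distribution get systematically suppressed.

```python
# Toy illustration of top-k truncation: tail tokens lose all probability mass.
import numpy as np

def top_k_sample_probs(probs, k):
    """Keep the k most probable tokens, renormalise, drop the rest to zero."""
    probs = np.asarray(probs, dtype=float)
    keep = np.argsort(probs)[-k:]            # indices of the k largest probabilities
    truncated = np.zeros_like(probs)
    truncated[keep] = probs[keep]
    return truncated / truncated.sum()

vocab_probs = [0.40, 0.30, 0.15, 0.10, 0.05]   # toy next-token distribution
print(top_k_sample_probs(vocab_probs, k=2))    # [0.571, 0.429, 0, 0, 0] -- tail mass gone
```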
because tokens are too low bandwidth
That’s also my impression: https://www.lesswrong.com/posts/KrgBkqeChtAWuPsLP/what-llms-lack
The 4-month doubling trend implies getting to an 8h+ horizon length by early 2026 and an order of magnitude more by mid-2027. If the best time horizon length in mid-2027 were 9h, would you feel like you had won the argument, even if you had won the bet?
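Just to make the arithmetic explicit (assuming “early 2026” ≈ January 2026 and “mid-2027” ≈ July 2027, i.e. roughly 18 months apart):

$8\,\text{h} \times 2^{18/4} \approx 8\,\text{h} \times 22.6 \approx 180\,\text{h}$

So continuing the 4-month doubling from an 8h horizon in early 2026 lands well over an order of magnitude higher by mid-2027, which is why a 9h result would be far below trend.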
I think it is a cool idea and has its applications, but you are right that it seems very unlikely to contribute to AGI in any way. There was nonetheless excitement about integrating KANs into transformers, which was easy to do but just didn’t improve anything.
For me the linked site with the statement doesn’t load. And this was also the case when I first tried to access it yesterday. Seems less than ideal.