I’ve heard that story many times from or about exchange students to the US.
What gives you the impression of low integrity?
> There’s an interestingly pernicious version of a selection effect that occurs in epistemology, where people can be led into false claims because when people try to engage with arguments, people will drop out at random steps, and past a few steps or so, the people who believe in all the arguments will have a secure-feeling position that the arguments are right, and that people who object to the arguments are (insane/ridiculous/obviously trolling), no matter whether the claim is true:
I find this difficult to parse: people, people, people, people, people.
These seem to be at least three different kinds of people: the evangelists, the unconvinced (who drop out) and the believers (who don’t drop out). Not clearly distinguishing between these groups makes the whole post more confusing than necessary.
The function of the feedforward components in transformers is mostly to store knowledge and to enrich the token vectors with that knowledge. The wider you make the ff-network, the more knowledge you can store. The network is trained to put the relevant knowledge from the wide hidden layer into the output (i.e. into the token stream).
I fail to see the problem in the fact that the hidden activation is not accessible to future tokens. The ff-nn is just a component to store and inject knowledge. It is wide because it has to store a lot of knowledge, not because the hidden activation has to be wide. The full content of the hidden activation in isolation just is not that relevant.
Case in point: Nowadays the ff-nns actually look different than in GPT-3. They have two parallel hidden projections, with one acting as a gating mechanism: the design has changed to make it possible to actively erase parts of the hidden state!
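To make the contrast concrete, here is a minimal numpy sketch of the two block types; the SwiGLU-style gate (as used e.g. in LLaMA) is just my illustrative stand-in for the newer gated designs, and the function names are made up:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, as used in GPT-style ff-blocks
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def silu(x):
    return x / (1.0 + np.exp(-x))

def ffn_classic(x, W_in, W_out):
    # GPT-3-style block: widen, nonlinearity, project back into the token stream.
    h = gelu(x @ W_in)          # wide hidden activation (typically 4x the model width)
    return h @ W_out            # only this projection is visible to later layers

def ffn_gated(x, W_gate, W_up, W_down):
    # Gated block (SwiGLU-style): a second hidden projection multiplicatively
    # gates the first, so parts of the hidden state can be actively zeroed out.
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down
```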
Also: This seems very different from what you are talking about in the post; it has nothing to do with “the next run”. The hidden layer activations aren’t even “accessible” in the same run! They are purely internal “gears” of a subcomponent.
It also seems to me like you have retreated from

> with its intermediate states (“working memory”) completely wiped.

to “intermediate activations of ff-components are not accessible in subsequent layers, and because these are wider than the output, not all information contained in them can make it into the output”.
> What I was pointing to was the fact that the feed forward networks for the new token don’t have access to the past feed-forward states of the other tokens [...] When curing cancer the second time, it didn’t have access to any of the processing from the first time. Only what previous layers outputted for previous tokens.
That is the misconception. I’ll try to explain it in my own words (because frankly, despite knowing how a transformer works, I can’t understand Radford Neal’s explanation).
In the GPT architecture each token starts out as an embedding, which is then in each layer enriched with information from previous tokens and with knowledge stored in the nn itself. So you have a vector which is modified in each layer; let’s call the output of the $l$-th layer for token $t$: $v_t^l$.

The computation of $v_t^{l+1}$ accesses the $v^l$ of all previous tokens! So in your example, if in layer $l$ at some token the cure for cancer is discovered, all following tokens will have access to that information in layer $l+1$. The model cannot forget this information. It might never access it again, but the information will always be there for the taking.
This is in contrast to a recurrent neural network that might actually forget important information if it is unfortunate in editing its state.
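To make this concrete, here is a toy single-head numpy sketch (my own illustration, with layer norm and the ff-block left out) of how the layer-$(l+1)$ vectors are computed from the layer-$l$ vectors of all positions up to $t$:

```python
import numpy as np

def next_layer(V_l, Wq, Wk, Wv):
    # Toy single-head causal attention: each position t computes its
    # layer-(l+1) vector from the layer-l vectors of ALL positions <= t.
    Q, K, V = V_l @ Wq, V_l @ Wk, V_l @ Wv              # (T, d) each
    T, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    scores[np.triu_indices(T, k=1)] = -np.inf           # causal mask: no peeking at future tokens
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return V_l + w @ V   # residual add (square projections assumed): earlier tokens' info flows into every later token
```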
I think even in the case that AI 2027 is directionally correct (very fast AI progress), the concrete details are likely to be wrong, so I’m not sure how impressed one should be if your predictions turn out to be correct.
About “it’s all just vibes”: AI 2027 is strongly based on the METR time horizon analysis. I think it would be more fruitful to critique and analyse that. Stuff like the time from SC to SAI seems like epicycles. Though the biggest uncertainty in AI 2027 probably comes from the assumption of recursive improvement.
I am not sure how fruitful the “shallow vs deep thinking” terminology is. What you explain in more detail is what I call “knowledge integration” and “learning while problem solving”, which are both about humans having more powerful representations that can be modified while mulling stuff over and improved by integrating data from other domains.
Your algorithmic explanation for LLM shortcomings seems to be wrong and based on a misunderstanding of how LLMs work:
As joseph_c already mentioned, the human brain (as an nn architecture) is much, much wider and shallower than a GPT. One of your examples, coming up with clever jokes, also happens too quickly for humans to engage in a lot of recursive thought.
Also, LLMs do actually keep the entire earlier state around; that’s what the KV-cache is. The computation of each new token does access the fine-grained vector representations of earlier tokens. There is no memory wiping going on.
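As a minimal sketch of what the KV-cache does during incremental decoding (toy single-head numpy, my own illustration rather than any particular implementation):

```python
import numpy as np

class LayerKVCache:
    # Toy per-layer cache for incremental decoding: the key/value vectors of
    # every earlier token are kept and re-read at each new token; nothing is wiped.
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        self.keys.append(k)                    # store this token's state ...
        self.values.append(v)
        K = np.stack(self.keys)                # ... next to all previous tokens' states
        V = np.stack(self.values)
        scores = K @ q / np.sqrt(q.shape[-1])
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ V                           # the new token reads every earlier representation
```

(Real implementations batch this and cap it at the context length, but within the context past states are never erased.)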
I think the opposite is correct: LLMs are not nearly wide enough. As a consequence their representation of “the problem” or “the situation” is impoverished.
I think this insight is really interesting! Especially the potential connection to LLMisms.
But I don’t really understand why you chose these experiments. It seems to me the things to check or prove are:
- current tokenizers do actually tokenize typical training data so that short tokens are more common
- current models do produce text that recapitulates this bias
- how top-k sampling exacerbates this bias, depending on k
- how this changes some typical completions
You do significantly more work to show the effect in a toy setting that may or may not bear on the real case. And I think the outcome of your experiments is already clear before you do them because the effect of top-k sampling on tokens with low/high probability is not complicated (and well explained by you in the post).
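For reference, a minimal sketch of top-k sampling that makes explicit where the cutoff bites (my own illustration, not code from your post):

```python
import numpy as np

def top_k_sample(logits, k, rng=None):
    # Keep only the k most probable tokens, renormalise, and sample from them.
    # Everything outside the top k -- which tends to include rarer, often longer
    # tokens -- gets its probability set to exactly zero.
    if rng is None:
        rng = np.random.default_rng()
    top = np.argsort(logits)[-k:]                  # indices of the k largest logits
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()
    return top[rng.choice(len(top), p=probs)]
```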
> because tokens are too low bandwidth
That’s also my impression: https://www.lesswrong.com/posts/KrgBkqeChtAWuPsLP/what-llms-lack
The 4-month doubling trend implies an 8h+ horizon length by early 2026 and an order of magnitude more by mid-2027. If the best time horizon length in mid-2027 were 9h, would you feel like you had won the argument, even if you had won the bet?
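(To spell out the arithmetic, assuming a roughly 2h best horizon in mid-2025 as the starting point, which is my assumption rather than part of the bet: two doublings give 2h × 2² = 8h by early 2026, and the roughly 18 months from there to mid-2027 give another 2^(18/4) ≈ 22×, i.e. around 180h.)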
I think it is a cool idea and has its applications, but you are right that it seems very unlikely to contribute to AGI in any way. There was nonetheless excitement about integrating KANs into transformers, which was easy to do but just didn’t improve anything.
SSMs are really quite similar to transformers. As with all the “sub-quadratic” transformer variants, the expectation is at best that they will do the same thing but more efficiently than transformers.
HRMs or continuous thought machines or KANs, on the other hand, contain new and different ideas that make a discontinuous jump in abilities at least conceivable. So I think one should distinguish between those two types of “promising new architectures”.
My view is that these new ideas accumulate and at some point somebody will be able to put them together in a new way to build actual AGI.
But the authors of these papers are not stupid. If there were straightforward applicability to language modelling, they would already have done that. If there were line of sight to GPT-4-level abilities in six months, they probably wouldn’t publish the paper.
Empathy is not: That person acts like this. How would I feel if I acted like this? Oh, absolutely disgusted with myself.

Empathy is: This person acts like this. How must he feel inside to act like this? Have I ever felt like that? Can I understand or extrapolate from my experiences how this would be? Maybe from my internal states when I was really exhausted or hangry or drunk or in a rage or depressed? Could I imagine having this internal state, such that I would act like this? This also involves asking how the internal state would have to be different so that you would not feel disgusted with yourself.
I think Sailer had it right 30 years ago. It’s mostly just behavioral and physical masculinity/femininity. That may be unfair, but it’s not racism.
Is there already an METR evaluation of Claude 4?
I read that this “spoiled meat” story is pretty overblown. And it doesn’t pass the sniff test either. Most meat was probably eaten right after slaughter, because why wouldn’t you?
Also, herbs must have been cheaply available. I also recently learned that every household in medieval Europe had a mother of vinegar.
What LLMs lack
I played a game against GPT-4.5 today and it seemed to be the strongest LLM I have played so far. It didn’t hallucinate once, didn’t blunder, and reached a drawn endgame after 40 moves.
What helps me to overcome the initial hurdle to start doing work in the morning:
- Write a list of the stuff you have to do the next day.
- Make it very fine-grained, with single tasks (especially the first few) being basically no effort.
- Tick them off one by one.
Also:
- Tell people what you have to do, when you are going to do it, and that you have done it. Like a colleague, or your team, or your boss.
- Do stuff with other people, either actually together, like pair programming, or closely intertwined.
I think it also helps to take something you are good at and feel good about, and in that context take responsibility for something and/or interact with or present to people. Only this kind of social success will build the confidence to overcome social anxiety, but directly trying to do the social stuff you feel worst about usually backfires (at least for me).
Which is exactly what I am doing in the post? By saying that the question of consciousness is a red herring, i.e. not that relevant to the question of personhood?
Seems a lot harder to write a post a day if one is not holed up in Lighthaven.