I tried some chess but it’s still pretty bad. Not noticeably better than GPT4.
In my case, introspection led me to the realisation that human reasoning consists to a large degree of two interlocking parts: finding constraints on the solution space, and constraint satisfaction.
Which has the interesting corollary that AI systems that reach human or superhuman performance by adding search to NNs are not really implementing reasoning but rather brute-forcing it.
It also makes me sceptical that LLMs+search will be AGI.
In psychometrics this is called “backward digit span”.
Diminishing returns in loss are not diminishing returns in capabilities. And benchmarks tend to saturate, so diminishing returns are baked in if you look at those.
I am not saying that there aren’t diminishing returns to scale, but I just haven’t seen anything definitive yet.
Frankly, I don’t really understand what you are saying here and I am open to the possibility that I don’t really understand how the gradient works in autoregressive transformers.
But as I said in my other comment, my current understanding is:
In standard attention (for example in an encoder) tokens are not ordered, so it is clear that the gradient of the loss of one of the token predictions (for example a masked token in BERT) flows through all other tokens equally. In autoregressive transformers an order is imposed by masking, but all later tokens attend to all earlier tokens in the same way.
The gradient of the loss of a later token flows through all earlier tokens in the same way. It doesn’t matter whether a token is half the context back or the whole context back, neither for the information flow nor for the gradient flow.
To put it another way: In the n-th layer the last token attends to all the output tokens from the (n-1)-th layer. It doesn’t somehow have to make do with the output of earlier layers for tokens that are further back.
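A toy illustration of that point (made-up dimensions, single attention head): with a causal mask, the last token’s query puts nonzero weight on every earlier position’s previous-layer output, and how far back a token sits plays no special role in the computation.

```python
import torch

T, d = 8, 16                                   # sequence length, head dimension
q = torch.randn(T, d)                          # queries (current layer)
k = torch.randn(T, d)                          # keys   (previous-layer outputs)
v = torch.randn(T, d)                          # values (previous-layer outputs)

scores = q @ k.T / d**0.5
causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal, float("-inf"))   # masking is the only thing imposing order
attn = scores.softmax(dim=-1)
out = attn @ v

print(attn[-1])   # the last token attends to all 8 positions; a token 1 step
                  # back and a token 7 steps back go through the same mechanism
```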
Yeah, the first 99 tokens would be optimized both to be locally the correct character, and also to set things up so that the 100th character is also correct.
That is how LLMs currently work. The gradient of each token prediction does flow back into all the earlier tokens whose information was integrated into the predicted token. So each token optimizes its own next token prediction but also tries to integrate the information that is most useful for future tokens.
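As a small sanity check of that claim (a sketch with one toy layer and hypothetical shapes): the loss on the prediction made at the last position produces nonzero gradients on every earlier token’s embedding.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
T, d, vocab = 100, 32, 50
emb = nn.Embedding(vocab, d)
layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
head = nn.Linear(d, vocab)

tokens = torch.randint(0, vocab, (1, T))
x = emb(tokens)
x.retain_grad()                                       # keep gradients on the embeddings
causal = torch.full((T, T), float("-inf")).triu(diagonal=1)
h = layer(x, src_mask=causal)

# loss only on the logits produced at the last position (the "100th token")
loss = nn.functional.cross_entropy(head(h[:, -1]), torch.tensor([7]))
loss.backward()

print(x.grad.abs().sum(dim=-1))   # nonzero for all 100 positions
```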
I don’t know how people are creating huge context windows these days, but IIRC the way it works is that the longer you look back into your context (and correspondingly the further you are trying to plan ahead) the less of your computation is available. Like, if you have N layers, then for a token M steps back, you only have access to the computation up until layer N-M.
Everything in the context window is equally available. It doesn’t make a difference whether an earlier token is 5 tokens back or 5000. The attention mechanism is an operation over a set of tokens, there is no intrinsic order.
Scaling curves show strongly diminishing returns to $ spend: A $100m model might not be that far behind a $1b model, performance-wise.
What’s your argument for that?
Hah, I didn’t see your answer, but our links complement each other nicely.
I think my first link was the paper that was making some waves when it came out.
This reminds me a lot of a toy project I have in the back of my mind but will probably never get around to:
Which is to train a transformer on the sequences generated by the logic models from the apperception engine paper (in the paper, the engine infers the logic models from the sequences), with the aim of predicting the logic model.
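If I ever get around to it, the framing I have in mind is plain sequence-to-sequence: observed sequence in, serialized logic model out. A minimal sketch (the vocabulary, the serialization of the logic model, and all model sizes are placeholders I made up, not anything from the paper):

```python
import torch
import torch.nn as nn

# toy vocabulary covering sequence symbols and serialized rule tokens
vocab = {tok: i for i, tok in enumerate(
    ["<pad>", "<bos>", "<eos>", "a", "b", "rule", "next", "(", ")", ","])}

def encode(tokens, max_len=16):
    ids = [vocab["<bos>"]] + [vocab[t] for t in tokens] + [vocab["<eos>"]]
    ids += [vocab["<pad>"]] * (max_len - len(ids))
    return torch.tensor(ids[:max_len])

# input: an observed sequence; target: a serialized logic model that generates it
src = encode(["a", "b", "a", "b"]).unsqueeze(0)
tgt = encode(["rule", "next", "(", "a", ",", "b", ")"]).unsqueeze(0)

emb = nn.Embedding(len(vocab), 64)
model = nn.Transformer(d_model=64, nhead=4, num_encoder_layers=2,
                       num_decoder_layers=2, batch_first=True)
head = nn.Linear(64, len(vocab))

L = tgt.size(1) - 1
tgt_mask = torch.full((L, L), float("-inf")).triu(diagonal=1)
out = model(emb(src), emb(tgt[:, :-1]), tgt_mask=tgt_mask)      # teacher forcing
loss = nn.functional.cross_entropy(head(out).transpose(1, 2), tgt[:, 1:],
                                   ignore_index=vocab["<pad>"])
loss.backward()
```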
The wikipedia page on the technological singularity has some background on that. https://en.wikipedia.org/wiki/Technological_singularity
About 1.) That GPT4’s performance jumps most in the non-char tests seems to point towards two sources of difficulty in H-tests, with one of them being tokenization hiding char-level information.
About 2.) To me your results look completely consistent with scale solving H-tests. There are many benchmarks where a certain level has to be reached to leave random performance behind. For your benchmark that level is pretty high, but Claude and GPT4 seem to be above it.
If it’s not scale, what makes Claude and GPT4 capable of making a dent in your benchmark?
About 3.) Finetuning doesn’t convey enough information to completely revamp the representation of the spelling of different tokens. Finetuning mostly doesn’t teach models skills they don’t have. It instead points them squarely at the task they should be doing.
To me the most interesting question is to what extent your network learns to do reasoning/search vs pure pattern recognition.
I trained a transformer to predict tournament chess moves a few years back and my impression was that it played strongly in the opening and always made sensible-looking moves but had absolutely no ability to look ahead.
I am currently working on a benchmark of positions that require reasoning and can’t be solved by highly trained intuition alone. Would you be interested in running such a benchmark?
99.2% square accuracy is consistent with 50% position accuracy. Did you check position accuracy?
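Back-of-the-envelope, assuming errors are independent across the 64 squares:

```python
p_square = 0.992
p_position = p_square ** 64      # probability that all 64 squares are correct
print(p_position)                # ~0.60, so a substantial fraction of positions
                                 # still contains at least one wrong square
```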
There was another Chess-GPT investigation into that question recently by Adam Karvonen: https://adamkarvonen.github.io/machine_learning/2024/01/03/chess-world-models.html
The linear probe accuracy for the board state actually peaks in the sixth layer (out of eight). By the time the model predicts the next move it has already discarded some of the state information. Well, maybe that is unsurprising.
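For reference, that kind of board-state probe is just a single linear map from the layer’s activations to per-square piece classes. A minimal sketch with placeholder data and hypothetical dimensions (nothing here is taken from the actual Chess-GPT setup):

```python
import torch
import torch.nn as nn

d_model, n_squares, n_classes = 512, 64, 13        # 6 pieces x 2 colours + empty
acts = torch.randn(1000, d_model)                   # cached activations at one layer
labels = torch.randint(0, n_classes, (1000, n_squares))  # board state per position

probe = nn.Linear(d_model, n_squares * n_classes)   # linear, no hidden layer
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for _ in range(100):
    logits = probe(acts).view(-1, n_squares, n_classes)
    loss = nn.functional.cross_entropy(logits.flatten(0, 1), labels.flatten())
    opt.zero_grad(); loss.backward(); opt.step()

pred = probe(acts).view(-1, n_squares, n_classes).argmax(-1)
square_acc = (pred == labels).float().mean()               # per-square accuracy
position_acc = (pred == labels).all(dim=1).float().mean()  # all 64 squares correct
```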
It also doesn’t reach terribly impressive accuracy. 99.2% sounds like a lot, but it is per square, which means the model might get a square wrong in every second position.
I think more important than how easy it is to extract the information is how necessary it is to extract it. You can probably be somewhat fuzzy about board state details and still get great accuracy.
There is a step in move prediction where you go beyond the intuitive move selection and have to calculate to find deeper reasons for and against moves. This feels similar to me to attending to your uncertainty about the placement of particular pieces beyond the immediate necessity. And all these models have not taken this step yet.
I’m actually doing an analysis right now to nail down that GPTs don’t calculate ahead when trained on move prediction and stay completely in the intuitive move-selection regime, but it’s not easy to separate intuitive move selection and calculation in a bulletproof way.
System 2 thinking that takes a detour over tokens is fundamentally limited compared to something that continuously modifies a complex, high-dimensional vector representation.
Integration of senses will happen, but is the information density of non-text modalities high enough to contribute to the intelligence of future models?
What I call “learning during problem solving” relies on the ability to extract a lot of information from a single problem. To investigate and understand the structure of this one problem. And, in the process of doing that, to build a representation of this problem that can be leveraged in the future to solve problems that have similar aspects.
I think you have defined me as not really talking as I am on the autism spectrum and have trouble telling emotions from tone.
No, he didn’t. Talking is not listening, and there’s a big difference between being bad at understanding emotional nuance because of cognitive limitations and the information that would be necessary for understanding emotional nuance never even reaching your brain.
Was Stephen Hawking able to talk (late in life)? No, he wasn’t. He was able to write and his writing was read by a machine. Just like GPT4.
If I read a book to my daughter, does the author talk to her? No. He might be mute or dead. Writing and then having your text read by a different person or system is not talking.
But in the end, these are just words. It’s a fact that GPT4 has no control over how what it writes is read, nor can it hear how what it has written is being read.
If the entire representation of a complex task or problem is collapsed into a text, reading that text and trying to push further is not really “reasoning across calls”. I expect that you can go further with that, but not much further. At least that’s what it looks like currently.
I don’t think you can learn to solve very specific complex problems with the kind of continuous learning that would be possible to implement with current models. Some of the theorem-prover papers have continuous learning loops that basically try to do this, but those still seem very inefficient and are applied only to highly formalised problems whose solutions can be automatically verified.
Yes, multi-modality is not a hard limitation.
I think all the assumptions that go into this model are quite questionable, but it’s still an interesting thought.