It seems to me that current LLMs learn hardly anything from their context, since they have trouble fitting it into their attention span. For example, GPT-5 can create fun stuff from a single prompt, and an unpublished LLM solved five out of six problems of IMO 2025, even though all six problems together can be expressed in about 3 KB. However, METR found that “on 18 real tasks from two large open-source repositories, early-2025 AI agents often implement functionally correct code that cannot be easily used as-is, because of issues with test coverage, formatting/linting, or general code quality.”
I strongly suspect that this bottleneck will be ameliorated by using neuralese[1] with a big internal memory.
Neuralese with big internal memory
The Meta paper which introduced neuralese trained GPT-2 to feed the thought at the end back in at the beginning. The number of bits transferred per step equals the number of bits in a floating-point number multiplied by the size of the final layer. A token-based CoT, by contrast, generates only ~16.6 extra bits of information per token (roughly log2 of a 100k-token vocabulary).
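The feedback loop can be illustrated with a toy sketch. This is not the actual Meta architecture; the dimensions and the stand-in "transformer pass" are hypothetical, and the point is only that the final hidden state is reused as the next input embedding instead of being collapsed into a discrete token:

```python
import numpy as np

# Toy sketch of latent-thought feedback (hypothetical shapes, not the
# real Meta model): instead of sampling a token from the final hidden
# state, the hidden state itself becomes the next step's input, so no
# information is lost to discretization.

rng = np.random.default_rng(0)
d_model = 64  # toy hidden size

# Stand-in for one forward pass: a fixed random linear map plus tanh.
W = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

def forward(h):
    return np.tanh(h @ W)

h = rng.standard_normal(d_model)  # initial "prompt" embedding

# A token CoT would compress h to ~16.6 bits here; latent feedback
# instead carries all d_model floats (d_model * 32 bits at float32).
for _ in range(5):      # five latent "thought" steps
    h = forward(h)      # the thought at the end is fed into the beginning

print(h.shape)  # (64,)
```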
At the cost of a total loss of interpretability, neuralese on steroids could let an LLM of GPT-3's scale transfer tens of millions of bits[2] through the latent space. Imagine GPT-3 175B (which had 96 layers with a hidden dimension of 12288) augmented so that the last layer's activations are used as a steering vector at the first layer, the second-to-last layer's at the second layer, and so on. Or the steering vectors could first be passed through a matrix. These augmentations at most double the compute required to run GPT-3, while requiring extra millions of bytes of dynamic memory.
For comparison, the human brain's short-term memory alone is described by the activations of around 86 billion neurons. And that's ignoring medium-term and long-term memory...
However, there is Knight Lee's proposal to have AIs generate multiple tokens in parallel instead of using versions of neuralese.
For comparison, the longest context window, used by Google Gemini, is 1M tokens long; at ~16.6 bits per token, that comes to about 16.6M bits.
People have been talking about neuralese since at least the publication of AI 2027, and I think much earlier, but it doesn't seem to have materialized.