Yep, but it’s statistically unlikely. It is easier for order to disappear than for order to emerge.
I’ve spoken to some other people about Remark 1, and they also seem doubtful that token deletion is an important mechanism to think about, so I’m tempted to defer to you.
But on the inside view:
The finite context window is really important. 32K is close enough to infinity for most use-cases, but that’s because users and orgs are currently underutilising the context window. The correct way to utilise the 32K context window is to fill it with any string of tokens which might help the computation.
Here’s some fun things to fill the window with —
A summary of facts about the user.
A summary of news events that occurred after training ended. (This is kinda what Bing Sydney does — Bing fills the context window with the search results for keywords in the user’s prompt.)
A “compiler prompt” or “library prompt”. This is a long string of text which elicits high-level, general-purpose, quality-assessed simulacra which can then be mentioned later by the users. (I don’t mean simulacra in a woo manner.) Think of “import numpy as np” but for prompts.
Literally anything other than null tokens.
I think no matter how big the window gets, someone will work out how to utilise it. The problem is that the context window has grown faster than prompt engineering, so no one has realised yet how to properly utilise the 32K window. Moreover, the orgs (Anthropic, OpenAI, Google) are too much “let’s adjust the weights” (fine-tuning, RLHF, etc), rather than “let’s change the prompts”.
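To make this concrete, here is a minimal hypothetical sketch (the helper names and structure are mine, not any org’s actual pipeline) of filling the window with a library prompt, user facts, and post-cutoff news ahead of the user’s query, rather than leaving most of the budget empty:

```python
# Hypothetical sketch, not a real API: assemble a 32K context window from
# useful components instead of leaving most of it unused.

def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: roughly one token per word."""
    return len(text.split())

def build_prompt(components, user_query: str, budget: int = 32_000) -> str:
    """Concatenate labelled components, skipping any that would exceed the
    token budget, and always keep the user's query at the end."""
    budget -= count_tokens(user_query)
    kept, used = [], 0
    for name, text in components:
        cost = count_tokens(text)
        if used + cost > budget:
            continue  # a real system would prioritise or summarise instead
        kept.append(f"### {name}\n{text}")
        used += cost
    kept.append(f"### user query\n{user_query}")
    return "\n\n".join(kept)

prompt = build_prompt(
    components=[
        ("library prompt", "(long text eliciting reusable, quality-assessed simulacra)"),
        ("user facts", "The user is a chemist who prefers SI units."),
        ("news summary", "(search results retrieved after the training cutoff)"),
    ],
    user_query="Summarise today's lab results for me.",
)
print(prompt)
```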
My guess is that the people voting “disagree” think that including the distillation in your general write-up is sufficient, and that you don’t need to make the distillation its own post.
Almost certainly ergodic in the limit. But it’s highly periodic due to English grammar.
Yep, just for convenience.
Yep.
Temp = 0 would give exactly absorbing states.
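A minimal sketch of why (toy logits, plain Python, not any particular model’s sampler): sampling from softmax(logits / T) degenerates to a deterministic argmax at T = 0, so once greedy decoding enters a loop of states it can never leave them, which is exactly an absorbing set.

```python
import math, random

def sample_next(logits, temperature):
    """Sample an index from softmax(logits / T); T = 0 collapses to a deterministic argmax."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    weights = [math.exp(l / temperature) for l in logits]
    return random.choices(range(len(logits)), weights=weights)[0]

logits = [2.0, 1.9, 0.1]
print([sample_next(logits, 1.0) for _ in range(10)])  # stochastic: sometimes leaves the top token
print([sample_next(logits, 0.0) for _ in range(10)])  # always the same token: a greedy loop is never exited
```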
Maybe I should break this post down into different sections, because some of the remarks are about LLM Simulators, and some aren’t.
Remarks about LLM Simulators: 7, 8, 9, 10, 12, 17
Other remarks: 1, 2, 3, 4, 5, 6, 11, 13, 14, 15, 16, 18
Remarks 1–18 on GPT (compressed)
Yeah, I broadly agree.
My claim is that the deep metaphysical distinction is between “the computer is changing transistor voltages” and “the computer is multiplying matrices”, not between “the computer is multiplying matrices” and “the computer is recognising dogs”.
Once we move to a language game in which “the computer is multiplying matrices” is appropriate, then we are appealing to something like the X-Y Criterion for assessing these claims.
The sentences are more true the tighter the abstraction is —
The machine does X with greater probability.
The machine does X within a larger range of environments.
The machine has fewer side effects.
The machine is more robust to adversarial inputs.
Etc
But SOTA image classifiers are better at recognising dogs than humans are, so I’m quite happy to say “this machine recognises dogs”. Sure, you can generate adversarial inputs, but you could probably do that to a human brain as well if you had an upload.
We could easily train an AI to be 70th percentile at recognising human emotions, but (as far as I know) no one has bothered to do this because there is ~0 tangible benefit, so it wouldn’t justify the cost.
Recognising dogs by ML classification is different to recognising dogs using cells in your brain and eyes
Yeah, and the way that you recognise dogs is different from the way that cats recognise dogs. Doesn’t seem to matter much.
as though it were exactly identical
Two processes don’t need to be exactly identical to do the same thing. My calculator adds numbers, and I add numbers. Yet my calculator isn’t the same as my brain.
when you invoke pop sci
Huh?
No it’s not because one is sacred and the other is not, you’ve confused sacredness with varying degrees of complexity.
What notion of complexity do you mean? People are quite happy to accept that computers can perform tasks with high k-complexity or t-complexity. It is mostly “sacred” things (in the Hansonian sense) that people are unwilling to accept.
or you could continue to say AI feels things
Nowhere in this article do I address AI sentience.
I’m probably misunderstanding you but —
A task is a particular transformation of the physical environment.
COPY_POEM is the task which turns one page of poetry into two copies of the poetry.
The task COPY_POEM would be solved by a photocopier or a plagiarist schoolboy.
WRITE_POEM is the task which turns no pages of poetry into one page of poetry.
The task WRITE_POEM would be solved by Rilke or a creative schoolboy.
But the task COPY_POEM doesn’t reduce to WRITE_POEM.
(You can imagine that although Rilke can write original poems, he is incapable of copying an arbitrary poem that you hand him.)
And the task WRITE_POEM doesn’t reduce to COPY_POEM.
(My photocopier can’t write poetry.)
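For concreteness, a toy sketch of the two tasks as transformations with different signatures (the Python framing is mine; the point is just that an implementation of one does not hand you an implementation of the other):

```python
from typing import Callable, Tuple

# COPY_POEM: one page of poetry in, two identical pages out.
CopyPoem = Callable[[str], Tuple[str, str]]

# WRITE_POEM: no poetry in, one page of poetry out.
WritePoem = Callable[[], str]

def photocopier(page: str) -> Tuple[str, str]:
    """Solves COPY_POEM, but gives you no way to produce a poem from nothing."""
    return page, page

def rilke() -> str:
    """Solves WRITE_POEM, but (by stipulation) cannot reproduce an arbitrary page handed to him."""
    return "Wer, wenn ich schriee, hörte mich denn aus der Engel Ordnungen?"
```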
I presume you mean something different by COPY_POEM and WRITE_POEM.
amended. thanks 🙏
There’s no sense in which my computer is doing matrix multiplication but isn’t recognising dogs.
At the level of internal mechanism, the computer is doing neither, it’s just varying transistor voltages.
If you admit a computer can be multiplying matrices, or sorting integers, or scheduling events, etc — then you’ve already appealed to the X-Y Criterion.
I think the best explanation of why ChatGPT responds “Paris” when asked “What’s the capital of France?” is that Paris is the capital of France.
Let’s take LLM Simulator Theory.
We have a particular autoregressive language model μ, and Simulator Theory says that μ is simulating a whole series of simulacra which are consistent with the prompt.
Formally speaking, μ = Σ_s P(s) · μ_s,
where μ_s is the stochastic process corresponding to a simulacrum s, and P(s) is the weight on that simulacrum.
Now, there are two objections to this:
Firstly, is it actually true that μ has this particular structure?
Secondly, even if it were true, why are we warranted in saying that GPT is simulating all these simulacra?
The first objection is a purely technical question, whereas the second is conceptual. In this article, I present a criterion which partially answers the second objection.
Note that the first objection — is it actually true that μ has this particular structure? — is a question about a particular autoregressive language model. You might give one answer for GPT-2 and a different answer for GPT-4.
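As a toy illustration only (two hand-written “simulacra”, made-up numbers, nothing extracted from a real model), the structural claim is just a mixture of stochastic processes, with the prompt reweighting the prior over simulacra:

```python
# Toy illustration of μ = Σ_s P(s) · μ_s with two hand-written simulacra.
# Each simulacrum is a next-token distribution given the prompt so far.

simulacra = {
    "helpful_assistant": {"Paris": 0.9, "France": 0.05, "dunno": 0.05},
    "confused_character": {"Paris": 0.2, "France": 0.3, "dunno": 0.5},
}
prior = {"helpful_assistant": 0.7, "confused_character": 0.3}

def mixture(prior, simulacra):
    """Next-token distribution of the mixture process μ."""
    tokens = {t for dist in simulacra.values() for t in dist}
    return {t: sum(prior[s] * simulacra[s].get(t, 0.0) for s in simulacra) for t in tokens}

def posterior(prior, simulacra, observed_token):
    """Bayesian reweighting of the prior after observing a token (conditioning on the prompt)."""
    unnorm = {s: prior[s] * simulacra[s].get(observed_token, 0.0) for s in simulacra}
    z = sum(unnorm.values())
    return {s: p / z for s, p in unnorm.items()}

print(mixture(prior, simulacra))             # μ's next-token distribution
print(posterior(prior, simulacra, "dunno"))  # observing 'dunno' shifts weight to the confused simulacrum
```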
Yep, the problem is that the internet isn’t written by Humans, so much as written by Humans + The Universe. Therefore, GPT-N isn’t bounded by human capabilities.
We have two ontologies:
Physics vs Computations
State vs Information
Machine vs Algorithm
Dynamics vs Calculation
There’s a bridge connecting these two ontologies called “encoding”, but (as you note) this bridge seems arbitrary and philosophically messy. (I have a suspicion that this problem is mitigated if we consider quantum physics vs quantum computation, but I digress.)
This is why I don’t propose that we think about computational reduction.
Instead, I propose that we think about physical reduction, because (1) it’s less philosophically messy, (2) it’s more relevant, and (3) it’s more general.
We can ignore the “computational” ontology altogether. We don’t need it. We can just think about expending physical resources instead.
If I can physically interact with my phone (running Google Maps) to find my way home, then my phone is a route-finder.
If I can use the desktop-running-Stockfish to win chess, then the desktop-running-Stockfish is a chess winner.
If I can use the bucket and pebbles to count my sheep, then the bucket is a sheep counter.
If I can use ChatGPT to write poetry, then ChatGPT is a poetry writer.
Yep, I broadly agree. But this would also apply to multiplying matrices.
Sure, every abstraction is leaky and if we move to extreme regimes then the abstraction will become leakier and leakier.
Does my desktop multiply matrices? Well, not when it’s in the corona of the sun. And it can’t add 10^200-digit numbers.
So what do we mean when we say “this desktop multiplies two matrices”?
We mean that in the range of normal physical environments (air pressure, room temperature, etc), the physical dynamics of the desktop corresponds to matrix multiplication with respect to some conventional encoding of small matrices into the physical states of the desktop.
By adding similar disclaimers, I can say “this desktop writes music” or “this desktop recognises dogs”.
I think my definition of μ∞ is correct. It’s designed to abstract away all the messy implementation details of the ML architecture and ML training process.
Now, you can easily amend the definition to include a context window k. In fact, if you let k>N then that’s essentially an infinite context window. But it’s unclear what optimal inference is supposed to look like when k=∞. When the context window is infinite (or very large), the internet corpus consists of a single datapoint.