If I understand correctly, the next-token prediction of Mess3 is related to the current-state prediction by a nonsingular linear transformation. So a linear probe showing “the meta-structure of an observer’s belief updates over the hidden states of the generating structure” is equivalent to one showing “the structure of the next-token predictions”, no?
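To spell out the claim with a toy HMM (the matrices below are random placeholders, not the actual Mess3 parameters): the next-token distribution is the belief state pushed through one fixed linear map, so when that map is nonsingular the two probe targets differ only by a change of coordinates.

```python
import numpy as np

# Toy 3-state, 3-token HMM -- random placeholder parameters, NOT Mess3.
# T[k][i, j] = P(next state j, emit token k | current state i).
rng = np.random.default_rng(0)
raw = rng.random((3, 3, 3))
T = raw / raw.sum(axis=(0, 2), keepdims=True)  # per state i, sum over (k, j) = 1

# The observer's belief over hidden states (any distribution works here).
belief = np.array([0.5, 0.3, 0.2])

# Next-token distribution: p(token k) = sum_{i,j} belief[i] * T[k][i, j],
# i.e. belief @ M for the fixed matrix M[i, k] = sum_j T[k][i, j].
M = T.sum(axis=2).T
next_token = belief @ M

# If M is nonsingular (as claimed for Mess3), the belief is recoverable:
recovered_belief = next_token @ np.linalg.inv(M)
```

So a linear probe that reads off one of the two quantities can be composed with M (or its inverse) to read off the other.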
Nisan
The subject of this post appears in the “Did you know...” section of Wikipedia’s front page (archived) right now.
I’m saying “transformers” every time I am tempted to write “LLMs” because many modern LLMs also do image processing, so the term “LLM” is not quite right.
“Transformer” isn’t quite right either, because you can train a transformer on a narrow task. How about “foundation model”: “models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks”?
I agree 100%. It would be interesting to explore how the term “AGI” has evolved, maybe starting with Goertzel and Pennachin 2007 who define it as:
a software program that can solve a variety of complex problems in a variety of different domains, and that controls itself autonomously, with its own thoughts, worries, feelings, strengths, weaknesses and predispositions
On the other hand, Stuart Russell testified that AGI means
machines that match or exceed human capabilities in every relevant dimension
so the experts seem to disagree. (Then again, Russell and Norvig’s textbook cites Goertzel and Pennachin 2007 when mentioning AGI. Confusing.)
In any case, I think it’s right to say that today’s best language models are AGIs for any of these reasons:
They’re not narrow AIs.
They satisfy the important parts of Goertzel and Pennachin’s definition.
The tasks they can perform are not limited to a “bounded” domain.
In fact, GPT-2 is an AGI.
Maybe the right word for this would be corporatism.
I’m surprised to see an application of the Banach fixed-point theorem as an example of something that’s too implicit from the perspective of a computer scientist. After all, real quantities can only be represented in a computer as a sequence of approximations — and that’s exactly what the theorem provides.
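As a concrete sketch of the “sequence of approximations” point, here’s Banach iteration on the contraction cos over [0, 1]; the iterates are exactly the computable approximations the theorem promises:

```python
import math

def banach_fixed_point(f, x0, tol=1e-12, max_iter=1000):
    """Iterate x -> f(x). For a contraction, Banach's theorem guarantees
    the iterates converge geometrically to the unique fixed point."""
    x = x0
    for _ in range(max_iter):
        x_next = f(x)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    return x

# cos maps [0, 1] into itself and |cos'| <= sin(1) < 1 there, so it's a
# contraction; the fixed point is the Dottie number, about 0.739085.
x = banach_fixed_point(math.cos, 0.5)
```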
I would have expected you to use, say, the Brouwer fixed-point theorem instead, because Brouwer fixed points can’t be computed to arbitrary precision in general.
(I come from a mathematical background, fwiw.)
For reference, here’s the Gears of Aging sequence.
This article saved me some time just now. Thanks!
Scaling temperature up by a factor of 4 scales up all the velocities by a factor of 2 [...] slowing down the playback of a video has the effect of increasing the time between collisions [...]
Oh, good point! But hm, scaling up temperature by 4x should increase velocities by 2x and energy transfer per collision by 4x. And it should increase the rate of collisions per time by 2x. So the rate of energy transfer per time should increase 8x. But that violates Newton’s law as well. What am I missing here?
constant volume
Ah, so I’m working at a level of generality that applies to all sorts of dynamical systems, including ones with no well-defined volume. As long as there’s a conserved quantity $E$, we can define the entropy $S(E)$ as the log of the number of states with that value of $E$. This is a univariate function of $E$, and temperature can be defined as the multiplicative inverse of the derivative $\frac{dS}{dE}$.
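A toy instance of that definition, with a made-up, roughly ideal-gas-like microstate count $\Omega(E) \propto E^N$, so $S(E) = N \log E$ up to a constant:

```python
import math

# Assumed-for-illustration microstate count: Omega(E) ~ E**N,
# giving entropy S(E) = N * log(E) (plus an irrelevant constant).
N = 3
S = lambda E: N * math.log(E)

def temperature(S, E, h=1e-6):
    """T = 1 / (dS/dE), via a central-difference numerical derivative."""
    dS_dE = (S(E + h) - S(E - h)) / (2 * h)
    return 1 / dS_dE

T = temperature(S, E=6.0)   # analytically dS/dE = N/E, so T = E/N = 2 here
```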
if the proportionality depends on thermodynamic variables
By “$X$ is proportional to $Y$” I mean $X = kY$ for some constant $k$ that doesn’t vary with time. So it’s incompatible with Newton’s law.
This asymmetry in the temperature dependence would predict that one subsystem will heat faster than the other subsystem cools
Oh, the asymmetric formula relies on the assumption I made that subsystem 2 is so much bigger than subsystem 1 that its temperature doesn’t change appreciably during the cooling process. I wasn’t clear about that, sorry.
Yeah, as Shankar says, this is only for conduction (and maybe convection?). The assumption about transition probabilities is abstractly saying there’s a lot of contact between the subsystems. If two objects contact each other in a small surface area, this post doesn’t apply and you’ll need to model the heat flow with the heat equation. I suppose radiative cooling acts abstractly like a narrow contact region, only allowing photons through.
I am suspicious of this “Lambert’s law”. Suppose the environment is at absolute zero—nothing is moving at all. Then “Lambert’s law” says that the rate of cooling should be infinite: our object should itself instantly drop to absolute zero once placed in an absolute-zero environment. Can that be right?
We’re assuming the environment carries away excess heat instantly. In practice the immediate environment will warm up a bit and the cooling rate will become finite right away.
But in the ideal case, yeah, I think instant cooling makes sense. The environment’s coldness is infinite!
Oh neat! Very interesting. I believe your argument is correct for head-on collisions. What about glancing blows, though?
Assume two rigid, spherical particles with the same mass and radius.
Pick a coordinate system (at rest) where the collision normal vector is aligned with the x-axis.
Then move the coordinate system along the x axis so that the particles have equal and opposite x-velocities. (The y-velocities will be whatever.) In this frame, the elastic collision will negate the x-velocities and leave the y-velocities untouched.
Back in the rest frame, this means that the collision swaps the x-velocities and keeps the y-velocities the same. Thus the energy transfer is half the difference of the squared x-velocities, $\frac{m}{2}(v_{1,x}^2 - v_{2,x}^2)$.
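A quick numerical sanity check of that formula, with made-up velocities and unit mass:

```python
import numpy as np

m = 1.0
v1 = np.array([3.0, 1.0])   # (x, y) velocity of particle 1
v2 = np.array([-1.0, 2.0])  # (x, y) velocity of particle 2

# Collision normal along x: swap the x-velocities, keep the y-velocities.
v1_after = np.array([v2[0], v1[1]])
v2_after = np.array([v1[0], v2[1]])

def ke(v):
    """Kinetic energy of a particle with velocity vector v."""
    return 0.5 * m * (v @ v)

# Energy lost by particle 1 = half the difference of squared x-velocities.
transfer = ke(v1) - ke(v1_after)
predicted = 0.5 * m * (v1[0]**2 - v2[0]**2)
```

The swap also conserves total kinetic energy and momentum, as an elastic collision requires.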
I’m not sure that’s proportional to $T_1 - T_2$? The square of the x-velocity does increase with temperature, but I’m not sure it’s linear. If there’s a big temperature difference, the collisions are ~uniformly distributed on the cold particle’s surface, but not on the hot particle’s surface.
Newton’s law of cooling from first principles
I’d love it if anyone could point me to anywhere this cooling law (proportional to the difference of coldnesses) has been written up.
Also my assumptions about the dynamical system are kinda ad hoc. I’d like to know assumptions I ought to be using.
We can derive Newton’s law of cooling from first principles.
Consider an ergodic discrete-time dynamical system and group the microstates into macrostates according to some observable variable $X$. ($X$ might be the temperature of a subsystem.)
Let’s assume that if $X = x$, then in the next timestep $X$ can be one of the values $x - 1$, $x$, or $x + 1$.
Let’s make the further assumption that the transition probabilities for these three possibilities are in the same ratios as the numbers of microstates in the corresponding macrostates.
Then it turns out that the rate of change $\frac{dX}{dt}$ over time is proportional to $S'(X)$, where $S(X)$ is the entropy, which is the logarithm of the number of microstates.
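Here’s a numerical check of the lemma under an assumed entropy profile with a constant small slope $a$ (so $S'(x) = a$ everywhere); for slowly varying $S$, the expected one-step drift comes out to roughly $\frac{2}{3} S'(x)$, i.e. proportional to $S'(x)$:

```python
import math

def drift(S, x):
    """Expected one-step change E[dX] at X = x, with transitions to
    x-1, x, x+1 weighted by the microstate counts N = exp(S)."""
    n_down, n_stay, n_up = math.exp(S(x - 1)), math.exp(S(x)), math.exp(S(x + 1))
    return (n_up - n_down) / (n_down + n_stay + n_up)

# Made-up entropy profile for illustration: S(x) = a*x with small slope a.
a = 0.01
d = drift(lambda x: a * x, 5.0)   # ~ (2/3) * a, proportional to S'(x)
```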
Now suppose our system consists of two interacting subsystems with energies $E_1$ and $E_2$. Total energy is conserved. How fast will energy flow from one system to the other? By the above lemma, $\frac{dE_1}{dt}$ is proportional to $\beta_1 - \beta_2$.
Here $\beta_1$ and $\beta_2$ are the coldnesses of the subsystems. Coldness $\beta = 1/T$ is the inverse of temperature, and is more fundamental than temperature.
Note that Newton’s law of cooling says that the rate of heat transfer is proportional to $T_2 - T_1$. Since $\beta_1 - \beta_2 = \frac{T_2 - T_1}{T_1 T_2}$, for a narrow temperature range this will approximate our result.
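Numerically (with arbitrary temperatures): the two laws agree up to a “constant” factor $1/(T_1 T_2)$, which is nearly fixed over a narrow range but drifts a lot over a wide one.

```python
def coldness_diff(T1, T2):
    """beta_1 - beta_2, with coldness beta = 1/T."""
    return 1 / T1 - 1 / T2

# Narrow range: the ratio to Newton's (T2 - T1) is nearly constant.
r_narrow_a = coldness_diff(299.0, 301.0) / (301.0 - 299.0)
r_narrow_b = coldness_diff(300.0, 302.0) / (302.0 - 300.0)

# Wide range: the ratio changes a lot, so the two laws come apart.
r_wide = coldness_diff(100.0, 500.0) / (500.0 - 100.0)
```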
Wow, that’s a lot of kale. Do you eat 500g every day? And 500g is the mass of the cooked, strained kale?
What a beautiful illustration of how a Humanist’s worldview differs from a Cousin’s!
I wonder why Gemini used RLHF instead of Direct Preference Optimization (DPO). DPO was written up 6 months ago; it’s simpler and apparently more compute-efficient than RLHF.
Is the Gemini org structure so sclerotic that it couldn’t switch to a more efficient training algorithm partway through a project?
Is DPO inferior to RLHF in some way? Lower quality, less efficient, more sensitive to hyperparameters?
Maybe they did use DPO, even though they claimed it was RLHF in their technical report?
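For context on the “simpler” claim: the DPO objective from the Rafailov et al. paper is just a logistic loss on log-probability margins against a frozen reference model — no reward model, no RL sampling loop. A minimal sketch, with made-up log-probabilities:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * margin), where the margin
    compares the policy's log-prob advantage over the reference model on
    the preferred (w) vs. dispreferred (l) completion."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1 / (1 + math.exp(-beta * margin)))

# Made-up numbers: the policy favors the preferred completion more than
# the reference does, so the loss dips below log(2).
loss = dpo_loss(logp_w=-4.0, logp_l=-7.0, ref_logp_w=-5.0, ref_logp_l=-6.0)
```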
I suppose if you had more hidden states than observables, you could distinguish hidden-state prediction from next-token prediction by the dimension of the fractal.