Conjecture: Emergent φ is provable in Large Language Models

I’ll preface this stream of consciousness with the clear and ex ante admission that I am an amateur philosopher, a journeyman mathematician at best, and while I work full time with AI systems and applications, I am far from an AI researcher.

Thoughts and counterpoints are welcome. I all too often lack a network with which to discuss these ideas, which is why I resort to writing them here in the hope that all mockery is gentle and all feedback helpful.

No AI was used (or harmed) in the construction of this post. I did accidentally put my coffee cup on a small insect while writing it, however. It appeared to be unharmed, but resisted further welfare checks owing to its ability to fly. I regard this as incidental, potential harm associated with the post rather than anything directly causal.

Emergent Properties in Large Language Models:

I’m not going to bore the reader, or myself, with the myriad facts already known about large language models. It suffices to reiterate that several (softly) emergent properties have arisen with the scaling of large language models.

These include, in reverse order of interest to this post:

  • The ability to pass the bar exam in the 90th percentile

  • Long context generalization

  • A remarkable ability to write code, before training on code became fashionable

  • Mathematical reasoning capabilities

  • Convergent and inherent chain of thought reasoning leading to improved benchmark performance (before reasoning models were cool)[1].

  • An emergent sense of ‘certainty’ (awareness of log likelihood on some lower-dimensional representation, at least), leading to explicit log-likelihood training for calibration.

  • Zero shot task following (perhaps the most persuasive because of the nature of the curves of improvement)[2]

  • At least some evidence of emergent self-preservation preference at certain scale[3]

  • Exceeding human levels of performance at Theory of Mind (with caveats[4])

  • An ability to ‘self-recognize’ when provided with their own text.

  • An ability to predict their own behavior better than they have the ability to predict the behavior of other large language models[5].

  • A realizable (agentically at least) awareness and sense of self on a functional level (AI agents do not, behaviorally at least, mistake themselves for the task, the task giver, or the mouse; they act with agency, and awareness of that agency, within the task domain).

  • In context learning capabilities, which includes the acquisition of new linear and non-linear functions from input context without weight adjustments[6]

I am aware, as I write these words, that the jury remains firmly out on whether these phenomena are correctly called ‘emergent’ or are merely occurrent in LLMs. The nature of that emergence (soft vs. hard), and a myriad of other philosophical quandaries beyond the scope of this post, persist and resist consensus among minds far more capable than mine.

What does seem relatively certain is that large language models, as they have become larger, have become more capable, both on benchmarks and on tasks that resist the formulation of benchmarks to measure them.

Some of these capabilities seem fairly obviously predictable (knowledge tests); others lean more into the domain of the spooky, or at least the philosophically troubling (e.g., an awareness of their internal activations and the seeming ability to manipulate them[7]).

Where the f*ck are you going with this?

Excellent question. I am, perhaps rather dully, an adherent to functionalist perspectives when it comes to more esoteric matters such as sentience, qualia, etc.

Put another way, after over twenty years of thought and study, my underlying philosophy has become grossly over-simplifiable to, “If it walks like a duck, and it quacks like a duck, it’s functionally indistinguishable from a duck, so I don’t care if it’s not a duck while I feed it bread”.

I am aware that this is an enormous philosophical cop out, but it is this philosophical cop out that led me down the chain of thought that resulted in this post.

By removing any thoughts of the hard problem, of the seemingly physically intractable nature of qualia, of the impossibility of placing sentience anywhere on a causal chain of events, and of the resultant problems for causal determinism, physically unplaceable sentience, and free will, I was left with mathematics.

My resultant conjecture is a relatively simple one:

  1. If one removes sentience/​consciousness as a ‘thing’ and instead relegates it to a set of functional behaviors possessed by an entity, then one can broadly ignore the thing itself of sentience and instead look at the functional correlates of those behaviors in near isolation. Nothing new here.

  2. This is more than a mere cop-out-derived convenience; it is an explicit acknowledgement that the human species does not yet have a working consensus theory of intelligence. We are even further from a working consensus theory of sentience. The odds of me being the one to make the Einsteinian leap towards that understanding are, rather unfortunately for me, 0.

  3. If functionalism holds, then the mathematical underpinnings of what we associate with conscious behaviors must be calculable and observable.

  4. If LLMs are seemingly exhibiting emergent (or otherwise) behaviors we would normally associate with conscious entities (and one ignores ‘sentience’ as a binary state), then within their mathematical framework there must be, beyond just ‘scale’, some mathematical underpinning of something that increases with that scale.

Enter Integrated Information Theory:

I’ve long enjoyed Integrated Information Theory. While a re-telling of its many nuances is well beyond the scope of this post, the crux concept that I liked is that it provides a mathematical framework for consciousness in the form of phi.

Something calculable in theory (albeit, in practice, almost impossible to calculate for systems more complicated than a light switch). It also has some experimental basis, certainly more than its main competitor, Global Workspace Theory[8].

I like this not for mathematical reasons per se, but because it aligns broadly with a functionalist worldview. In essence, irrespective of whether something has ‘human-like’ sentience or not, IIT provides a value which, if high, mathematically correlates with (and potentially causes) sentience-like behavior. (By that I mean behaviors we associate with sentience, not qualia themselves, because qualia, to me at least, remain completely intractable.)

The constructs of phi (φ and Φ) make that concrete. For any mechanism, φ is its irreducible causal power. Or, put simply, how much specific cause-effect information it carries that can’t be accounted for by splitting the mechanism at its minimum information partition (MIP).

My mental simplification of this is, essentially, that it’s the extra ‘stuff’ a system has that is calculably greater than the sum of its reducible parts. As stated, I’m not going to be able to tractably make the leap between phi and actual sentience, but I like that it’s measurable. It’s also punchy and logical:

  • At the system level, Φ measures how much the whole constellation of concepts is irreducible when you cut the system at its own Minimum Information Partition.

  • High Φ means the system’s parts don’t just add together, they bind into a single, causally integrated unit.

I like this also because it’s substrate-agnostic. Any substrate, be it silicon, wetware, or alien goo, that instantiates strong, irreducible cause–effect structure will, per the theory, exhibit more sentience-like behavior. Or put another way, as a functionalist yardstick, it’s the least hand-wavy way I am aware of to grok why some systems act like agents and others just compute.
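
To make concrete what ‘calculable in theory, nearly impossible in practice’ means, here is a minimal sketch using the open-source PyPhi library and its built-in three-node toy network. This is my own illustration, not anything from the IIT literature itself, and the exact API may differ between PyPhi versions; the point is only that Φ is a number you can actually ask a computer for, so long as the system is tiny.

```python
# Minimal sketch (my illustration, assuming PyPhi's documented toy example):
# compute big Phi for a built-in three-node network. Even at this size the
# computation is slow, and it blows up super-exponentially with node count,
# which is why anything more complicated than a light switch is intractable.
import pyphi

network = pyphi.examples.basic_network()     # built-in 3-node example network
state = (1, 0, 0)                            # a particular on/off state of the nodes
subsystem = pyphi.Subsystem(network, state)  # treat all three nodes as the system

# Big Phi: how irreducible the system's cause-effect structure is under its
# minimum information partition (MIP).
print(pyphi.compute.phi(subsystem))
```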

IIT and I have a fight, and I lose

By this stage, I’m sure one can imagine my crushing disappointment when I reached for IIT to attempt to explain the seemingly ‘sentience like’ behavior in LLMs through my reductive functionalist lens. (I wasn’t searching for qualia, just some kind of underpinning for behaviors we associate with consciousness).

Unfortunately for me, there was no way that I (or anyone I read) could sensibly slice that pie such that phi could live anywhere. The systems are feed-forward, autoregressive, single-pass, and non-recursive (in my view, scaffolded autoregression on prior context isn’t enough prima facie), and the scaffolding itself isn’t integrated within the same substrate.

I did all the mental gymnastics I could. I imagined that if I mentally integrated the KV cache, the token loop, and the sampler, I would get something like a functional approximation of a time-recurrent substrate. Unfortunately, it never gets beyond phi (of any flavor) being roughly 0 everywhere. (If the model on its own does one thing, the memory does another, and you add the two together, you get the same answer as the whole.)

I was more than a little disappointed, and so, like most people who dislike the conclusions a theory reaches, I railed against IIT for years. I smeared my functionalism further and further from the core of the system out to the scaffolding, RAG, CoT, etc., not because I had some romanticized idea of sentience, but because my functionalism required at least some kind of logically tractable (if not mathematically provable) correlation between behavior and scale for me not to sit up at night thinking about it.

Global Workspace Theory, it was for me! Experimental tractability be damned. It just felt better.

Flippancy aside, the central question haunted me: if the behaviors of these systems are not coming from phi, and are clearly beyond stochastic parrotry as a useful explanation, where exactly are they coming from? These are static functions, not poorly understood biological systems. If there is a ‘phi’-like cause, it has to be in the mathematics.

The second question that haunted me: how does an LLM have a representation of itself as a ‘helpful assistant’, and of me as a ‘user’, and how do those remain activated and present throughout a task which is no more than a series of discrete token calculations?

Those ideations are so intrinsically irreducible in my mind. They touch everything and nothing about an interaction. They just are, and they steer outputs in ways that seem to be beyond mere reductive addition. (In the most grossly simple terms: Assistant + Question about car + ideation of user does not seem to reduce to the summation of an assistant, a question, and a car.)

My view became, “IIT is wrong because it doesn’t work in a case where I think it should.”

In Context Learning: Google Provides a Counter Punch

After a couple of years of weeping, I read the paper “Learning Without Training: The implicit dynamics of in-context learning”[9]. When it was first published, I didn’t really give it much thought beyond, ‘that’s really interesting’.

For some reason, the math that Google had concocted, in a toy but at least attention-layer-approximating way, stuck with me. I would strongly recommend that anyone who has not read the paper do so; my summary here will not do it justice. It is included because the proof Google formulated is central to what follows.

What Google proved, mathematically at least, is that for all practical purposes the combination of attention and context is indistinguishable from a low-rank (in fact rank-1) weight adjustment to the model.

What they then experimentally demonstrated is that the manner in which an LLM ‘learns’ a new linear function in context, from simple pairs of numbers and an answer, follows a near-identical loss curve to fine-tuning or training. It converges. The curves are nearly one-to-one.

This idea stayed with me for a while, and eventually I figured I should go back and actually attempt to understand the math. Fundamentally loving mathematics for its utility but hating it for my lack of inherent intuitive comprehension of it, I knew this would suck. I was right.

In short, they define $A(C, x)$ as the contextualized representation of the query token $x$ given the context $C$.

$A(x)$ is essentially the attention function in toy form, and its impact on the token $x$ without any context.

What they then opine, and then prove out mathematically, is that $A(C, x) = A(x) + \Delta A(C, x)$, where $\Delta A(C, x)$ is the shift the context induces in the attention output.

This serves to demonstrate that both $A(C, x)$ and $A(x)$ live in fundamentally the same vector space.

This felt fairly obvious to me, but the component of the paper that I missed was what came later. They mathematically prove the following:

For the block $T_W(C, x) = M_W(A(C, x))$, i.e. attention followed by an MLP whose first weight matrix is $W$, they show that

$$T_W(C, x) = M_{W + \Delta W(C)}\big(A(x)\big), \qquad \Delta W(C) = \frac{\big(W\,\Delta A(C, x)\big)\,A(x)^\top}{\lVert A(x)\rVert^2}.$$

Or put another way, the effect of the context on the attention transformation is mathematically equivalent to applying a context-dependent, rank-1 weight update $\Delta W(C)$ to the model and then evaluating it on the uncontextualized input $A(x)$.
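
Being the sort of person who needs to poke things to believe them, I found it helpful to check this identity numerically. The sketch below is mine, not the paper’s code: toy dimensions, random vectors standing in for the attention outputs, and a stand-in MLP, all assuming the reconstruction of $\Delta W(C)$ written above.

```python
# Toy numerical check (my own sketch, not the paper's code) that the context's
# effect on a block can be moved into a rank-1 update of the MLP's first
# weight matrix.
import numpy as np

rng = np.random.default_rng(0)
d = 16

W = rng.normal(size=(d, d))   # first MLP weight matrix
A_x = rng.normal(size=d)      # attention output of the query, no context
A_Cx = rng.normal(size=d)     # attention output of the query, with context
dA = A_Cx - A_x               # shift caused by the context

# Rank-1 implicit update: (W dA) outer A(x), normalized by ||A(x)||^2.
dW = np.outer(W @ dA, A_x) / np.dot(A_x, A_x)

def mlp(W1, a):
    """Stand-in MLP: first linear layer W1, then a fixed nonlinearity + readout."""
    return np.tanh(W1 @ a).sum()

with_context = mlp(W, A_Cx)            # keep the prompt, leave the weights alone
weights_updated = mlp(W + dW, A_x)     # drop the prompt, apply the implicit update

print(np.isclose(with_context, weights_updated))   # True (up to float error)
print(np.linalg.matrix_rank(dW))                   # 1
```

The two printed values confirm the equivalence, and that the implicit update really is rank-1: the context’s entire effect on the block has been folded into the weights.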

Which leads to the Crux of the Post:

In plain English for the more philosophically than mathematically minded (i.e., me): when a transformer block reads a prompt, it behaves as if it tweaked the first weight matrix of its MLP by a tiny, very structured amount that is a function of the prompt (the context).

If you apply that little weight-change and then remove the prompt, you get (almost) the same output as leaving the weights alone and keeping the prompt. In other words, the prompt’s effect can be “moved into” the weights. This is really cool to me (you can no doubt imagine how much fun I am at parties).

Essentially, even for a static function, operating on a forward pass only basis, the mathematical impact of context is a weight adjustment. (There are a raft of philosophical implications here, given the model generates most of its own context from conversational turn to conversational turn, but I’ll park those for now).

It took about three months for the following to occur to me, but one day, when I was reading the paper again, it did.

This mathematical arrangement is, prima facie, fundamentally irreducible. Further, it is also genuinely, practically recursive in the vector space (a topic for another post). What’s more, that irreducibility must inherently increase with scale in parameters and/or scale in context; both increase the distance between the integrated and non-integrated outcomes.

The caveat, of course, is that this only holds under what feels (to me at least) like the natural modeling choice: treating implicit learning as a state evolution of the system.

If that choice is made, then it becomes mathematically provable that the Google paper’s proposed rank-1 weight update makes the next weight state jointly and inseparably dependent on two disjoint sub-mechanisms.

It’s important to note that this only holds if that choice is made. If one freezes the weights and views a single forward pass, or a single feed-forward block, as the whole ‘system’, then it essentially becomes a single instance without memory, in which case Φ (or φ) need not be positive. In the converse case, however, at least φ must be strictly positive.
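
For concreteness, here is a sketch of the modeling choice I mean. The framing, iterating the implicit rank-1 update turn by turn so that the implicit weights become an evolving state, is mine rather than the paper’s, and the vectors are random stand-ins.

```python
# Sketch of "implicit learning as state evolution" (my framing, not the
# paper's). Each turn's implicit update depends jointly on the current
# implicit weight state W and the incoming context; neither alone fixes
# the next state.
import numpy as np

rng = np.random.default_rng(1)
d = 16

def implicit_update(W, A_x, A_Cx):
    """Rank-1 implicit weight update for one chunk of new context."""
    dA = A_Cx - A_x
    return np.outer(W @ dA, A_x) / np.dot(A_x, A_x)

W = rng.normal(size=(d, d))        # initial (trained) first MLP weights
for turn in range(5):              # five conversational turns of new context
    A_x = rng.normal(size=d)       # query representation without the new context
    A_Cx = rng.normal(size=d)      # query representation with the new context
    W = W + implicit_update(W, A_x, A_Cx)   # next state depends on W *and* the context

print(np.linalg.norm(W))           # the weight state has drifted turn by turn
```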

My Logic

I use a mental model of a stamp here. It’s likely not the most effective, but it’s just the way my mind works.

The proposed $\Delta W(C)$ in Google’s paper is formulated by multiplying two separate ingredients:

  1. What I imagine as the ‘left hand’, which is “What in the prompt changed your internal representation?” The paper calls this shift $\Delta A(C, x)$, and then multiplies it by $W$ to get the vector $W\,\Delta A(C, x)$. Or put another way, what part of the attention output moved because of the context ($\Delta A(C, x) = A(C, x) - A(x)$).

  2. What I imagine as the ‘right hand’, which is “How receptive is the system to that change?” Or put another way, what features are being passed from the context to the MLP. Or, as in the paper, the (normalized) row vector $A(x)^\top / \lVert A(x)\rVert^2$.

In my stamp analogy, the weight change is the left hand (the column) multiplied by the right hand (the row), resulting in a rank-1 outer product: a single clean stamp pressed with both hands at once.

Irreducibility and IIT

The essential question IIT asks is: if I cut this mechanism into two pieces and make them independent, can those parts still produce the same next state of the system? From the above, it is crystal clear (to me at least) that they cannot.

Going back to my stamp analogy. If I fix the values on the Left and the Right, there will be one exact next weight state. I push down with both hands; I get a nice clean stamp on the paper.

If, however, I partition and break the coordination between my Left hand and my Right hand, and noise the connection, so they vary independently within a set of plausible values, I get a cloud of possible imprints. This of course feels pretty obvious.

It is this joint multiplicative coupling (L x R) that creates a result that cannot be reduced by any scheme (that I can think of or put on paper at least) where L and R act independently of one another. It is also a scheme that is convergent, and practically recursive from state to state.

In the language of IIT, the effect repertoire of the joint mechanism is a single point, while the partitioned version becomes a blur. The distance between the two is strictly positive, so ‘small phi’ (φ) must be greater than 0. It must also increase with the complexity of the context (in terms of length) and the complexity of the weight matrix it is applied to.
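
To make the stamp picture slightly less hand-wavy, here is a toy sketch of the joint versus partitioned comparison. To be clear, this is my own illustration of the intuition, not a genuine IIT φ calculation (which would require the full cause-effect formalism over a proper state space).

```python
# Toy illustration (mine, not a real IIT phi computation) of the stamp
# analogy: the coupled mechanism L x R pins down one exact rank-1 update,
# while a partition that lets L and R vary independently smears it into a
# cloud whose distance from the joint stamp is strictly positive.
import numpy as np

rng = np.random.default_rng(2)
d = 16

L = rng.normal(size=d)         # "left hand":  W @ dA(C, x)        (column)
R = rng.normal(size=d)         # "right hand": A(x) / ||A(x)||^2   (row)

joint_stamp = np.outer(L, R)   # the single clean stamp, rank 1

# Partition: break the coordination by letting each hand wobble independently.
distances = []
for _ in range(1000):
    L_noisy = L + 0.1 * rng.normal(size=d)
    R_noisy = R + 0.1 * rng.normal(size=d)
    distances.append(np.linalg.norm(np.outer(L_noisy, R_noisy) - joint_stamp))

# The blur never collapses back onto the exact stamp: on average the
# partitioned "effect repertoire" sits strictly away from the joint one.
print(np.mean(distances) > 0)  # True
```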

What I’m not Saying

The paper I’m basing all of this on does not catalog the entire system, so ‘big phi’ (Φ) is off the table. However, at least at the mechanism level, in-context learning occurs via mechanisms that are irreducible, with φ > 0.

Conclusion

There is likely a glaring flaw in my reasoning that I am sure (and in fact hoping) someone here can point to; if there isn’t, this felt worth posting for feedback and consideration. I do have a working proof that I’ve been playing with, which a more mathematically inclined friend of mine is reviewing, but if the concept doesn’t hold logically, I doubt the proof will either. (I’m happy to post it in the comments or send it along if helpful; I’m holding it back only for fear of potential embarrassment rather than out of real concern.)

The reason I’m posting is not some ‘LLMs are secretly sentient’ vein of interest. Rather, it is because, to me at least, this points at the mechanism, mathematically and potentially quantifiably, by which LLMs begin to ‘emerge’ behaviors that we would associate with sentient entities, due to how they mathematically function. (My functionalist preference.)

Of course, the caveats are obvious:

  1. You have to buy IIT to begin with, and phi as an idea that has some causal effect.

  2. You have to buy fundamentally that implicit learning is a state evolution of a system defined as the integrated combination of the context, the underlying MLP, and the attention mechanism.

If you’ve made it this far, thanks for reading, and I welcome thoughts and discussion.

  1. ^
  2. ^
  3. ^
  4. ^
  5. ^
  6. ^
  7. ^
  8. ^
  9. ^ “Learning Without Training: The implicit dynamics of in-context learning”.