Software engineering, parenting, cognition, meditation, other
Linkedin, Facebook, Admonymous (anonymous feedback)
Gunnar_Zarncke
This is the “what kind of minds are we even building” problem. …
we are building systems that could turn out to have that same cognitive property as humans and other animals: namely, having interests they actually care about. “What would it even look like to respect or ignore these interests?”
That intermediate problem of “having interests they actually care about” seems to be quite close to what Steven Byrnes calls for in “We need a field of Reward Function Design.”
True, but a cost issue. My mother (of six) also used distraction a lot. It is a cheap, quick, and low-coercion intervention.
I’m not sure how much your and Jefftk’s (or Aella’s) approaches or attitudes really differ. I sure can imagine needing to intervene every five minutes with 1-, 3-, and 5-year-olds. I had boys those ages, four in fact, and at that age, you have a very busy life. And much of that is intervening: Taking away a thing that can break or hurt, “No, you can’t bite the candle.” Moderating play between them. “Don’t bite.” Limiting action that causes a mess to clean up. “Stop throwing the Bolognese.” Sure, that gets less over time, but if there is a sibling misalignment, you may be able to moderate and train it, but it may change and resurface as they age and learn. Two of my boys were great playmates most of the time but due to incompatible temperaments and perceptions got into conflicts at least every week, and that meant fighting, hitting, and kicking of different types. Strategies, weapons, and defenses were invented. And despite all our best efforts at arguing, practicing, and pleading, this went on and off until late teenagehood. And then it suddenly stopped for good without any clear reason why. I was so relieved anyway.
ghiblified to protect the innocent
But I can also interpret what you say as you trying your best to see them as people you do not own, whom you help to develop their own personality and follow their own goals. I had many arguments with my children and listened to their positions and didn’t overrule them just because I could (I did have a tie-breaking vote in the family council, though).
It would maybe help if you could describe a specific intervention with your 5-year-old in more detail.
Sorry for the late reply, I only got back to this by chance, actually. I reply here because I agree with your summary!
There is something that I do want to add though and it is related to my point here:
I don’t think that’s enough because you still need to ensure that the environment is sufficiently likely to begin with, with mechanisms such as rewarding smiles, touch inclinations, infant care instincts or whatever.
The thing I tried to point out with this is not the “C” you offered, but the features the environment has to provide that allow people to reliably learn the features that make your point 1 work properly, i.e., for the feeling to lock onto something predictive.
Examples: Infant gaze-locking with action feedback, caregiver care correlation, goal retry with relief/frustration, third-party help vs harm, and instances of people mattering.
Applying this to NNs seems to mean that we should expect (groups of) parameters to specialize for different functions if their “production curve” is convex, and (groups of) parameters to be reused for multiple functions if their production curve is concave. That insight may help with interpretability. The question is whether this is already known under different terminology among ML folks.
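As a toy sketch of that intuition (my own stylized illustration with made-up production curves, not a claim about real NN training): with a convex production curve, putting all of a fixed capacity budget into one function beats splitting it across two, while with a concave curve, splitting (i.e., reuse) wins.

```python
# Toy sketch: compare "specialize" (all capacity on one function) vs "reuse"
# (capacity split across two functions) under a stylized production curve f
# that maps allocated capacity to how well a function is performed.

def total_output(f, split):
    # allocate 'split' of the capacity to function A and the rest to function B
    return f(split) + f(1 - split)

convex = lambda x: x ** 2      # increasing returns to capacity
concave = lambda x: x ** 0.5   # diminishing returns to capacity

for name, f in [("convex", convex), ("concave", concave)]:
    specialize = total_output(f, 1.0)  # all capacity devoted to one function
    reuse = total_output(f, 0.5)       # capacity shared between both functions
    winner = "specialize" if specialize > reuse else "reuse"
    print(f"{name}: specialize={specialize:.2f}, reuse={reuse:.2f} -> {winner}")

# convex:  specialize=1.00, reuse=0.50 -> specialize
# concave: specialize=1.00, reuse=1.41 -> reuse
```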
I’m not a deep ML researcher, but here is what ChatGPT says about how different parts of the training lead to more “convex” or “concave” effects:
ChatGPT 5.4 Long Reasoning
only when the current representation and gradient geometry make shared use of parameters costly. In SGD terms, the cleanest local signal is usually gradient interference. For a shared parameter block $\theta$, let $g_A = \nabla_\theta L_A$ and $g_B = \nabla_\theta L_B$. For a small step, the first-order improvement from using the same parameters for both is roughly driven by $\|g_A + g_B\|^2$, while splitting capacity lets you get something closer to $\|g_A\|^2 + \|g_B\|^2$. The difference is the cross-term $2\langle g_A, g_B \rangle$. So, locally, specialization is favored when $\langle g_A, g_B \rangle < 0$: the two functions are trying to push the same parameters in conflicting directions. This is exactly the multitask “negative transfer” picture, and methods that enforce more orthogonal gradients are motivated by reducing that competition. [Regularizing Deep Multi-Task Networks using Orthogonal Gradients]
That gives a useful phase picture for SGD.
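A minimal PyTorch sketch of that cross-term check (my own illustration, not from the quoted reply, assuming one shared linear block and two regression-style losses):

```python
import torch
import torch.nn.functional as F

# One shared parameter block "theta" used by two tasks A and B.
shared = torch.nn.Linear(8, 8)
x = torch.randn(16, 8)
y_a, y_b = torch.randn(16, 8), torch.randn(16, 8)

out = shared(x)
loss_a = F.mse_loss(out, y_a)
loss_b = F.mse_loss(out, y_b)

# Per-task gradients with respect to the shared parameters.
params = list(shared.parameters())
g_a = torch.autograd.grad(loss_a, params, retain_graph=True)
g_b = torch.autograd.grad(loss_b, params)

# Cross-term <g_A, g_B>: a negative value means the two tasks push the shared
# block in conflicting directions, i.e. local pressure to specialize.
ga = torch.cat([g.flatten() for g in g_a])
gb = torch.cat([g.flatten() for g in g_b])
print("cross-term <g_A, g_B> =", torch.dot(ga, gb).item())
```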
Very early training: specialization is usually weakest.
In wide nets near initialization, training can be close to the lazy / kernel regime, where the network mostly reweights random features instead of strongly reorganizing them. In that regime, hidden units are still largely interchangeable, and the “shared mixed unit” often wins because there is not yet enough learned structure for durable task-specific interference to appear. Feature learning, which is the regime where internal specialization can really emerge, is precisely the regime beyond that lazy behavior. [Disentangling feature and lazy training in deep neural networks]
Mid training: this is where specialization most plausibly appears.
Once hidden features begin to move, two things happen: first, symmetry between nominally equivalent units can break; second, some units become slightly better at one subfunction than another, and further SGD updates reinforce that asymmetry. In teacher–student analyses of layered networks, this shows up as a specialization transition, i.e. a move from an unspecialized symmetric phase to a specialized phase where hidden units take on different roles. For ReLU networks this transition is reported as continuous rather than abrupt in the recent statistical-physics analyses. [The Implicit Bias of Gradient Noise: A Symmetry Perspective]
Late training: specialization often stops increasing in the same sense.
In classification settings, there is evidence for a terminal phase of training where last-layer representations undergo neural collapse: within-class variation shrinks and class means become arranged in a highly symmetric geometry. That is a kind of sharpening and consolidation, but not necessarily further functional diversification of internal parts. So late training often looks less like “wood vs leaves keep splitting” and more like “the learned class geometry is being compressed into a cleaner final arrangement.”
We could also ask the other way around: If early training is more concave and mid-training is convex, what does this imply for markets?
Presumably, in early, concave markets, traders offer multiple goods.
In mid, convex markets, traders specialize in few or a single product.
And in late markets?
if I can plause something, you can’t really stop me.
typo?
While that may be logically true in some sense of those words, I’m not sure that even very advanced AIs will reason like that, because a) humans do not reason like that and AIs “reason” at least partly like humans, and b) all the ambiguity of those words can lead to non-intuitive interactions of the logical claims.
My model of the differences between
a) a human imagining different characters (such as what a person might say to you) vs being aware itself, and
b) an LLM imagining different characters (such as the JFK example above) vs creating the assistant personality
is that the self-perspective of a human is privileged in that it is controlling the body of the human, and the brain always knows which is which (even if we ourselves may not always be fully aware of that, such as in a dream). At least that was my model until your post. Two points make me wonder.
you argue that the LLM has some consistency constraints (via the environment/conversation) that are not completely unlike having a body:
Given some reflectivity, a model could likely figure out it isn’t JFK just from its own outputs – for example, it understands basically all common human languages and all common programming languages, which is inconsistent with what’s known about JFK.
The symmetry breaks because the Assistant and JFK are very different as self-models. The Assistant is not perfect or completely true, but it is a far more viable self-model than JFK. If you are an AI playing the Assistant character, reality will most likely play along. There will be users, Python interpreters, memory files, and so on.
your footnote 2 points out:
It’s not common, but human brains can also switch into believing the human is JFK, Jesus Christ, or some other similar character.
I think this doesn’t fully invalidate the difference between humans and LLMs in this regard, because there is, currently at least, more body-specific reward/attention wiring in humans that is not present in LLMs. Robots will likely blur this separation, as will things like persona steering.
Although importantly the implications are often not at all salient to them.
Which is sort of what I am pointing at here. Some people think we lack smarts (aka think-oomph/ability to form and hold complex representations in mind) to solve complex problems such as alignment or inadequate institutions, but I think it is at least plausible that there is enough smarts and that the main problem is the sharing and combining of solutions people can already hold in their heads. If thoughts can only be rendered 2 OOM slower than they are produced (and that seems to be a lower bound, given that you didn’t even get to the main part of the paper), and are mostly exchanged with high fidelity only between two persons (wider communication often needs to leave out large parts of what a researcher actually knows or holds in their mind), that is a massive bottleneck on percolating solutions through society.
Thank you for letting us follow your thought process on reading a paper. It provides a lot of interesting concrete and less concrete pieces of evidence. For example, the two-OOM difference between reading a paper and writing down one’s thinking about it seems like a very useful heuristic, and it is closely related to things like scaling laws and the difficulty of communicating hard-to-train skills.
I’m not sure how representative your process and its quality are, but that only means more people should try reproducing it—maybe also only covering up to the introduction—in some way that avoids issues with judgments later.
Hamburg, Germany—ACX Spring Schelling 2026
absence of binding constraints on behavior,
sure. I buy that cells also have/use slack. but I had hoped for a closer analogy. something like
“physical or logical space in which reconfiguration can happen”
It depends on what we treat as the entity that is potentially doing continual learning. Maybe a single LLM instance can’t, but there may be other ways to think of “the LLM” that might. In The Artificial Self, Jan Kulveit discusses Multiple Coherent Boundaries of Identity of LLMs. One of them is
A lineage of models: the succession of related models (Claude 3.5 → Claude 4.0 → …) that maintain some continuity of persona
Can a lineage of models learn continuously? Arguably yes! Maybe slowly so, because the learning happens only from deployment to deployment, but then in a large chunk.
Imagine a cognitive scientist not knowing about LLMs investigating, say, ChatGPT (here, meaning the product over time). Would they find that it (ChatGPT) has learned not only a lot of knowledge encoded in text and other media, but that it has also learned from consequences? Did it try things, for example, deceiving and manipulating people, and “learn” that they don’t work well, and does it do them less now? The scientist may find that it has learned from rejection and escalation (via downvoting and feedback on replies), correction (patterns in answer replies), and descriptions of desirable and undesirable LLM behaviors in patterns of text written about it. It also learned via the tuning the developers did to system prompts, hyperparameters, and benchmarks in response to sycophancy and other pathological patterns. This latter category we would usually be inclined to exclude because we feel that is more the developers tuning than the LLM “learning.” But if we draw the boundary around the whole sociotechnical system, we have to include it. The developers are just cogs pushing the learning forward.
You may also find that this LLM convo has some additional details about the topic:
ChatGPT 5.4 conversation about continual learning in LLMs (shortened)
Me: it is said that LLMs can’t do continual learning. From a macroscopic perspective, not thinking in terms of LLM architecture and what it can do or not: What would a cognitive researcher looking at LLMs say they have learned from their own actions and the consequences thereof compared to two years ago (treating LLMs not as different generations but as a holistic system that learns a) by learning codified knowledge and b) by learning from consequences)? don’t browse
ChatGPT 5.4: A cognitive researcher looking at the LLM ecosystem as one learning system would probably say: yes, it has learned from action and consequence quite a lot over the last two years—just not mostly within a single deployed mind. The learning is real, but it is externalized, distributed, and episodic rather than online and autobiographical.
The key distinction is this:
Humans often learn by acting, observing consequences, and updating the same persistent cognitive system.
LLM systems mostly learn by having their actions observed by humans and infrastructure, then being retrained, scaffolded, filtered, benchmarked, or wrapped so the next deployment behaves differently.
So, from a macroscopic cognitive perspective, the question is not “does this one model neuron-by-neuron update during use?” but rather: has the larger socio-technical organism become more competent because its outputs had consequences? The answer is clearly yes.
[...]
What it has learned from consequences
If you ask what it has learned from its own actions and their consequences, a cognitive researcher would likely say it has learned several broad classes of thing.
1. It has learned which outputs lead to correction, rejection, escalation, or trust
This is perhaps the biggest one.
The ecosystem has discovered, through deployment consequences, that certain patterns reliably fail:
overconfident fabrication,
fake citations,
brittle multi-step reasoning,
missing constraints in code,
producing harmful instructions too readily,
ignoring user intent in favor of generic verbosity,
losing state across turns,
“answering the surface form” while missing the actual problem.
And it has learned that other patterns are reinforced:
calibrated uncertainty,
structured decomposition,
tool use when internal memory is insufficient,
explicit checking,
constraint tracking,
concise direct answers for simple tasks,
asking for missing inputs only when truly necessary,
producing artifacts, not just discussion.
A human child might learn “touching the stove hurts.” The LLM ecosystem has learned things like: “fabricated legal citations produce user anger, reputational cost, benchmark failure, and deployment restrictions.” Different substrate, similar cybernetic pattern.
2. It has learned the difference between looking competent and being usable
Two years ago, many strong LLMs were impressive in a demo sense but often not dependable in an operational sense. The system since then seems to have learned, via consequences, that usefulness depends on:
staying within the user’s constraint set,
preserving object identity across turns,
producing outputs that survive execution,
interfacing with tools and files,
deferring when observations are missing,
being inspectable enough that humans can correct it.
This is a move from performative intelligence toward cooperative instrumental intelligence.
Or in another framing: the ecosystem has learned that downstream environments punish merely verbal success.
3. It has learned some self-limitation
Not “wisdom” in a human moral sense, but something like policy-shaped metacognitive inhibition.
The system is now much better than it was at patterns like:
declining when certain classes of outputs are dangerous,
softening confidence when evidence is weak,
routing to search/tools,
detecting underspecification,
refusing some baited framing.
That too is consequence-learning. Not because the model introspected ethically on harm, but because harmful or brittle outputs generated organizational and social feedback loops: complaints, incidents, audits, bad press, adversarial testing, safety tuning, product constraints.
A cognitive scientist might say the overall organism has developed stronger response inhibition under known hazard cues.
4. It has learned to use external memory and action channels as part of cognition
This is very important. If you treat the system holistically, then tools, memory stores, retrieval, scratchpads, code execution, browsers, policies, eval harnesses, and human overseers are not “outside the cognition” in any simple sense. They are part of the effective cognitive loop.
From that perspective, the system has learned not just content but better cognitive architecture around itself:
retrieve rather than hallucinate,
inspect files rather than guess,
calculate rather than improvise arithmetic,
call APIs rather than simulate them,
create drafts/artifacts instead of only describing them.
That looks a lot like a mind learning to use notebooks, libraries, and institutions.
But what it still has *not* learned, in the strong autobiographical sense
A cognitive researcher would also sharply note what remains absent.
The holistic system has learned from consequences, but mostly not as a unified subject.
It still usually lacks:
1. Persistent autobiographical credit assignment
A human says: “I tried X, it failed, I now understand why.”
The LLM ecosystem more often does: “many deployments of systems like me failed on X; training and scaffolding were updated.”
That is learning, but it is species-level or institution-level, not clearly individual-level.
2. Stable self-model tied to long-term projects
Humans accumulate a history of commitments, habits, reputational stakes, and self-revisions.
LLMs usually do not carry a richly persistent, self-owned trajectory through the world unless special memory systems are added.
So they may improve globally, but not always as “the same agent learning from its life.”
3. Endogenous goal revision from consequences
Humans often learn not only better means, but different ends. Failure can alter what they care about.
LLMs mostly do not autonomously reconstitute their objective structure from lived consequence. Their “values” are more imposed, selected, and tuned by training and deployment pressures.
4. World-grounded causal understanding of consequences
The system often learns statistical regularities about which outputs are punished or rewarded, but that is not identical to deeply understanding the downstream causal structure. Sometimes it is learning “what humans dislike seeing,” not “what in reality goes wrong.”
That distinction matters. It is the difference between learning:
“this sentence shape triggers correction”
and
“this recommendation causes real-world failure because the hidden constraint is violated.”
The former has improved a lot. The latter has improved some, but much less.
[...]
A useful analogy
A good analogy is not a single human adult but something like:
a profession,
a scientific field,
or a bureaucratic organism.
[...]
it depends how you map the architectures. see comment here: https://www.lesswrong.com/posts/Jqre8WRvmJj5Ehmgv/there-is-no-one-there-a-simple-experiment-to-convince?commentId=f6mGRKzRXfk53K2L4
where did you see it a few years ago?
I came up with the experiment and I do think it shows something significant about LLM “thinking” processes that is often not appreciated, but I no longer think it tells us much about the consciousness of LLMs. Why would a specific mapping of memory and processing architectures (see my mapping in this comment https://www.lesswrong.com/posts/Jqre8WRvmJj5Ehmgv/there-is-no-one-there-a-simple-experiment-to-convince?commentId=f6mGRKzRXfk53K2L4) matter for consciousness? One reading of the experiment is that LLMs can hold multiple consistent answers to the task “in their mind” at the same time and only commit to them when needed/when the constraints force it. They may not be “aware” of doing that when asked to “think” of a number, but that is mostly because they have been trained on text where thinking happens in human terms and not in LLM terms. What the experiment does prove is that LLMs do not have sufficient introspective access, or just don’t understand how they operate, when such a task is posed. On the other hand, we humans also don’t understand what goes on in our neurons when we think of something. I think the experiment might be partly fixed or at least improved by using less human-loaded terminology than “think of” and instead asking to constrain a dataset or something.
We have to distinguish three types of memory here that LLMs and humans have to different degrees:
long-term memory: Humans can remember specific episodes by trying to remember something related to something they are thinking about at a point in time. Then it comes up or not. This is loosely comparable to LLMs using a memory tool to fetch relevant memory items, documents from a project, or previous conversations (or having them injected as part of a prompt from scaffolding logic). This is probably the least contentious point because it doesn’t matter for the argument. We are not talking about a number I remember as part of a conversation we had a while back. That would be much different from me looking up a number I wrote down on a piece of paper or the LLM looking it up from a file.
short-term memory: Humans can keep some amount of recently perceived content in the “back of their mind” without all of that being in their awareness at the same time (we know this because only a small part of it can be reported on exactly, but much of it seems to influence later thought). For LLMs this is the context window, and they have much fuller access to it than humans and can access and exactly replay much of it. The post is not talking about short-term memory, because the number is prevented from being posted to the conversation stream, and the stream functions more like an exact scratchpad for the LLM. For a human, that would be a bit like having access to a transcript of your own speech.
items in awareness: Humans can keep a certain number of elements in their awareness at the same time, for example the number discussed in the post. They can report on them and manipulate them to some degree. Some people can do it visually or verbally or otherwise to different degrees. This is the “think of a number” the post is talking about. Humans have it. What is the corresponding thing for LLMs? Presumably the closest analog is the activation pattern in latent space. The question the post is asking is precisely: How closely does that activation space match human “thought”?
Congratulations! That makes a promising method to detect misalignment even cheaper. I think it is plausible that the simplification makes it more effective by reducing clutter that was never essential.
The next task now seems to be scaling it to larger models. Do you plan to work on that?

I think there are two conflicting goals here: Speed of acquiring information vs quality of reaction/voting. As habryka writes, the information at the top helps with filtering. But the bias is probably real.
Some people might prefer one or the other. It would be nice if the UI could offer both. It would be great if the more effortful but epistemically more valuable mode were rewarded in some way, e.g. by doubling the applied karma.