Software engineering, parenting, cognition, meditation, other
Gunnar_Zarncke
Did you see my post Parameters of Metacognition—The Anesthesia Patient? While it doesn’t address consciousness, it does discuss aspects of awareness. Do you agree with at least awareness having non-discrete properties?
Mesa-optimizers are the easiest case of detectable agency in an AI. There are more dangerous cases. One is distributed agency, where the agent is spread across tooling, models, and maybe humans or other external systems, and the gradient driving its evolution is the combination of local and overall incentives.
Mesa-optimization was introduced in Risks from Learned Optimization: Introduction, and, probably because it was the first type of learned optimization to be described, it has driven much of the conversation. It makes some implicit assumptions: that the learned optimizer is compact in the sense of being a sub-system of the learning system, coherent due to the simple incentive structure, and a stable pattern that can be inspected in the AI system (hence Mechinterp).
These assumptions do not hold in the more general case where agency from learned optimization may develop in more complex setups, such as the current generation of LLM agents, which consist of an LLM, scaffolding, and tools, including memory. In such a system, memetic evolution of the patterns in memory or external tools is part of the learning dynamics of the overall system, and we can no longer go by the incentive gradients (benchmark optimization) of the LLM alone. We need to interpret the system as a whole.
Just because the system is designed as an agent doesn’t mean that the actual agent coincides with the designed agent. We need tools to deal with these hybrid systems. Methods like Unsupervised Agent Discovery could help pin down the agent in such systems, and Mechinterp has to be extended to span borders between LLMs.
Yes, the description length of each dimension can still be high, but not arbitrarily high.
Steven Byrnes talks about thousands of lines of pseudocode in the “steering system” in the brain-stem.
Does the above derivation mean that values are anthropocentric? Maybe kind of. I’m deriving only an architectural claim: bounded evolved agents compress their control objectives through a low-bandwidth interface. Humans are one instance. AIs are different: they are designed, and any evolutionary pressure on the architecture does not act on anything value-like. If an AI has no such bottleneck, inferring and stabilizing its ‘values’ may be strictly harder. If it has one, it depends on its structure. Alignment might generalize, but not necessarily to human-compatible values.
Value Learning Needs a Low-Dimensional Bottleneck
Existing consciousness theories do not make predictions.
Huh? There are many predictions. The obvious ones:
Global Workspace Theory: If you’re conscious of something, your brain “broadcasts” it widely; if you’re not, processing stays local. If you knock out the broadcast network, awareness fades.
Recurrent Processing Theory: You see something consciously only when sensory areas engage in recurrent feedback loops. If the feedback is blocked, you can still process it a bit, but you don’t experience it.
Higher Order Theory: A thought becomes conscious when your mind forms a thought about that thought (e.g., “I’m seeing red”). If the self-monitoring layer is damaged, awareness of and confidence in the perception collapses.
Empirically measurable effects of at least some aspects of consciousness are totally routine in anesthesia—otherwise, how would you be confident the patient is unconscious during a procedure? The Appendix of my Metacognition post lists quite a few measurable effects.
I think the problem is again that people can’t agree on what they mean by consciousness. I’m sure there is a reading where there are no predictions. But any theory that models it as a Physical Process necessarily makes predictions.
Thanks for writing this! I finally got around to reading it, and I think it is a great reverse-engineering of these human felt motivations. I think I’m buying much of it, but I have been thinking of aggregation cases and counterexamples, and would like to hear your take on them.
Envy toward a friend’s success
A friend wins an award; I like them, but I feel a stab of envy (I sometimes may even wish they’d fail). That is negative valence without an “enemy” label, and not obviously about their attention to me. For example:
when another outperforms the self on a task high in relevance to the self, the closer the other the greater the threat to self-evaluation. -- Some affective consequences of social comparison and reflection processes: the pain and pleasure of being close; Tesser et al., 1988
Is the idea that the “friend/enemy” variable is actually more like “net expected effect on my status,” so a friend’s upward move can locally flip them into a threat?
Admiration for a rival or enemy
I can dislike a competitor and still feel genuine admiration for their competence or courage. If “enemy” is on, why doesn’t it reliably route through provocation or schadenfreude? Do you think admiration is just a different reward stream, or does it arise when the “enemy” tag is domain-specific?
Compassion for a stable enemy’s suffering
E.g., an opposing soldier or a political adversary is injured and I feel real compassion, even if I still endorse opposing them.
Two-thirds of respondents (65 per cent) say they would save the life of a surrendering enemy combatant who had killed a person close to them, but almost one in three (31 per cent) say they would not. The same holds true when respondents are asked if they would help a wounded enemy combatant who had killed someone close to them (63 per cent compared with 33 per cent). -- PEOPLE ON WAR Country report Afghanistan
This feels like “enemy × their distress” producing sympathy rather than schadenfreude. Is your take that “enemy” isn’t a stable binary at all—that vivid pain cues can transiently force a “person-in-pain” interpretation that overrides coalition tagging?
Gratitude / indebtedness
Someone helps me. I feel gratitude and an urge to reciprocate. It doesn’t feel like “approval reward” (I’m not enjoying being regarded highly). It feels more like a debt.
Perceived benevolent helper intentions were associated with higher gratitude from beneficiaries compared to selfish ones, yet had no associations with indebtedness. -- Revisiting the effects of helper intentions on gratitude and indebtedness: Replication and extensions Registered Report of Tsang (2006)
Do you see gratitude as downstream of the same “they’re thinking about me” channel, or as a separate ledger?
Private guilt
People often report guilt as a direct response to “I did wrong,” even when they’re confident nobody will know.
when opportunities for compensation are not present, guilt may evoke self-punishment. -- When guilt evokes self-punishment: evidence for the existence of a Dobby Effect
I’m not sure that fits guilt from “imagined others thinking about me.” It looks like a norm-violation penalty that doesn’t need the “about-me attention” channel. Do you have a view on which way it goes?
Aggregation Cases
I have been wondering whether the suggested processing matches what we would expect for larger groups of people (who could each be friend/enemy and/or thinking of me or not). There seem to be at least two different processes going on:
Compassion doesn’t scale with the number of people attended to. This seems to be well established for the identifiable victim effect and psychic numbing. When harm is spread over many victims, affect often collapses into numbness unless one person becomes vivid. That matches your attentional bottleneck.
But evaluation does seem to scale with headcount, at least in stage fright and other audience effects.
Maybe a roomful of people can feel strongly like “they’re thinking about me,” even if you’re not tracking anyone individually? But then the “about-me attention” variable would be computed at the group level, which complicates your analysis.
Alignment may be less about “aligning agents” and more about discovering where the agents are. The critical agent may not be the entity that is designed (e.g., an LLM chat), but the hybrid entity resulting from multiple heavily scaffolded LLM loops across multiple systems.
What do you think about my arguments in Thou art rainbow: Consciousness as a Self-Referential Physical Process?
Parameters of Metacognition—The Anesthesia Patient
I have read almost all of this dialog, and my half-serious upshot is:
An agent A can’t prove that another agent B is correct in both its reasoning and its semantics, but that doesn’t matter because it can’t trust its own reasoning to that degree either.
This glosses over a lot of details in the long and charitable comment thread above. I tried to get an overview of it with ChatGPT. I’m surprised how well that worked:
ChatGPT 5.2 extended-thinking summary of the misunderstanding:
1) Shared formal background (what they both accept)
Let:
L be a recursively axiomatized theory strong enough for Löb-style self-reference.
Prov_L(⌜φ⌝) be the standard provability predicate; write □_L φ := Prov_L(⌜φ⌝).
U be an “environment/world”.
S be a semantic interpretation mapping formulas of L to propositions about U. Morgan describes this as a semantic map S:Form(L)→C(U).
True_S(φ) abbreviate “S(φ) holds in U”.
A natural “soundness schema relative to S” is:
Sound(L,S) := ∀φ (□_L φ → True_S(φ)).
The Löbian obstacle setup (as Morgan summarizes it) is that a designer agent A wants to rely on proofs produced by a subordinate B, and this seems to demand something like a schema □_L φ → φ (or its intended-world analogue) for arbitrary φ, which is blocked by Löb-ish reasoning.
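For reference, the precise blocking fact is Löb’s theorem (a standard result, not anything specific to this thread):

$$\text{if } L \vdash \Box_L \varphi \rightarrow \varphi, \text{ then } L \vdash \varphi.$$

So L endorses an instance of the reflection schema □_L φ → φ only for φ it already proves, and taking φ = ⊥ recovers Gödel’s second incompleteness theorem; a blanket schema for arbitrary φ is therefore unavailable inside L.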
So far: aligned.
2) Where the models diverge: what counts as “solving” the obstacle
Demski’s implicit “success criterion” (formal delegation)
Demski treats “escape the obstacle” as: produce an agent design whose decision procedure can rationally delegate mission-critical tasks under something like the formal constraints studied in the Löbstacle literature.
A clean formalization of what he keeps asking for is:
There is an action rule π_A such that A will take some action a after receiving some certificate from B,
and A’s normative warrant for that reliance is itself captured inside the agent’s formal reasoning (or at least specified as an explicit part of the algorithm), not as an unanalyzed extra assumption.
Hence his repeated demand: “what modification of the agent designs do you have in mind?” and later: “I still don’t understand how you propose for the agent to reason/behave differently.”
Morgan’s “success criterion” (epistemic/interface diagnosis)
Morgan’s post is primarily: diagnose the source of the apparent paradox as a category error—treating a semantic claim (“my proofs track the world”) as if it were a syntactic theorem inside L.
His move is to relocate Sound(L,S) out of L and into a separate layer of “metalogical belief” that lives at the interface between proofs and world-interpretation. He says the agent need not (and cannot) prove the semantic mapping is sound; rather, the agent assumes/believes it when applying proofs to the world.
This is why he says “the agent doesn’t change its reasoning” (meaning: the proof calculus inside L is unchanged).
3) The first misunderstanding: “metalogical beliefs” as axioms vs as an external bridge
Demski hears:
“Allowing for metalogical beliefs immediately dissolves the obstacle.”
and translates it into the most standard formal move:
Build L′:=L+Sound(L,S)
Then he points out the familiar reflection hierarchy problem: L′ can at best endorse L, but not itself; so you get “trust weaker but not equal-strength” all over again. He explicitly frames this as “you just get a new system L′… which doesn’t believe its own soundness.” and distinguishes “L is sound” from “L plus me is sound.”
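Spelled out (this is the textbook reflection hierarchy, not anything from the thread): set

$$L_0 := L, \qquad L_{n+1} := L_n + \mathrm{Sound}(L_n, S).$$

Each L_{n+1} proves the soundness of L_n, but by the same Löbian argument no consistent L_n can prove its own reflection schema, so trust only ever flows down to strictly weaker systems.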
Morgan, meanwhile, is not proposing “add Sound(L,S) as an axiom used inside the proof system that generates actions.” He keeps insisting the semantic map is “defined on L, not on L′,” and that for the delegation story you can restrict attention to accepting L-proofs.
So the mismatch is:
Demski’s formalization: metalogical belief ⇒ explicit axiom(s) in an extended theory that the agent uses for decisions.
Morgan’s formalization: metalogical belief ⇒ an extra-theoretic acceptance bridge “when I see an L-proof, I treat its conclusion as world-true under S”—not something asserted/proved within L.
Formally, Morgan is closer to specifying a two-layer agent:
Object layer: derive □_L φ by proof search.
Bridge layer (not inside L): an acceptance rule Acc(φ) that (defeasibly) upgrades “□_L φ” into “act as if True_S(φ).”
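A minimal toy sketch of that two-layer split, in Python (my own illustration, not something proposed in the thread; Certificate, verify, and interpret are hypothetical stand-ins for a proof certificate, a syntactic L-proof checker, and the semantic map S):

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Certificate:
    formula: str   # the formula phi whose provability B claims
    proof: str     # an alleged L-proof of phi

def object_layer(cert: Certificate, verify: Callable[[str, str], bool]) -> bool:
    # Purely syntactic: check that cert.proof really is an L-proof of cert.formula.
    return verify(cert.formula, cert.proof)

def bridge_layer(cert: Certificate,
                 verify: Callable[[str, str], bool],
                 interpret: Callable[[str], str]) -> Optional[str]:
    # Extra-theoretic acceptance rule Acc: defeasibly upgrade "provable in L"
    # to "act as if True_S(phi)". Nothing here is proved inside L.
    if object_layer(cert, verify):
        return interpret(cert.formula)
    return None

# Toy usage: a trivial "logic" whose only proofs are the string "axiom:" + formula.
verify = lambda phi, proof: proof == f"axiom:{phi}"
interpret = lambda phi: f"S({phi}) holds in U"

cert = Certificate(formula="p -> p", proof="axiom:p -> p")
print(bridge_layer(cert, verify, interpret))  # S(p -> p) holds in U
```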
Demski’s objection is then immediate: if Acc is where the action happens, you have changed the decision architecture; and unless Acc is itself formally grounded, you haven’t solved the original agent-design problem, only renamed it. This is exactly his “back pocket” complaint: “fine, I grant the agent ‘has’ the semantic map; how does it use it?”
4) The second misunderstanding: what “reasoning” denotes
They overload “reasoning” differently.
Morgan uses “reasoning” ≈ “applying L’s inference rules to formulas” and treats semantic application as a separate stage.
Demski uses “reasoning” ≈ “the whole cognitive pipeline that produces decisions,” so if semantics affects what you do, it’s part of the reasoning/agent design.
This is why Demski keeps hearing a contradiction in Morgan’s stance:
“No, the agent doesn’t change its reasoning… The change is that we don’t require A to prove that their semantic mapping is sound…”
Under Demski’s formal lens, “not requiring proof” is already a behavioral modification, because the original Löbstacle agent was (by assumption) constrained to act only when it could justify that constraint internally.
5) The third misunderstanding: “trust B” = trusting proofs vs trusting delegated action
Morgan tries to narrow “trust” to “trusting L-proofs”:
B can even be inconsistent internally, as long as it outputs valid L-proofs; then A just checks the proof and uses it.
Demski pushes back that the Löbstacle was about mission-critical delegation, not merely proof checking; restricting to a weaker fragment can amount to “make the delegate dumber,” which is the classic non-solution.
Formally, Demski is insisting on a property like:
A delegates ⟹ ∃ nontrivial tasks where B’s full cognition matters, not just L-proof output.
Morgan is (at least in that subthread) defending a weaker property:
A trusts B ⟹ A accepts any syntactically valid L-proof produced by B as action-relevant.
Those are different “trust” relations, and each is natural given a different target.
6) The “is/ought” mini-misfire is the same structural issue
Morgan phrases the bridge as normative:
if φ is provable in L, the interpretation of φ “ought to be true”.
Demski hears “ought” as moral/normative in the usual sense and suggests rewriting it as a prediction (“will probably be true”).
But structurally, this is the same dispute:
Morgan is describing Acc: a norm of acceptance/decision (“treat proofs as truth-tracking”).
Demski wants a descriptive link inside the formal model that would license that acceptance without handwaving.
7) The clean formal diagnosis: they are proving different theorems
If you compress all of the above into two “theorems” they are implicitly trying to establish:
Morgan’s implicit claim (interface theorem)
If an agent’s action rule is allowed to use an extra-logical acceptance bridge Acc that is not required to be provable inside L, then Löb’s prohibition “L can’t prove its own soundness schema” does not block the agent from in fact relying on L-proofs in the world (conditional on S being sound).
This is basically: “Löb blocks internal proof of Sound(L,S), not the practical stance of treating proofs as reliable.”
Demski’s implicit claim (agent-design theorem)
If we keep the Löbstacle-style constraint that delegation must be justified within the agent’s own formal apparatus (or equivalently, we forbid a free external Acc oracle), then Morgan’s move either (i) reappears as L′ and triggers the reflection hierarchy, or (ii) reduces to trusting only weaker fragments, which is not the desired kind of delegation.
So: Morgan is dissolving a philosophical inference (“therefore no agent can trust its reasoning”), while Demski is defending a formal obstacle statement about a particular class of reflective agent designs.
Both can be simultaneously right—because the quantifiers differ.
It gets the core point:
the Löbstacle was about mission-critical delegation, not merely proof checking; restricting to a weaker fragment can amount to “make the delegate dumber,” which is the classic non-solution.
It is a non-solution in the sense that it doesn’t let the sub-agent B run free because it can be fully trusted. But Morgan’s move does seem to enable a safe kind of delegation. So practically, the different approaches come down to:
Rogers: Delegating computation to find checkable proofs
Demski: Delegating open-ended judgment and autonomous action
Clearly, 1 is weaker than 2. But we can’t get 2 anyway, so getting 1 seems like a win.
And maybe we can extend 1 into a full agent by wrapping B in a verifier, as sketched below. That would nest for repeated delegation.
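A rough sketch of what I mean by the wrapping (again my own toy illustration with hypothetical names; verify stands in for an L-proof checker): each delegate returns a conclusion together with an alleged L-proof, the wrapper only passes through conclusions whose proofs check, and since a wrapped delegate can itself call other wrapped delegates, the construction nests.

```python
from typing import Callable, Optional, Tuple

# A delegate answers a task with (formula, alleged L-proof of that formula).
Delegate = Callable[[str], Tuple[str, str]]

def wrap_with_verifier(delegate: Delegate,
                       verify: Callable[[str, str], bool]) -> Callable[[str], Optional[str]]:
    """Only conclusions whose proofs check out pass through; B's internals are never trusted."""
    def checked(task: str) -> Optional[str]:
        formula, proof = delegate(task)
        return formula if verify(formula, proof) else None
    return checked

# Toy verifier and two delegation levels: B delegates part of its work to C,
# but only accepts C's output through the same kind of verifier it is itself wrapped in.
verify = lambda phi, proof: proof == f"axiom:{phi}"

def C(task: str) -> Tuple[str, str]:
    return (f"{task}_lemma", f"axiom:{task}_lemma")

checked_C = wrap_with_verifier(C, verify)

def B(task: str) -> Tuple[str, str]:
    lemma = checked_C(task)                      # nested, verified delegation
    phi = f"{task}_done" if lemma else "fail"
    return (phi, f"axiom:{phi}")

checked_B = wrap_with_verifier(B, verify)
print(checked_B("job"))  # job_done
```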
I’m not sure we can directly apply solid-state physics to NNs, but we may approximate some parts of the NNs with a physical model and transfer theorems there. I’m thinking of Lorenzo Tomaz’s work on Momentum Point-Perplexity Mechanics in Large Language Models (disclaimer: I worked with him at AE Studio).
Unsupervised Agent Discovery
What is the relative cost between Aerolamp and regular air purifiers?
For regular air purifiers, ChatGPT 5.2 estimates 0.2 €/1000 m³ of filtered air.
From the Aerolamp website:
How many Aerolamps do I need?
Short answer: 1 for a typical room, or about every 250 square feet
Long answer: It’s complicated
Unlike technologies like air filters, the efficacy of germicidal UV varies by pathogen. Some pathogens, like human coronaviruses, are very sensitive to far-UVC. Others are more resistant. However, there is significant uncertainty in just how sensitive various pathogens are to UV light.
The key metric to look for in all air disinfection technologies is the Clean Air Delivery Rate (CADR), usually given in cubic feet per minute (cfm). A typical high-quality portable air-cleaner has a CADR of around 400 cfm—a more typical one will deliver 200 cfm.
For a typical 250 square foot room with 9 foot ceilings, Aerolamp has an expected CADR of 200-1500 cfm, depending on the pathogen and the study referenced.
And ChatGPT estimates 0.02–0.3 €/1000 m³ for the Aerolamp—quite competitive, especially given that it is quieter.
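For transparency, here is the kind of back-of-the-envelope conversion behind such €/1000 m³ figures. All input numbers below (power draw, device price, filter cost, electricity price, runtime) are my own illustrative assumptions, not values from ChatGPT or the Aerolamp site; only the cfm-to-m³/h conversion factor is fixed.

```python
# Back-of-the-envelope cost per 1000 m³ of cleaned air.
# All inputs below are illustrative assumptions, not measured values.

CFM_TO_M3_PER_H = 1.699  # 1 cubic foot per minute ≈ 1.699 m³/h

def cost_per_1000m3(cadr_cfm, power_w, price_eur_per_kwh,
                    device_eur=0.0, consumables_eur_per_year=0.0,
                    lifetime_years=5, hours_per_year=3000):
    """Electricity + amortized device + consumables, per 1000 m³ of delivered clean air."""
    m3_per_h = cadr_cfm * CFM_TO_M3_PER_H
    eur_per_h = (power_w / 1000) * price_eur_per_kwh \
        + device_eur / (lifetime_years * hours_per_year) \
        + consumables_eur_per_year / hours_per_year
    return eur_per_h / m3_per_h * 1000

# Hypothetical HEPA purifier: 200 cfm CADR, 40 W, 200 € device, 50 €/year filters, 0.30 €/kWh
print(round(cost_per_1000m3(200, 40, 0.30, device_eur=200, consumables_eur_per_year=50), 2))
```

With those assumptions a 200 cfm purifier lands around 0.12 €/1000 m³, i.e., the same order of magnitude as the estimates above.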
I’m not arguing either way. I just note this specific aspect that seems relevant. The question is: Is the baby’s body more susceptible to alcohol than an adult’s body? For example, does the baby’s liver work better or worse than an adult’s? Are there developmental processes that can be disturbed by the presence of alcohol? By default I’d assume that the effect is proportional (except maybe the baby “lives faster” in some sense, so the effect may be proportional to metabolism or growth speed or something). But all of that is speculation.
From DeJong et al. (2019):
Alcohol readily crosses the placenta with fetal blood alcohol levels approaching maternal levels within 2 hours of maternal consumption.
https://scispace.com/papers/alcohol-use-in-pregnancy-1tikfl3l2g (page 3)
I have pointed at least half a dozen people (all of them outside LW) to this post in an effort to help them “understand” LLMs in practical terms. More so than to any other LW post in the same time frame.
Related: Unexpected Conscious Entities
Both posts approach personhood from orthogonal angles:
Unexpected Things that are People looks at outer, assigned personhood or selfhood, recognized by the social environment.
Unexpected Conscious Entities looks at inner attributes that may identify agency or personhood.
This suggests a matrix:
| | High legal / social personhood | Low / no legal personhood |
|---|---|---|
| High consciousness-ish attributes | Individual humans | Countries |
| Low / unclear consciousness-ish attributes | Corporations, Ships, Whanganui River | LLMs (?) |
Between Entries
[To increase immersion, before reading the story below, write one line summing up your day so far.]
From outside, it is only sun through drifting rain over a patch of land, light scattering in all directions. From where one person stops on the path and turns, those same drops and rays fold into a curved band of color “there” for them; later, on their phone, the rainbow shot sits as a small rectangle in a gallery, one bright strip among dozens of other days.
From outside, a street is a tangle of façades, windows, people, and signs. From where a person aims a camera, all of that collapses into one frame—a roadside, two passersby, a patch of sky—and with a click, that moment becomes a thumbnail in a grid, marked only with a time beneath it.
From outside, a city map is a flat maze of lines and names on the navi. From where a small arrow marked as the traveler moves, those lines turn into “the way home,” “busy road,” a star marking “favorite place”; afterwards, the day’s travel is saved as one thin trace drawn over the streets, showing where they went without saying what it was like to walk there.
From outside, a robot’s shift is paths and sensor readings scrolling past on a monitor, then cooling into a long file on a disk. From where its maintenance program runs at night, that whole file is scanned once, checked for errors, and reduced to a short tag: “OK, job completed 21:32.” In the morning, a person wonders about the robot, presses a key, and sees that line.
From outside, one of the person’s days is a neat stack: a calendar block from nine to five, a few notifications, the number of steps and minutes of movement in a health app. From where they sit on the edge of the bed that night, phone in hand, what actually comes back is a pause under a tree, a sentence in one of those messages, the feeling in their stomach just before one of those calls; a sense of what they will write about the day later.
From outside, the question is a short sound in the room: “How was your day?” From where the person’s attention tilts toward it, the whole day leans on the edge of the answer: the pause under the tree, the urgent message, the glare off a shop window, the walk home with tired feet. After a moment, they say, “pretty good.”
From outside, the diary holds that same day as four short lines under a date, ink between two margins. From where the person leans over the page to write them, the whole evening presses in at once with the smell of the room, the weight in their shoulders, a tune stuck in their head. And only a few parts make it into words before the pen lifts and the lamp goes out.
From outside, years later, the diary is a closed block on a shelf among others. From where the same person sits with it open on their knees, that day comes back first as slanting lines under the date, a word scratched out and rewritten. The scenes seem to grow straight out of the words: sun between showers, a laugh on a staircase, the walk home in fading light. They wait for something else to come up, but their mind keeps going back to the page.
From outside, a later page holds only a line near the bottom: “Spent the evening reading old diaries.” From where they wrote it, what filled that night was less the days themselves than the pages: the weight of the stacked volumes on their lap, the slants of their younger handwriting, and the more confident tone.
From outside, the name on the inside cover is only a few letters on each booklet. From where the person sees that name above all the pages, it runs like a thin thread through the pauses under trees, the calls they dreaded, the walks home in the rain.
From outside, this evening is a room with a chair, a bedside table, a closed notebook on top. From where the person sits, there is the cover under one hand, the fabric of the chair under the other, breath moving, the particular tiredness of this day in their limbs; after a while they open the diary to today’s date, stare at the empty space, write two quick words, close the book again, and sit there a moment longer noticing that they are staring at the words on the paper while the room carries on around them.
Hm. It all makes sense to me, but it feels like you are adding more gears to your model as you go. It is clear that you absolutely have thought more about this than anybody else and can provide explanations to me that I cannot fully wrap my mind around, and I’m unable to tell whether this is all coming from a consistent model or is more something plausible you suggest could work.
But even if I don’t understand all of your gears, what you explained allows us to make some testable predictions:
Admiration for a rival or enemy with no mixed states
That should be testable with a 2×2 brow/cheek EMG design where cues for one target person are rapidly paired with alternating micro-prompts that highlight admirable vs. blameworthy aspects. Cueing “admire” should produce a rise in cheek activity and/or a drop in brow activity, cueing “condemn” should flip that pattern, with little simultaneous activation of both.
Envy clearly splits into frustration and craving
Envy should split into frustration and craving signals in an experiment where a subject doesn’t get something either due to a specific person’s choice or due to pure bad luck (or, as a control, systemic scarcity). Then frustration should always show up, but social hostility should spike only in the first case. Which seems very likely to me.
Private guilt splits into discovery and appeasement
If private guilt decomposes into discovery and appeasement, a 2×2 experiment where discoverability is low vs. impossible and a victim can be appeased or not should show this: reparative motivation should be strongest when either discovery is plausible or appeasement is possible, while “Dobby-style” self-punishment should occur especially when appeasement is conceptually relevant but blocked.
Aggregation
This would predict that stage fright should scale with gaze cues, i.e.,
the number of faces the subject is scanning,
the number of people visibly looking at the subject,
and optionally cues indicating evaluation (such as clapping or whispering).
This should be testable with real-life interventions (people in the audience could wear masks, look away, do other things) or VR experiments, though not cheaply.
I’m not sure about good experiments for the other cases.