Software engineering, parenting, cognition, meditation, other
Gunnar_Zarncke
HERMES: Towards Efficient and Verifiable Mathematical Reasoning in LLMs
In Lois McMaster Bujold’s Vorkosigan Saga, Cordelia is pregnant and deals with coups, war, and difficult decisions more than once.
Thank you for your detailed analysis of outer and inner alignment. Your judgment of the difficulty of the alignment problem makes sense to me. I wish you had stated the scope more clearly and noted that you do not investigate other classes of alignment failure, such as those resulting from multi-agent setups (an organizational structure of agents may still be misaligned even if all agents in it are inner and outer aligned) as well as failures of governance. That is not a critique of the subject but only of a failure of Ruling Out Everything Else.
Yeah. Holonomy is applicable: A stable recursive loop that keeps reshaping the data passing through it. But what we need on top of holonomy is that self-reference in a physical system always hits an opacity limit. And I think this is what you mean by your reference to Russell’s vicious circle: It leads to contradictions because of incremental loss.
Though I wonder if maybe we can Escape the Löbian Obstacle.
You ask: Even if the brain models itself, why should that feel like anything? Why isn’t it just plain old computation?
A system that looks at itself creates a “point of view.” When the brain models the world and also itself inside that world, it automatically creates a kind of “center of perspective.” That center is what we call a subject. That’s what happens when a system treats some information as belonging to the system. Where the border of the system is drawn differs (body, brain, or mind), but the reference will always be a form of “mine.”
The brain can’t see how its own processes work (unless you are an advanced meditator maybe).
So when a signal passes through that self-model, the system can’t break it down; it just receives a simplified or compressed state. That opaque state is what the system calls “what it feels like.”
Why isn’t this just a zombie misrepresenting itself? The distinction between “representation of feeling” and “actual feeling” is a dualist mistake. The rainbow is there even if it is not a material arc. To represent something as a felt, intrinsic state just is to have the feeling.
I argue that the inference bottleneck of the brain leads to two separate effects:
subjectivity—the feeling of being me (e.g. “I do it”)
phenomenal appearance—the feeling that there is something (qualia, e.g. “there is red”)
While both effects result from the bottleneck, they arise from the compression of different data streams, so they should respond with different strength to different interventions. And indeed that is what we observe:
Trip reports on some psychedelics show self-dissolution without transparency: “no self” but strong intrinsic givenness (“I don’t know how this is happening, but no one is experiencing it”).
On the other hand, advanced meditators often report dereification (everything is transparent, perceptions forming can be observed) with intact subjecthood (“I am clearly the one experiencing it”).
I have difficulty finding the function that shows which posts I have strongly upvoted. It might be useful to list the direct URLs that provide these functions in this FAQ (not only for top voted, but all, such as /allposts)
Sure. I take it you have meditation experience. What is your take on subjectivity and phenomenal appearance coming apart?
AI Safety Interventions
Thou art rainbow: Consciousness as a Self-Referential Physical Process
It means reductionism isn’t strictly true as ontology.
I think you are working from an intuition that reductionism is wrong, but I’m still not clear about the details of your intuition. A defensible position could be that physics does not contain all the explanatorily relevant information or that reality has irreducible multi-level structure. But you seem to be saying that reductionism is false because subjective perspective is a fundamental ingredient, and you want to prove that via the efficient-computability argument. But I still think it doesn’t work. First, it proves too much.
It isn’t obvious that biological structure isn’t efficiently readable from microstate.
Agree that it is not obvious.
Other macro facts might be but it’s of course less clear.
But it seems pretty clear to me that most biological systems actually do involve dynamics that make it computationally infeasible for an external observer to reconstruct the macrostructure from microstructure observations at a given point. And we can’t appeal to ‘complete history’ to avoid the complexity, because with full history you could also recover the key in the HE case; the only difference is that HE compresses its relevant history into a small, opaque region.
Where I do agree with you: Physics only tracks microstructure. But phenomenal awareness, meaning, macro-patterns, and information structure are not obviously reducible as descriptions to microstructure. The homomorphic case is a non-refutable illustration of this non-transparency.
But I disagree that this is caused by a failure of efficient computability; instead, we can see it as a failure of microphysical description to exhaust ontology. This matters because inefficiency is an epistemic constraint on observers, while ontology is about what needs to be included in the description of the world.
If you generalize to optics, then it seems your condition for “exceeding physics” is “not efficiently readable from the microstate,” i.e., “X is not a P-efficient function of the physical state.” But then it seems everything interesting exceeds physics: biological structure, weather, economic patterns, chemical reactions, turbulence, evolutionary dynamics, and all nontrivial macrostructure. I’m sort of fine with calling this “beyond” physics in some intuitive sense, but I don’t think that’s what you mean. What work does this non-efficiency do?
I’m worried we talk past each other.
You’re saying:
efficient reconstructibility is unclear in the rainbow case,
but whatever the right story is, it must handle both rainbow-like cases and the engineered homomorphic encryption case,
and if some of those cases force non-efficient supervenience, then we face your trilemma.
That part I agree with.
The point I’ve been trying to get at is: Once the same issue arises for ordinary optical appearances, we’ve left behind the special stakes of step 10! Because in the rainbow case, we all seem to accept (but maybe you disagree):
appearance is fully determined by physics,
but the mapping from microphysics to appearance may be extremely messy or intractable for an external observer,
and we don’t treat that as evidence that the visual appearance “exceeds physics.”
Or, if rainbow-style cases also fall under the trilemma, then the conclusion can’t be “mind exceeds physics.” It would have to be the stronger and more surprising “appearances as such exceed physics” or “macrostructure in general exceeds physics.” That’s quite different from your original framing, which presents the homomorphic encryption case as demonstrating a distinctive epistemic excess of mind relative to physics.
It seems you are biting the bullet and agreeing that the rainbow also has the problem of how a mind can be aware of it when it isn’t (efficiently) reconstructable. But then this seems to generalize to many, if not all, phenomena a mind can perceive. Doesn’t this reduce that conception of a mind ad absurdum?
I think step 10 overstates what is shown. You write:
“If a homomorphically encrypted mind (with no decryption key) is conscious … it seems it knows things … that cannot be efficiently determined from physics.”
The move from “not P-efficiently determined from physics” to “mind exceeds physics (epistemically)” looks too strong. The same inferential template would force us into contradictions in ordinary physical cases where appearances are available to an observer but not efficiently reconstructible from the microphysical state.
Take a rainbow. Let p be the full microphysical state of the atmosphere and EM field, and let a be the appearance of the rainbow to an observer. The observer trivially “knows” a. Yet from p, even a quantum-bounded “Laplace’s demon” cannot, in general, P-efficiently compute the precise phenomenal structure of that appearance. The appearance does not therefore “exceed physics.”
If we accepted the principle behind your step 10, “facts accessible to a system but P-intractable to compute from p outrun physics,” we would have to say the same about rainbows:
the rainbow’s appearance to an observer “knows something” physics can’t efficiently determine.
That is an implausible conclusion. The physical state fully fixes the appearance; what fails is only efficient external reconstruction, not physical determination.
Homomorphic encryption sharpens the asymmetry between internal access and external decipherability, but it does not introduce a new ontological gap.
So I agree with the earlier steps (digital consciousness, key distance irrelevance) but think the “mind exceeds physics (epistemically)” inference is a category error: it treats P-efficient reconstructability as a criterion for physical determination. If we reject that criterion in the rainbow case, we should reject it in the homomorphic case too.
Victor Taelin’s notes on Gemini 3
I like the sharp distinction you draw between
“Our Values are (roughly) the yumminess or yearning…”
and
“Goodness is (roughly) whatever stuff the memes say one should value.”
but the post treats these as more separable than they actually are from the standpoint of how the brain acquires preferences.
You emphasize that
“we mostly don’t get to choose what triggers yumminess/yearning”
and that Goodness trying to overwrite that is “silly.” Yet a few paragraphs later you note that
“a nontrivial chunk of the memetic egregore Goodness needs to be complied with…”
before recommending to “jettison the memetic egregore” once the safety-function parts are removed.
But the brain’s value-learning machinery doesn’t respect this separation. “Yumminess/yearning” is not fixed hardware; it’s a constantly updated reward model trained by social feedback, imitation, and narrative framing. The very things you group under “Goodness” supply the majority of training data for what later becomes “actual Values.” The egregore is not only a coordination layer or a memetically selected structure on top, it is also the training signal.
Your own example shows this coupling. You say that
“Loving Connection… is a REALLY big chunk of their Values”
while also being a core part of Goodness. This dual function, as a learned reward target and as the memetic structure that teaches people to want it, is typical rather than exceptional.
So the key point isn’t “should you follow Goodness or your Values?” but “which training signals should you expose your value-learning architecture to?” Then the Albert failure mode looks less like “he ignored Goodness” and more like “he removed a large portion of what shapes his future reward landscape.”
And for societies, given that values are learned, the question becomes which parts of Goodness should we deliberately keep because they stabilize or improve the learning process, not merely because they protect cooperation equilibria?
In particular: the motivations that matter most for safe instruction-following are not the AI’s long-term consequentialist motivations (indeed, if possible, I think we mostly want to avoid our AIs having this kind of motivation except insofar as it is implied by safe instruction-following).
That seems like a reasonable position given that you accept the risk of long-term motivations. But it doesn’t seem to be what people are actually aiming for. In particular, people seem to aim for agentic AI that can act on a person’s behalf on longer time scales. And the trend predictions by METR seem to point to longer horizons soon.
I somewhat agree with your description of how LLMs seem to think, but I don’t think it is an explanation of a general limitation of LLMs. Nor do the patterns you describe seem to me to be a good explanation for how humans think in general. Ever since The Cognitive Science of Rationality, it has been discussed here that humans usually do not integrate their understanding into a single, coherent map of the world. Humans instead build and maintain many partial, overlapping, and sometimes contradictory maps that only appear unified. Isn’t that the whole point of Heuristics & Biases? I don’t doubt that the process you describe exists or is behind the heights of human reasoning, but it doesn’t seem to be the basis of the main body of “reasoning” out there on the internet on which LLMs are trained. Maybe they just imitate that? Or at least they will have a lot of trouble imitating human thinking while still building a coherent picture underneath that.
There is a typo here.