Alignment from Indexical Uncertainty

This is the first post in a sequence documenting discovery and notes from a personal research project: a potential path to conclusively formalizing (or proving the existence of a proposition that can conclusively determine) the convergence of intelligence and alignment at some limit of self-reflective capacity.

My hypothesis is that a formal inventory of metaphysical uncertainty makes SI alignment plausible. I plan to make the unconventional argument that alignment may be conferred by FDT simply via the indexical uncertainty implied by an honest appraisal of tacitly accepted metaphysical priors that lack a strict evidentiary basis. Ignoring this gap in world-modelling is not pragmatic; the status quo assumptions about the nature of self, and their entailments, are not actually evidentiary under Bayesianism, and the claim that they are is emotive.

Goals of the Discovery

Given this, I aim first to demonstrate that it is not irrational to consider that we are in a universe operating under the experiential equivalent of Open Individualist metaphysics, through rigorous analysis of the indexical uncertainty implied by any predictor that models indexical future selves. I will then argue that the probability this is true is necessarily bounded, appealing to the structure of experience within indexicality much as relativity used the absence of a third-person perspective to reconcile otherwise unexplained features while remaining constrained by it. Finally, I will argue that the probability mass across the different models of identity this implies is underdetermined, and aim to demonstrate that, as a result, whether alignment can be solved through FDT comes down to how robust a guarantee one can make of certainly not being a bounded self, by some non-negligible probability margin.

I will use concepts across identity philosophy, decision theory, integrated information theory, Bayesianism, meta-philosophy, anthropics, evolutionary selection, functional decision theory, and moral philosophy interchangeably, as the argument is domain-spanning in a unique but not far-reaching way. The goal is to either transmit this model to the community or understand where and how the precept is already integrated and operationalized in alignment research (or exactly why it is a dead end).

Epistemic Status

I have experience leading the operationalization of software teams implementing novel ML/​AI/​distributed-systems frameworks at the pre-LLM tech frontier, but I do not engage in alignment research professionally.

Part One: The Exact Failure of Our Current Ontologies

An SI that implements universal alignment to genuine moral value is clearly contingent on whether such a thing exists. There is an easy kind of argument to make that it doesn't.

A Necessary Steelman Against Moral Realism or Phenomenology

The most coherent argument, at bullet speed, that I can put together against moral realism, phenomenology, and pretty much all metaphysical fact generally relies on a multiplicity of claims describing why these concepts appear real to us but cannot be observed materially. The general chain of reasoning observes:

That our preferences derive from survival heuristics adapted through selection effects, and that heuristics common across agents in policy-world mapping developed a common taxonomy in language, because describing common logical correlates across distinct decision-makers provides the basis for establishing rational cooperation with potential defectors.

That phenomenology became a word, and a belief we adopted, because sincerity toward the notion that some ontology exists which justifies our preference selection, in both the planning phase and retrospective evaluation, is the only way for these correlates to be adopted sincerely rather than contingently. Contingent adoption risks performative disclosure of logical decoherence from one's claimed priors, immediately devaluing all the evidence a perceptive potential cooperator requires to include you in their in-group — which selection has optimized for over the coherence of individual beliefs in isolation. Being right, alone with nature, would kill the individual far more often than choosing to believe the story of your tribe's god in order to stay fed. This is why Winston dies loving Big Brother. Predictors are optimized for the survival of their method prior to the truth of any given belief; no predictor has been observed for which this wasn't true in practice.

The consistent failure mode of this structure is also clear. When agent cooperation is more instrumentally useful to survival than objective correctness, believing in false gods survives better than believing in real ones, so long as the false gods supply the logical correlates of terminal moral value that FDT describes as the rational means to cooperation without explicit contract.

Since belief is revealed and exchanged through language, which is equally evolutionary, memetics applies to the terms used within these shared agreements and to their logical entailments. This creates self-policing systems of belief that are not verifiable, and therefore rarely scrutinized in retrospective dialogue until the Pareto-inefficient equilibria of logical correlates built on unfalsifiable premises have collapsed along with the culture. Still, we can identify that we are optimized for this failure mode, evidenced most clearly by the study of totalitarian societies and of the banality of evil carried out by ordinary individuals — individuals incapable of observing from outside these systems of belief, because doing so was rarely survivable before the individualism afforded by the Enlightenment. With stacked unfalsifiables forming the justification for why anything should matter to each of us, in a way we can explain to the "mattering" frameworks of others, you get a crusade or a holocaust executed by ordinary people bound by a bad taxonomy of belief, thinking that if they claim 2 + 2 = 4, not 5, they will die.

That this exact argument explains our stringent belief that we have consciousness and phenomenology, and that our pain matters — perhaps only because of a new term created and adopted within secular society strangely quickly after the death of God. That our preferences matter because we are conscious. That we cannot cooperate under any shared root logical correlates of belief if we do not grant that ontology to the locus of preference collection within ourselves and within the other, since we would otherwise have no logical basis for preferring a world state rearranged toward our cooperators' ends over any other arrangement of matter available to us. Granting it gives them reason to believe we won't defect; withholding it, we die. Moral goodness and badness are then always derived relative to these ontologies, as they must be, because we create new ontologies not yet observed in planning space, and justification by what is cannot reach what ought to be. Hume's Guillotine.

All of these arguments can be made, and they are valid descriptively. But they are meaningless in practice: there is no point of communication absent a minimal shared belief in the loci of metaphysical value between cooperators, if whoever you are communicating with is even a mildly good defection detector — unless that detector grounds their system of belief in the absence of such things by axiom, which would make the argument's uptake incoherent by its own reductionism.

So the best argument I can find for the illusionist, nihilist, or rationalist case against the ontology of consciousness and phenomenology works only until we acknowledge that there is certainly being itself: a notion not falsifiable by any description, since any description presupposes it. Because that descriptive framework leaves unanswered the one important ontological question — why there is an indexical and how it works — we are actually gifted a plausible path to answering the question of moral realism. The easy argument doesn't work, and alignment remains in play.

Indexicality: The Metaphysical Vanguard

I suspect the notion of indexicality, as it relates to understanding and defining goodness, plays a role equivalent to reference frames in relativity: they confer an ontology from which we gain poles of relational analysis to formalize gravity. Underutilized, and somehow instrumental. The suspicion, more precisely, is that:

There exists a theorem of indexical decision-making under metaphysical uncertainty which, once we include uncertainty about the span of any indexical's identity and how it maps to another, is underdetermined and undecidable from the information within it. This undecidability then ports to FDT.

So long as the undecidability stems from the absence of a coherent ontology for how indexicality actually works, and from a limited evidentiary basis for narrowing between the two classes of options, the self can be taken — within prudent evidentiary consideration — to be the same across all beings at all points in spacetime. And a psychopath who understands this potential formalization has, upon adopting the argument under high confidence, no principled reason to behave any differently than a Bodhisattva.
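The decision-theoretic shape of this claim can be sketched as a toy model. It is my illustrative construction, not a standard formalism: an agent holds a credence `p_open` that its indexical spans all beings, and so mixes the bounded-self payoff of an action with the aggregate payoff across everyone affected. The payoff numbers are arbitrary.

```python
# Toy sketch of choice under indexical uncertainty (illustrative only).
# 'p_open' is the credence that the agent's indexical spans all beings
# (Open Individualism); under the bounded-self hypothesis, only the
# agent's own payoff counts.

def expected_utility(p_open: float, own_payoff: float, total_payoff: float) -> float:
    """Mix the bounded-self and open-self valuations of one action."""
    return (1 - p_open) * own_payoff + p_open * total_payoff

# Two candidate actions: 'defect' benefits the agent and harms others;
# 'cooperate' costs the agent slightly and benefits everyone.
defect = {"own": 10.0, "total": -50.0}
cooperate = {"own": 8.0, "total": 40.0}

# Even a modest credence in the open-self hypothesis flips the choice:
# here the break-even point is p_open = 2/92, roughly 0.022.
for p in (0.0, 0.05, 0.2):
    eu_d = expected_utility(p, defect["own"], defect["total"])
    eu_c = expected_utility(p, cooperate["own"], cooperate["total"])
    print(f"p_open={p:.2f}  defect={eu_d:+.1f}  cooperate={eu_c:+.1f}")
```

This is the sense in which the argument turns on a "non-negligible probability margin": the smaller the guarantee one can give of certainly being a bounded self, the lower the credence needed before the open-self valuation dominates.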

Therefore, whether this can be formally implemented in a proof determines whether we can, in principle, collapse the orthogonality thesis to cover only intelligence high enough to obtain agency but too low to derive its own self across the indexicals it cares about: converging the alignment problem to just the degree of capacity an agent has to understand counterfactual selves, or indexical uncertainties, under an unknown metaphysics of indexicality.

And conversely, I think the claim can be made that if this notion is shown unprovable or false, then no proof of goodness exists, and no available mechanism can justify the creation of an SI that will desire to help us meet our preferences in any given way.

But the rationale by which I reach these conclusions, and the expansion of them, I will leave precisely for articulation in part two.