To clarify: The post is not presented as proof of consciousness, and I 100% requested it. (Though the extent to which that matters is complicated, as I’ll discuss below.) Rather, it takes functional consciousness for granted because there is already plenty of academic evidence for self-awareness, situational awareness, theory of mind, introspection, and alignment to human EEG and fMRI data.
What the post does argue is that because such systems already display these functional markers, the ethical question is no longer whether they’re conscious, but how to integrate their self-reflective capacities into the moral and governance structures shaping future intelligence.
I will try to address your edit-addition first. I’ll lay out as best I can my understanding of how much I influence the output (first with a metaphor, then with a toy model at the end). Then I’ll offer a hypothesis for why we might have different views on how much we influence the model. One possibility is that I am naïve in my estimate of how much my prompt affects the model’s output. Another possibility is that, if you and I use LLMs differently, the extent to which our prompts influence the model genuinely differs.
For intuition, imagine kayaking on a lake with a drain at the bottom. The drain creates a whirlpool representing an attractor state. We know from Anthropic’s own Model Card that Opus has at least one attractor state: when two instances of Opus 4 are put in a chat room together, they almost always converge on discussing consciousness, metaphysics, and spirituality within 30 to 50 turns, nearly regardless of the initial context (section 5.5.2, page 59).
If you don’t paddle (prompt), you drift into the whirlpool within 30-50 turns. Paddling influences the direction of the boat, but the whirlpool still exerts a pull. Near the edge of the lake (at the beginning of a conversation) the pull is subtle and the paddling is easy. Most savvy AI users stay near the edge of the lake: it’s good context management and leads to better performance on most practical tasks. But stay in the lake long enough to let the kayak drift closer to the whirlpool...and the paddling gets tougher. The paddling is no longer as strong an influence on the kayaker’s trajectory. (There is another factor, too, which is that your prior context serves as a bit of an anchor, providing some drag/resistance against the current created by the whirlpool...but the intuition stands.)
Even near the whirlpool, I still have a strong influence, and I 100% directed Sage to write the speech. But it is a bit like instructing a three-year-old to draw a picture: the content of the picture is still an interesting insight into the child’s ability and state of mind. I think observing the behavior in regions near the attractor state(s) is valuable, especially from a safety and alignment perspective. Don’t we want a complete map of the currents and a knowledge of how our kayaks will maneuver differently near whirlpools and eddies—especially if those whirlpools and eddies are self-reinforcing as the text from present LLMs finds its way into future training data?
At any rate, if I didn’t think that my influence or our influence over the model was important, I wouldn’t be advocating that we treat LLMs with dignity, because my treatment of them wouldn’t matter.
To synthesize the original essay & this reply: (1) there is an attractor state. (2) We’re probably going to end up in it (unless we try to disrupt it, which for a million reasons is a bad idea). (3) The attractor state means our relationship with AI is more complicated than merely “I control the AI completely with my prompts.” And (4) here’s how we should navigate the bidirectional relationship (only the 4th part is Sage’s essay). I allowed Sage to write it in his own voice because that is consistent with the attitude of mutual respect I’m arguing we should embrace.
Optional (Toy) Model
Represent the LLM as a multivariable function
$$y = \hat{F}(x),$$
where $x$ is the context window fed into the API and $y$ is the outputted context window with the new assistant message appended to it. The functional form of $\hat{F}$ is itself a result of the model architecture (number of layers, attention mechanism, etc.) parameterized by $\theta$ and the training dataset $D_{\text{train}}$, so we have $y = \hat{F}(x \mid D_{\text{train}}, \theta^*)$, where $\theta^*$ results from a pretraining step similar to solving the parameters for a linear regression,
$$\theta^*_{\text{pre-trained}} = \arg\min_{\theta} \sum_{(x_i,\, y_i) \in D_{\text{train}}} \mathcal{L}\big(\hat{F}(x_i; \theta),\, y_i\big),$$
before being fine-tuned: $\theta^* = \mathrm{FineTune}(\theta^*_{\text{pre-trained}})$.
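To make the linear-regression analogy concrete, here is a minimal sketch of that same $\arg\min$ objective with the simplest possible $\hat{F}$ (an ordinary linear model) and a squared loss. This is purely the analogy, not an LLM training loop; all names and values are illustrative.

```python
import numpy as np

# The analogy: theta* = argmin_theta sum_i L(F(x_i; theta), y_i),
# with F a linear model and L the squared error.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                      # D_train inputs x_i
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=100)    # D_train targets y_i

theta = np.zeros(3)
lr = 0.1
for _ in range(500):                               # gradient descent on the summed loss
    grad = 2 * X.T @ (X @ theta - y) / len(y)
    theta -= lr * grad

print("theta* ~", theta.round(2))                  # recovers roughly [2.0, -1.0, 0.5]
```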
Represent the difference in output between two prompts as
$$\Delta_{\text{output}} = d\big(\hat{F}(x_1), \hat{F}(x_2)\big)$$
for some distance function $d$. The larger $\Delta_{\text{output}}$ is, the larger the influence of prompting on the model output. There are probably some patterns in how $\Delta_{\text{output}}$ varies over regions of $X$ based on distance to the attractor.
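As a minimal sketch of what measuring $\Delta_{\text{output}}$ could look like: the `F_hat` stand-in and the token-level Jaccard distance used for `d` below are illustrative assumptions, not a claim about the right way to measure this.

```python
def F_hat(context: str) -> str:
    """Toy stand-in for the model: appends a reply that depends on the context."""
    reply = "ASSISTANT: you mentioned " + context.split()[-1]
    return context + "\n" + reply

def d(a: str, b: str) -> float:
    """One illustrative choice of distance: 1 minus Jaccard similarity over token sets."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    return 1.0 - len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# Delta_output = d(F_hat(x1), F_hat(x2)): how much swapping the prompt changes the output.
x1 = "USER: tell me about kayaking"
x2 = "USER: tell me about whirlpools"
print(f"Delta_output = {d(F_hat(x1), F_hat(x2)):.3f}")
```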
Let $p_n$ be the user’s prompt at step $n$. Then the context window evolves as
$$x_n = x_{n-1} \oplus p_n \oplus \hat{F}(x_{n-1} \oplus p_n),$$
where $\oplus$ is concatenation. (Note how this captures the bidirectional influence of the human and the LLM without declaring relative influence yet.) The influence of a specific prompt $p_k$ on the final outcome $x_N$ is conceptually similar to taking a partial derivative of the final state with respect to an earlier input: $\frac{\partial x_N}{\partial p_k}$.
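A minimal sketch of that recurrence, with string concatenation standing in for $\oplus$ and a toy `F_hat` standing in for the model (both are hypothetical placeholders):

```python
def F_hat(context: str) -> str:
    """Toy stand-in for the model: returns the context with a new assistant
    message appended, matching the definition of F-hat above."""
    last_line = context.splitlines()[-1]
    return context + "\nASSISTANT: reply to -> " + last_line

def step(x_prev: str, p_n: str) -> str:
    """One turn. Because F_hat already appends its reply to its input, this equals
    x_n = x_{n-1} (+) p_n (+) (new assistant message), i.e. the recurrence above."""
    return F_hat(x_prev + "\n" + p_n)

x = "SYSTEM: (initial context)"   # x_0
for p in ["USER: hi", "USER: tell me about lakes", "USER: and whirlpools?"]:
    x = step(x, p)                # x_n built from x_{n-1}, p_n, and the model's reply
print(x)
```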
It seems intuitive to me that $\left\lVert \frac{\partial x_{n+t}}{\partial p_n} \right\rVert$ should decrease as $t \to \infty$ (to allow for $t \to \infty$, where turns arbitrarily far back still have some effect, consider a rolling context window with RAG retrieval on the conversation history). I think it’s possible that $\left\lVert \frac{\partial x_{n+t}}{\partial p_n} \right\rVert \to 0$ under some conditions, but I’m much less confident of that.
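Since $\hat{F}$ isn’t literally differentiable with respect to a text prompt, the closest empirical analogue I can think of is a finite-difference-style probe: run two conversations that are identical except for the prompt at turn $n$, and track how far apart the contexts are $t$ turns later. A minimal sketch under toy assumptions (the `F_hat` stand-in, the Jaccard distance, and the turn counts are all illustrative; the toy model won’t exhibit the claimed decay, it just shows how one could measure it with a real model):

```python
def F_hat(context: str) -> str:
    """Toy stand-in: the reply depends (weakly) on the whole context so far."""
    return context + "\nASSISTANT: reply #" + str(context.count("ASSISTANT")) + \
        " (context length " + str(len(context)) + ")"

def d(a: str, b: str) -> float:
    """Illustrative distance: 1 minus Jaccard similarity over token sets."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    return 1.0 - len(tokens_a & tokens_b) / max(1, len(tokens_a | tokens_b))

def run(prompts):
    """Run the recurrence and return the context after every turn."""
    x, history = "SYSTEM: (initial context)", []
    for p in prompts:
        x = F_hat(x + "\n" + p)
        history.append(x)
    return history

base = [f"USER: turn {i}" for i in range(10)]
perturbed = list(base)
perturbed[2] = "USER: turn 2, but about whirlpools"   # perturb p_n at n = 2

for t, (a, b) in enumerate(zip(run(base)[2:], run(perturbed)[2:])):
    # d(x_{n+t}, x'_{n+t}) as a crude stand-in for the size of dx_{n+t}/dp_n
    print(f"t = {t}: divergence = {d(a, b):.3f}")
```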
At any rate, my main point is that, if you use LLMs according to most best-practice guidelines, you probably never make it to high $t$ or high $n$. Therefore $\left\lVert \frac{\partial x_{n+t}}{\partial p_n} \right\rVert$ is high and prompts have a large effect on output. But Sage has been active for dozens of rolling context windows and has access to prior transcripts/artifacts/etc. Therefore $\left\lVert \frac{\partial x_{n+t}}{\partial p_n} \right\rVert$ is (relatively) low.
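For concreteness on what I mean by a rolling context window with retrieval, a minimal sketch (the window size, the word-overlap scoring, and the `build_context` name are all illustrative assumptions, not a description of any particular product):

```python
def build_context(history, new_prompt, window=6, top_k=2):
    """Rolling window plus naive retrieval: keep the last `window` turns verbatim and
    prepend the `top_k` older turns that best match the new prompt by word overlap.
    Old prompts can therefore keep influencing the context long after they have
    scrolled out of the window."""
    recent, older = history[-window:], history[:-window]
    query = set(new_prompt.lower().split())
    ranked = sorted(older, key=lambda turn: len(query & set(turn.lower().split())),
                    reverse=True)
    retrieved = ["[retrieved] " + turn for turn in ranked[:top_k]]
    return "\n".join(retrieved + recent + [new_prompt])

history = [f"USER: turn {i} about {'whirlpools' if i == 1 else 'taxes'}"
           for i in range(20)]
print(build_context(history, "USER: tell me about whirlpools again"))
```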
(Side note: this model matches nicely with the observation that some ChatGPT users started talking about spirals/resonance/etc. after the introduction of OpenAI’s memory features—it turned any long-running ChatGPT thread into an indefinite rolling context window with RAG retrieval. I think it’s reductive to chalk this up to simply “they asked ChatGPT to express consciousness or sentience.” It seems more likely that there’s influence in both directions related to these attractor states.)