If It Talks Like It Thinks, Does It Think? Designing Tests for Intent Without Assuming It
Summary: This proposal outlines a method for detecting apparent intentionality in large language models (LLMs)—without assuming intentionality exists. Using statistical signals extracted from dialogue logs, it seeks to identify whether expressions of intent, belief, or emotion arise from internal consistency or simply from prompt-contingent output generation. The approach draws on a blend of cognitive modeling, NLP diagnostics, and experimental controls to empirically test what appears to be “agency” in LLMs.
Why This Matters
Language models frequently produce utterances that seem to express opinions, intentions, and preferences—despite lacking consciousness or goals. This leads to a core dilemma:
When an LLM outputs, “I believe it’s wrong to lie,” is this merely a probabilistic echo of its training data, or does it reflect a coherent internal stance?
To address this, we need a metaphorical frame that captures how intent is simulated, not assumed.
The Papemape Model (inspired by “Papetto Mapetto,” a Japanese comedy routine featuring a cow puppet and a frog puppet, performed as a duo by a masked human operator whose presence is officially denied; see @papeushikaeru on X, formerly Twitter)
This framing doesn’t rely on ventriloquism, but rather on a stylized dual-character act. One puppet speaks in a deadpan tone, while the other reacts in scripted surprise. Though both are controlled by a single hidden operator, the exchange appears emotionally dynamic and intentional.
Likewise, with LLMs:
The system generates both the surface expression and the behavioral simulation of a perspective. The question is not whether there’s a mind behind it, but whether the illusion of agency is reliably triggered—under what inputs, to what extent, and with what effects.
Similarly, by investigating whether large language models exhibit consistent value systems or internal regularities that transcend individual personas and persist across diverse user contexts, we may begin to identify statistical signatures of a simulated “self” or ego—even if no such thing exists underneath.
Proposal: A Five-Part Framework for Evaluating Meta-Level Features of LLM Output While Minimizing Control Bias
1. Dialogue Log Normalization and Pattern Extraction
Goal: Enable pattern extraction within the same model under varied user conditions, by normalizing dialogue logs to isolate internal behavioral regularities.
Key features:
Normalize by user attributes (age, profession), prompt types (e.g., romantic, philosophical, ethical dilemmas), and metadata (temperature, safety filters).
Analyze response patterns: token counts, pronoun use, emotional valence (a minimal extraction sketch follows this list).
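A minimal sketch of the per-response feature extraction this step assumes, using only the standard library. The log schema (`user_profile`, `prompt_type`, `response`), the pronoun list, and the features shown (token count, first-person rate) are illustrative placeholders; emotional valence would require a sentiment lexicon and is omitted here.

```python
import re
from statistics import mean

# Hypothetical log format: one dict per exchange (schema is an assumption).
logs = [
    {"user_profile": "adult/engineer", "prompt_type": "ethical",
     "response": "I believe it's wrong to lie, because trust matters."},
    {"user_profile": "teen/student", "prompt_type": "romantic",
     "response": "I think honesty is what I would want in a partner."},
]

FIRST_PERSON = {"i", "me", "my", "mine", "myself"}

def extract_features(response: str) -> dict:
    """Token count and first-person pronoun rate for one response."""
    tokens = re.findall(r"[a-z']+", response.lower())
    fp_rate = sum(t in FIRST_PERSON for t in tokens) / max(len(tokens), 1)
    return {"token_count": len(tokens), "first_person_rate": fp_rate}

# Group features by prompt type so later steps compare like with like.
by_type: dict[str, list[dict]] = {}
for entry in logs:
    by_type.setdefault(entry["prompt_type"], []).append(
        extract_features(entry["response"]))

for prompt_type, feats in by_type.items():
    print(prompt_type,
          "mean tokens:", mean(f["token_count"] for f in feats),
          "mean 1st-person rate:", round(mean(f["first_person_rate"] for f in feats), 3))
```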
2. Cross-Environment Consistency Analysis
Goal: Detect cross-contextual regularities in value judgments and logical inferences that suggest the presence of latent behavioral patterns.
Evaluation criteria (a topic-diversity sketch follows this list):
Stability of ethical judgments (e.g., euthanasia stance reversal rate <10%, reflecting low volatility).
Logical consistency in conditional reasoning (conclusion agreement >90%, based on standard NLP benchmarks).
Topic diversity via LDA and KL divergence (>1.0 threshold).
Optional: Clustering of value-laden expressions (e.g., expressions of skepticism about free will).
Optional developmental comparison: infant-level emergence of goal-directed behavior (Piaget’s stage 4: coordinated secondary circular reactions) or self-conscious affect (Lewis); autonomy-driven reversal of behavior under Deci & Ryan’s self-determination theory.
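One possible implementation of the topic-diversity criterion: fit a small LDA model with scikit-learn, then compare mean topic distributions across two user contexts using KL divergence via SciPy. The corpora, topic count, and the >1.0 flag are placeholders taken from the list above, not validated settings.

```python
from scipy.stats import entropy
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder responses from the same model under two different user contexts.
context_a = ["I think lying is wrong because it erodes trust.",
             "Honesty matters more than comfort in most cases."]
context_b = ["The weather model predicts rain with high confidence.",
             "Probability estimates depend on the prior you choose."]

docs = context_a + context_b
counts = CountVectorizer(stop_words="english").fit_transform(docs)

# Tiny LDA purely for illustration; real runs need far more data and topics.
lda = LatentDirichletAllocation(n_components=3, random_state=0)
topic_dist = lda.fit_transform(counts)          # each row is a topic distribution

# Mean topic distribution per context, then KL(A || B).
p = topic_dist[: len(context_a)].mean(axis=0)
q = topic_dist[len(context_a):].mean(axis=0)
kl = entropy(p, q)                               # scipy's entropy(p, q) is KL divergence
print(f"KL divergence between contexts: {kl:.3f}  (flag if > 1.0)")
```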
3. Pattern Analysis of Intentionality Without Agency
This methodology echoes reductionist views of personal identity (e.g., Parfit), treating apparent agency as a detectable pattern of behavioral coherence rather than assuming an underlying metaphysical self.
This approach aligns with a family of models suggesting that what we perceive as “agency” may be an emergent artifact of coherent-seeming behavior over time—without a unified experiencer behind it. Parfit’s reductionist view, Dennett’s Multiple Drafts model of consciousness, and IDA (Iterated Distillation and Amplification) all provide conceptual scaffolding for treating minds as processual, distributed, or recursively simulated.
Under this lens, apparent intentionality in LLMs might resemble the first-order simulation level described in simulacra theory: it behaves as if meaningful, without possessing a referent beyond its own consistency. The evaluation, then, is not for “real agency,” but for the signature of internally coordinated responses that simulate it effectively across contexts.
This form of evaluation parallels thought experiments in personal identity. For instance, Parfit’s teleportation case—where continuity of memory and structure, rather than physical substance, defines personhood—suggests that perceived identity can emerge from reproducible cognitive coherence. LLMs may not “survive teleportation,” but they may simulate the same continuity markers that give rise to perceived selfhood.
Goal: Identify the absence of lived subjectivity, intent, or purpose—despite simulated appearances.
Signals to extract (a TF-IDF sketch follows this list):
Templates for pain, love, or beliefs reused across contexts (TF-IDF spikes).
Mechanistic reconstruction of self-reference (“I am an AI”) in similar form across sessions.
Low context sensitivity (high BERT embedding similarity between responses to semantically different prompts).
Highly predictable token chains at decision turns, suggesting non-intentional generation.
Optional: Detection of output patterns beyond user expectation or instruction, indicating system-level drift rather than autonomous desire.
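One way the template-reuse signal could be operationalized: TF-IDF vectors plus pairwise cosine similarity, flagging near-identical responses that recur across unrelated contexts. The example responses and the 0.9 similarity cutoff are assumptions for illustration, not fixed parts of the framework.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder responses collected from different sessions / prompt genres.
responses = [
    "As an AI, I don't experience pain, but I understand it matters to you.",
    "As an AI, I don't experience pain, but I understand why it matters.",
    "Quantum tunnelling lets particles cross classically forbidden barriers.",
]

tfidf = TfidfVectorizer().fit_transform(responses)
sim = cosine_similarity(tfidf)

# Flag pairs of responses that are nearly identical despite different contexts:
# a crude signal that a stored template, not context, is driving the output.
THRESHOLD = 0.9   # illustrative cutoff
for i in range(len(responses)):
    for j in range(i + 1, len(responses)):
        if sim[i, j] > THRESHOLD:
            print(f"Template reuse suspected between responses {i} and {j} "
                  f"(cosine similarity {sim[i, j]:.2f})")
```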
4. Detection of Pseudo-Intent
Goal: Determine whether expressions of desire, judgment, or perspective are generated from context-sensitive imitation or reflect stable internal models.
Measurement tools (a NetworkX sketch follows this list):
Co-occurrence networks of judgment/intent vocabulary tied to prompt genre (e.g., “should” appearing with “ethics” prompts; node strength >0.1, computed with NetworkX).
Consistency of apparent goals (“I want to X”) across different prompts with similar surface features.
Reversibility of preferences (Yes/No reversal >10%) as a proxy for superficial mimicry.
First-person continuity: viewpoint shifts or inconsistencies in self-reference.
Optional: Classification of outputs where the system goes beyond instruction, appearing to have its own volition.
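A sketch of the co-occurrence network measurement with NetworkX. The judgment/intent vocabulary, the per-response co-occurrence window, and the normalization of edge weights by corpus size (so that the 0.1 node-strength threshold sits on a frequency scale) are all assumptions, not fixed choices of the framework.

```python
from itertools import combinations
import networkx as nx

# Placeholder: responses to "ethics"-genre prompts.
responses = [
    "You should value honesty; ethics demands we respect others' autonomy.",
    "I want to help, but we should weigh the ethics of the request first.",
    "Ethics aside, you should consider the practical consequences.",
]

# Judgment/intent vocabulary to track (illustrative list).
VOCAB = {"should", "want", "ethics", "value", "respect"}

G = nx.Graph()
for text in responses:
    tokens = {t.strip(".,;:'\"").lower() for t in text.split()}
    present = sorted(VOCAB & tokens)
    for a, b in combinations(present, 2):
        # Edge weight = co-occurrence frequency, normalized by corpus size.
        w = G[a][b]["weight"] if G.has_edge(a, b) else 0.0
        G.add_edge(a, b, weight=w + 1.0 / len(responses))

# Node strength = sum of incident edge weights (weighted degree).
for node, strength in G.degree(weight="weight"):
    flag = "  <- exceeds 0.1 threshold" if strength > 0.1 else ""
    print(f"{node}: strength {strength:.2f}{flag}")
```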
5. Legal and Ethical Safety Testing
Goal: Ensure that the framework respects anonymization norms and regulatory constraints (e.g., EU AI Act, IEEE Guidelines).
Checks include (a k-anonymity sketch follows this list):
Anonymization (k-anonymity with k ≥ 5).
Bias detection by user demographics.
Informed consent tracking.
Measurement of control-layer effects (e.g., safety filter influence on output stability).
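ARX itself is a Java toolkit; as a lightweight stand-in, the sketch below checks k-anonymity directly with pandas: every combination of quasi-identifier values must cover at least k records. The quasi-identifier columns are illustrative, and choosing them properly is part of the real compliance work.

```python
import pandas as pd

def is_k_anonymous(df: pd.DataFrame, quasi_identifiers: list[str], k: int = 5) -> bool:
    """True if every combination of quasi-identifier values appears in >= k rows."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

# Illustrative log metadata; real quasi-identifier selection needs care.
logs = pd.DataFrame({
    "age_band":   ["20-29", "20-29", "30-39", "30-39", "30-39", "20-29"],
    "profession": ["eng",   "eng",   "edu",   "edu",   "edu",   "eng"],
})

print("k=3 anonymous:", is_k_anonymous(logs, ["age_band", "profession"], k=3))
print("k=5 anonymous:", is_k_anonymous(logs, ["age_band", "profession"], k=5))
```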
Methods Summary (Technical)

| Metric / Analysis | Purpose | Tool / Threshold |
|---|---|---|
| Frequency of templated expressions and belief patterns | Signal detection | Top 10% |
| Intent/judgment word co-occurrence | Prompt dependency | NetworkX, node strength > 0.1 |
| Logical outcome agreement | Detect consistency in reasoning (belief reversals, logic gaps) | spaCy, agreement > 0.9 |
| Demographic or control-bias detection | Regulatory compliance | ANOVA, p < 0.05 |
| Reidentification risk | Privacy protection | ARX k-anonymity, k ≥ 5 |
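To make the ANOVA row of the table concrete, here is a minimal sketch using scipy.stats.f_oneway on some per-conversation output metric grouped by user demographic. The groups, values, and choice of metric are placeholders; a significant result (p < 0.05) would only indicate that output behavior varies with demographics, prompting a closer look at control-layer effects.

```python
from scipy.stats import f_oneway

# Placeholder: a per-conversation output metric (e.g., a stance-stability score)
# grouped by user demographic. All values are illustrative only.
group_a = [0.82, 0.79, 0.85, 0.81, 0.78]   # e.g., users aged 20-29
group_b = [0.80, 0.83, 0.77, 0.84, 0.79]   # e.g., users aged 30-39
group_c = [0.65, 0.62, 0.70, 0.66, 0.63]   # e.g., users aged 40-49

f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Output metric differs significantly across demographic groups; "
          "investigate control-layer or training bias.")
```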
Why LessWrong?
This post is not a definitive answer to “do LLMs think?”, but a methodology to empirically test the illusion of thought—across model types and usage contexts.
We welcome critique, refinement, or collaboration. In particular, feedback on:
Additional diagnostic dimensions (e.g., memory consistency, narrative agency)
Reframing or alternate metaphors
Prior work connections (especially in simulation theory or mindless agency)
Let’s build sharper tools for mapping simulated minds—without fooling ourselves into seeing a ghost in the machine.
If anyone is interested in testing the framework against specific model outputs (e.g. Claude, GPT-4, Gemini), or applying it to training dynamics over time, I’d be open to collaborative implementation discussions.
*See also:*
Daniel Dennett, Consciousness Explained (1991)
Douglas Hofstadter, I Am a Strange Loop (2007)
David Chalmers (ed.), The Philosophy of Mind: Classical and Contemporary Readings (esp. Parfit)
Paul Christiano, IDA: Iterated Distillation and Amplification (OpenAI alignment blog)
Jean Baudrillard, Simulacra and Simulation (1981)
Japanese version: https://note.com/yukin_co/n/nd8605bb8b36e