White Paper: Geometric Deconstruction and Dynamic Trajectory Analysis of Large Language Model Internal States
日期 / Date: December 24, 2025
作者 / Author: Eumi
主题 / Subject: 机械可解释性、高维语义空间、残差流动力学 / Mechanistic Interpretability, High-Dimensional Semantic Space, Residual Stream Dynamics
摘要 / Abstract
CN: 本白皮书旨在探讨大语言模型(LLM)“黑箱”机制的本质及其解密路径。通过将高维向量(Embedding)重新定义为基础特征的线性组合,并结合对残差流(Residual Stream)的时间/深度切片分析,本文构建了一套从静态结构到动态演变的认知框架。该框架提出,通过“切片观测”与“因果插针(Intervention)”相结合的方法,可以将非线性的神经网络推理过程还原为可观测的几何轨迹与逻辑回路。
EN: This white paper explores the nature of the “black box” mechanism in Large Language Models (LLMs) and a path to its decryption. By redefining high-dimensional embeddings as linear combinations of fundamental features, and by combining this with time/depth slice analysis of the Residual Stream, we construct a cognitive framework that spans static structure and dynamic evolution. The framework proposes that combining “slice observation” with “causal intervention” can reduce the non-linear reasoning process of a neural network to observable geometric trajectories and logical circuits.
1. 引言:黑箱的本质是压缩与多义性
1. Introduction: The Nature of the Black Box is Compression and Polysemanticity
CN: 目前的 LLM 被视为“黑箱”,并非因为其数学原理未知,而是因为其内部表征发生了极高强度的有损压缩(Lossy Compression)。由于特征维度远大于神经元数量,模型利用几何空间的叠加(Superposition)特性存储信息。这导致单个神经元表现出多义性(Polysemanticity),即同时响应多个无关概念(如同时代表“19世纪”和“圆形”)。这种统计学上的“压缩”,是导致人类难以直接解读的根本原因。
EN: Current LLMs are viewed as “black boxes” not because their mathematical principles are unknown, but because their internal representations undergo highly lossy compression (Lossy Compression). Because the number of features to be represented far exceeds the number of available neurons, models exploit the superposition (Superposition) property of geometric space to store information. This leads to polysemanticity (Polysemanticity) in individual neurons, where a single neuron responds to multiple unrelated concepts (e.g., representing both “19th century” and “circular”). This statistical “compression” is the fundamental reason direct human interpretation is difficult.
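As a self-contained illustration of this point (not an experiment from this paper; all dimensions and counts below are arbitrary assumptions), the following sketch packs eight features into a three-dimensional activation space and shows that any single coordinate, i.e. any single “neuron,” carries weight on several unrelated features:

```python
# Toy sketch of superposition and polysemanticity (illustrative only).
# Assumption: 8 "features" share a 3-dimensional activation space, in the
# spirit of Anthropic's "Toy Models of Superposition".
import numpy as np

rng = np.random.default_rng(0)
n_features, n_dims = 8, 3

# Each column is the direction assigned to one feature in activation space.
W = rng.normal(size=(n_dims, n_features))
W /= np.linalg.norm(W, axis=0, keepdims=True)   # unit-norm feature directions

# Activate a single feature and look at the resulting neuron activations.
x = np.zeros(n_features)
x[2] = 1.0                        # only "feature 2" is present
neurons = W @ x                   # 3 neuron values must encode 8 possible features

# Polysemanticity: one neuron has non-trivial weight on many features.
print("neuron 0 weights per feature:", np.round(W[0], 2))

# Interference: a naive linear readout of all features is non-zero even for
# absent features -- the price of lossy compression.
print("readout of all 8 features:", np.round(W.T @ neurons, 2))
```

Because eight directions cannot be mutually orthogonal in three dimensions, every readout carries interference from the other features; this interference is exactly the lossy, statistical “compression” described above.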
相关研究 / Key Reference:
Toy Models of Superposition (Anthropic): https://transformer-circuits.pub/2022/toy_model/index.html
2. 静态视角:语义流形的几何解码
2. Static View: Geometric Decoding of Semantic Manifolds
CN: 语义即几何。LLM 的知识库本质上是一个高维几何流形(Manifold)。我们提出的“反向连线”实质上是寻找从高维投影回低维可理解空间的映射函数。
线性探针(Linear Probes):验证了特定概念(如真/假、性别)在空间中存在固定的方向。
稀疏自编码器(Sparse Autoencoders, SAE):作为“解压器”,SAE 能够将叠加的混合向量还原为稀疏的、单义的“旧元素”特征列表(如拆解出单一的“金门大桥”特征)。
EN: Semantics is Geometry. The knowledge base of an LLM is essentially a high-dimensional geometric manifold. The “reverse wiring” we propose amounts to finding the mapping from this high-dimensional representation back to a low-dimensional, human-comprehensible space.
Linear Probes: Verify that specific concepts (e.g., true/false, gender) correspond to fixed directions in the representation space.
Sparse Autoencoders (SAE): Acting as “decompressors,” SAEs resolve superposed, mixed vectors back into sparse lists of monosemantic elementary features (e.g., isolating a distinct “Golden Gate Bridge” feature); a toy sketch of both tools follows this list.
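As a concrete illustration of both tools (not code from the referenced papers; the synthetic activations, dimensions, and hyper-parameters below are assumptions chosen for readability), the following sketch fits a linear probe on toy “activations” and trains a minimal sparse autoencoder as a “decompressor”:

```python
# Minimal sketch: a linear probe and a tiny sparse autoencoder (illustrative).
# Assumptions: activations are synthetic; d_model, d_dict and the L1 penalty
# are placeholders, not tuned settings from any cited work.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_dict, n_samples = 64, 256, 1024

# --- Linear probe: does a concept live along a fixed direction? -----------
truth_dir = torch.randn(d_model)                 # hidden "true/false" direction
acts = torch.randn(n_samples, d_model)
labels = (acts @ truth_dir > 0).float()          # concept present / absent

probe = nn.Linear(d_model, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
for _ in range(200):
    loss = nn.functional.binary_cross_entropy_with_logits(
        probe(acts).squeeze(-1), labels)
    opt.zero_grad(); loss.backward(); opt.step()
preds = (probe(acts).squeeze(-1) > 0).float()
print("probe accuracy:", (preds == labels).float().mean().item())

# --- Sparse autoencoder: decompress superposed activations ----------------
class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_dict):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)

    def forward(self, x):
        feats = torch.relu(self.enc(x))          # sparse feature activations
        return self.dec(feats), feats

sae = SparseAutoencoder(d_model, d_dict)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
for _ in range(200):
    recon, feats = sae(acts)
    loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()  # L1 sparsity
    opt.zero_grad(); loss.backward(); opt.step()
print("mean active features per sample:",
      (feats > 0).float().sum(dim=-1).mean().item())
```

In practice the probe would be fit on residual-stream activations collected from a real model, and the SAE dictionary would be much wider than the residual dimension; the shape of the computation, not the numbers, is the point here.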
相关研究 / Key Reference:
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (Anthropic): https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html
Representation Engineering: A Top-Down Approach to AI Transparency: https://arxiv.org/abs/2310.01405
3. 动态视角:残差流作为动力学系统
3. Dynamic View: Residual Stream as a Dynamical System
CN: 为了突破静态切片的局限,本框架引入时间维度,将 Transformer 的层深(Depth)视为时间轴(Time)。
切片分析(Logit Lens):通过对每一层进行切片观测,我们发现推理并非一蹴而就,而是经历了“模糊 → 猜测 → 修正 → 确定”的演变过程。
动力学轨迹(Trajectories):神经网络可被建模为常微分方程(ODEs)的离散化求解器。状态向量在不同层级的变化率(Velocity)存在显著差异,剧烈的变化往往对应关键推理步骤(如语义反转)的发生。
EN: To overcome the limitations of static slicing, this framework introduces the time dimension, treating the Transformer’s layer depth as the time axis.
Slice Analysis (Logit Lens): By reading off a slice at every layer, we find that reasoning does not happen in a single step but evolves through “ambiguity → guessing → correction → certainty.”
Dynamic Trajectories: The residual stream can be modeled as a discretized solver for an ordinary differential equation (ODE), with each layer acting as one integration step. The rate of change (velocity) of the state vector varies significantly across layers, and abrupt changes often mark key reasoning steps (such as semantic reversal); a runnable sketch of both measurements follows this list.
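The following sketch (an illustration under stated assumptions, not the framework’s reference implementation) applies the logit-lens idea to GPT-2 through the Hugging Face transformers library and, at the same time, measures the per-layer “velocity” of the residual stream; gpt2 is chosen only because it is small and public:

```python
# Logit lens + per-layer velocity on GPT-2 (illustrative sketch).
# Assumptions: the `transformers` package is installed and "gpt2" stands in
# for the larger models discussed in the text.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "The Eiffel Tower is located in the city of"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states: tuple of (n_layers + 1) tensors of shape (1, seq_len, d_model)
prev = None
for layer, h in enumerate(out.hidden_states):
    state = h[0, -1]                              # residual stream at final token
    # Logit lens: decode the intermediate state with the final norm + unembedding.
    logits = model.lm_head(model.transformer.ln_f(state))
    top_token = tok.decode(logits.argmax().item())
    # Velocity: how far the state moved relative to the previous layer.
    velocity = float((state - prev).norm()) if prev is not None else 0.0
    prev = state
    print(f"layer {layer:2d}  top: {top_token!r:>10}  velocity: {velocity:6.1f}")
```

On prompts like this one, the top token usually stabilizes only in the later layers, matching the “ambiguity → guessing → correction → certainty” progression described above, and the layers where the prediction flips tend to coincide with spikes in velocity.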
相关研究 / Key Reference:
interpreting GPT: the logit lens (nostalgebraist): https://www.lesswrong.com/posts/AcCR85ryue8zXKfNk/interpreting-gpt-the-logit-lens
Neural Ordinary Differential Equations (Chen et al.): https://arxiv.org/abs/1806.07366
4. 解决非线性难题:因果干预与回路
4. Addressing Non-linearity: Causal Intervention and Circuits
CN: 针对激活函数带来的非线性“混沌”现象,单纯的观测不足以完全解密,需引入主动干预(即“插针”)。
因果插针(Causal Tracing):在观测到状态剧变的特定层级,实施人为干扰(如添加噪声或替换向量)。若对位置 X 的干预直接导致输出 Y 的改变,则确立因果链条。
回路绘制(Circuit Mapping):通过将验证过的因果节点进行反向连线,构建局部功能子图(Sub-graphs)。这相当于绘制 AI 的“神经回路图”,将黑箱转化为“白箱”电路。
EN: Because activation functions introduce non-linear “chaos,” observation alone is insufficient for full decryption; active intervention (“pin insertion”) is required.
Causal Tracing: Apply deliberate perturbations (such as adding noise or swapping in alternative vectors) at the specific layers where the state changes most drastically. If intervening at position X directly changes output Y, a causal link is established.
Circuit Mapping: Wiring the verified causal nodes back together yields local functional sub-graphs. This amounts to drawing the AI’s “neural circuit diagram,” turning the black box into a “white box” circuit; a minimal patching sketch follows this list.
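The sketch below (an illustrative assumption, not the ROME implementation) shows the core move of causal tracing as activation patching on GPT-2: cache a hidden state from a clean run, splice it into a corrupted run at one layer, and check whether the answer’s logit recovers:

```python
# Activation-patching sketch on GPT-2 (illustrative; not the ROME code).
# Assumptions: prompts, the patched layer and the patched position are
# placeholders chosen for readability.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

clean = tok("The Eiffel Tower is located in the city of", return_tensors="pt")
corrupt = tok("The Colosseum is located in the city of", return_tensors="pt")
answer_id = tok(" Paris")["input_ids"][0]
layer = 6                                   # layer receiving the "pin"

# 1. Clean run: cache the residual stream coming out of `layer`.
with torch.no_grad():
    clean_state = model(**clean, output_hidden_states=True).hidden_states[layer + 1]

# 2. Corrupted run with a hook that splices in the clean state at that layer.
def patch_hook(module, hook_inputs, output):
    hidden = output[0].clone()
    hidden[:, -1, :] = clean_state[:, -1, :]        # patch the final position
    return (hidden,) + output[1:]

def answer_logit(batch):
    with torch.no_grad():
        return model(**batch).logits[0, -1, answer_id].item()

baseline = answer_logit(corrupt)
handle = model.transformer.h[layer].register_forward_hook(patch_hook)
patched = answer_logit(corrupt)
handle.remove()

# If patching this layer pushes the " Paris" logit back toward the clean run,
# the (layer, position) pair is causally implicated in producing the answer.
print(f"corrupted logit: {baseline:.2f}   patched logit: {patched:.2f}")
```

Sweeping the patched layer and position over the whole grid, and recording how much of the answer is recovered at each cell, is what produces causal-tracing maps in the style of the ROME paper; the cells that pass this test are the nodes that circuit mapping then wires back together.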
相关研究 / Key Reference:
Locating and Editing Factual Associations in GPT (ROME, Meng et al.): https://arxiv.org/abs/2202.05262
Zoom In: An Introduction to Circuits (Olah et al.): https://distill.pub/2020/circuits/zoom-in/
5. 结论
5. Conclusion
CN: 本框架提出的“切片观测+因果插针+几何重构”方法论,逻辑自洽地解释了从数据压缩到逻辑涌现的全过程。它证明了 AI 的“黑箱”并非不可知,而是高维空间中复杂的几何与动力学系统。通过系统的逆向工程,我们正在逐步绘制出这幅巨大的思维地图。
EN: The methodology of “slice observation + causal intervention + geometric reconstruction” proposed in this framework gives a logically self-consistent account of the entire process from data compression to the emergence of logic. It shows that the AI “black box” is not unknowable, but rather a complex geometric and dynamical system in a high-dimensional space. Through systematic reverse engineering, we are gradually drawing this immense map of thought.