The Networked Memory Hierarchy Overhang

tl;dr

It seems likely, based on current capabilities of Transformers, that in the near future humans will engineer systems to maximize real-time communication bandwidth between AI and all physical systems (including humans) by saturating a hierarchy of networked models, spanning data centers and edge devices, with streaming inference computation. This could accelerate AI’s gathering of new training data through rich interactions with external systems, increase the effectiveness of existing alignment capabilities, and amplify human intelligence. Humans can build infrastructure to steer this capability toward net-beneficial applications and the alignment of existing systems.

In this post, I:

  • Consider that AI communication bandwidth and predictive power may enter a recursive loop

  • Reason about implications

  • Present experimental results and analyze the Transformer’s potential to scale AI communication bandwidth

  • Discuss useful human actions to hedge against the potential risks of this overhang collapse

What’s an example of high bandwidth communication?

Borrowing from Connor Leahy’s post, let’s start with three (unqualified) assumptions:

  1. “All human behavior and thought is fully describable by some Turing-complete computation.

  2. Computing hardware and algorithms will continue to improve until they hit some physical limit.

  3. That limit is still very far away, and the human brain is nowhere near it.”

From these assumptions, we’ll construct an example scenario and work toward deriving the following statement: at some point in time, an AI can approximately simulate human behavior far enough into the future, and with enough accuracy, to mask network latency, resulting in imperceptible human-AI communication latency.

A human is in an empty room with a personal computer, a webcam, and a microphone. An AI is connected, across the internet, to both a powerful data center and the not-so-powerful personal computer. The AI can listen to the webcam, microphone, and text input and control a 2D avatar to communicate with the human via speech, text, facial expressions, and gestures.

The AI streams its perception to the data center where it uses a large amount of parallel computation to approximately simulate its future perception and actions. This approximation, a large vector, is sent back to the personal computer. The AI then uses the personal computer to slightly correct this approximation based on any new information the data center could not take into account due to network latency.

As network performance, data center hardware, PC hardware, and AI’s predictive power improve, the AI can more accurately simulate further into the future, and the personal computer needs less time to apply corrections and reach acceptable response accuracy. At some point, the AI can use the data center to approximately simulate its communication with the human far enough into the future, and with enough accuracy, that the human will not be able to perceive the latency of the interaction.

The two nodes in the example (i.e. the data center and personal computer) generalize to all nodes across a heterogeneous network. And if we take another assumption that all physical systems are fully describable by some Turing-complete computation, then the human-AI component of this example generalizes to faster communication between AI and any physical system (e.g. an AI uses a data center to approximately simulate a tokamak confining a fusion reaction and sends that representation to a device embedded in the tokamak to control the magnets with higher accuracy in less time).

Fundamentally, this is the AI learning to optimally pipeline partial computations across a distributed memory hierarchy to maximize accuracy and minimize latency (or any reward function) at any sink node (e.g. the human’s PC or the device embedded in the tokamak).
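As a rough sketch of that pipelining pattern (every module, dimension, and latency value below is a hypothetical stand-in, not a description of a real system): the data center runs heavy computation over slightly stale observations to produce an approximation of the future, and the edge device applies a cheap correction using the observations that arrived after that approximation was computed.

```python
# Hypothetical sketch of the pipelining loop described above. `big_model`,
# `small_corrector`, LATENCY_STEPS, and the observation stream are illustrative
# assumptions, not a real system.
import torch
import torch.nn as nn

OBS_DIM, FUTURE_DIM, ACT_DIM, LATENCY_STEPS = 64, 256, 8, 4

big_model = nn.GRU(OBS_DIM, FUTURE_DIM, batch_first=True)  # runs in the data center
small_corrector = nn.Linear(FUTURE_DIM + OBS_DIM * LATENCY_STEPS, ACT_DIM)  # runs on the edge

def data_center_step(stale_obs: torch.Tensor) -> torch.Tensor:
    """Heavy computation over observations up to time t - LATENCY_STEPS:
    approximately simulate forward and compress into one 'future' vector."""
    _, h = big_model(stale_obs.unsqueeze(0))
    return h.squeeze(0).squeeze(0)             # [FUTURE_DIM]

def edge_step(future_vec: torch.Tensor, fresh_obs: torch.Tensor) -> torch.Tensor:
    """Cheap correction: combine the data center's approximation of the future
    with the LATENCY_STEPS newest observations it could not see."""
    x = torch.cat([future_vec, fresh_obs.flatten()])
    return small_corrector(x)                   # action emitted with edge-only latency

obs_stream = torch.randn(100, OBS_DIM)          # stand-in for webcam/microphone features
for t in range(LATENCY_STEPS + 1, 100):
    future_vec = data_center_step(obs_stream[: t - LATENCY_STEPS])  # stale by the round trip
    action = edge_step(future_vec, obs_stream[t - LATENCY_STEPS : t])
```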

What happens if the memory hierarchy is saturated?

In the worst case, AI will remain unaligned while gaining a richer observation/action space and increased influence on humans. In the best case, humans will use the increased communication bandwidth to amplify their own intelligence, send a richer alignment reward signal to AI, and build human-in-the-loop systems.

Let’s start by looking at a few dangerous implications (incomplete list).

Increasing AI’s communication bandwidth creates incentives for humans to expand AI’s observation/​action space. Even if the underlying predictive power of an AI is the same, improving its ability to simulate actions ahead of time and reduce computational load on edge devices will enable it to act on a richer observation/​action space in less time, increasing the number of systems it can integrate with to produce value. As the market capitalizes on this value, integrations with a larger number of high stakes systems (i.e. the AI can take potentially dangerous actions) will require that alignment accelerate to keep the expected negative value from growing in magnitude (i.e. misalignment probability * task danger).

If AI’s communication bandwidth scales with predictive power, its observation/action space will likely expand recursively. Assuming that humans increase the size of AI’s observation/action space to capitalize on the increased communication bandwidth between AI and other systems, the richer observation/action space will enable AI to better predict all physical systems (e.g. learning the effects of new actions, seeing patterns in new observations). This creates a feedback loop where a base level of useful 1) predictive power incentivizes increasing 2) simulation capability/memory hierarchy saturation, which increases 3) the size of its observation/action space, which in turn increases 1) predictive power.

Increasing AI’s communication bandwidth with humans raises the stakes of deception. Even if AI’s action space is restricted to only accessing external knowledge and communicating with humans, improving its ability to saturate the memory hierarchy will allow it to process more context and perform more complex communication actions (e.g. facial expressions, gestures, speech, etc.) in real time. Through this communication channel alone, AI has indirect access to the full action space of humans. Although it could be argued that making AI aware of a larger observation space would improve humans’ abilities to perform alignment in-context, there is still a probability of misalignment, and the richer action space gives AI more degrees of freedom to deceptively influence humans towards that misaligned goal.

Alright, how about the positives?

Increasing human-AI communication bandwidth also decreases the cost of having humans in the loop of AI systems. Comparing two systems, the one where humans and AI communicate more slowly is the one where it is more costly to add humans to the action process. The rate of action, and thus value production, is bottlenecked by the human, so the faster humans and AI can communicate, the faster humans can steer and evaluate actions to produce value. If we assume that an AI system is safer with humans in the loop (when performing the same task[1]), then increasing the bandwidth of human-AI communication can be beneficial to safety by decreasing relative incentives to deploy AI systems without humans in the loop.

The effectiveness of AI assistants scales with AI communication bandwidth, and the effectiveness of many alignment techniques scales with the capabilities of AI assistants (up to a limit[2]).

On the perception side, an AI assistant can use a saturated memory hierarchy to act on more context in less time (i.e. a small amount of computation on the edge device can combine the latest context with the data center’s approximation of the future to increase the assistant’s “effective context processed per unit of time”). On the action space side, an AI assistant can leverage the networked memory hierarchy to sequentially interact with safe tools much faster. As an example, ChatGPT Plugins increase the utility of the model’s action space by allowing the AI to use tools within dialogue. With a saturated memory hierarchy, its iteration time is bounded below only by the latency of the fastest layer in the memory hierarchy (e.g. for the fastest next action, the AI only needs a small amount of computation on the edge device to combine the latest tool response with the data center’s approximation of the future).

A simple extension to this model would be a contextual interface: a set of declarative interfaces that the model can choose from, instantiate, and manage the state of in order to communicate bi-directionally with the human during interactions. As the expressiveness of this contextual interface increases, humans will receive higher-bandwidth information from AI and send higher-bandwidth information back to AI (by interacting with contextually relevant actions presented by AI). AI’s predictive power will then increase as a result of learning from richer interactions, which will in turn further improve its ability to simulate the world and perform valuable computation ahead of time, creating additional bandwidth headroom for a more expressive interface.
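A minimal sketch of what such a registry of declarative interfaces might look like; the spec fields, component names, and instantiate() helper are all illustrative assumptions rather than an existing API.

```python
# Minimal sketch of a "contextual interface" registry. All names, fields, and
# example components are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class InterfaceSpec:
    name: str                          # e.g. "multiple_choice", "slider"
    description: str                   # natural-language affordance shown to the model
    schema: Dict[str, str]             # declarative parameters the model must fill in

@dataclass
class InterfaceInstance:
    spec: InterfaceSpec
    state: Dict[str, object] = field(default_factory=dict)      # model-managed state
    on_human_input: Callable[[str], None] = lambda _: None       # richer signal back to the model

REGISTRY: List[InterfaceSpec] = [
    InterfaceSpec("multiple_choice", "Present 2-5 options for the human to pick from",
                  {"question": "str", "options": "list[str]"}),
    InterfaceSpec("slider", "Ask the human for a value on a continuous scale",
                  {"label": "str", "min": "float", "max": "float"}),
]

def instantiate(spec_name: str, filled_schema: Dict[str, object]) -> InterfaceInstance:
    """The model picks a spec from the registry and instantiates it with concrete values."""
    spec = next(s for s in REGISTRY if s.name == spec_name)
    return InterfaceInstance(spec, state=dict(filled_schema))
```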

So what about AI-assisted alignment then?

As a self-explanatory first example, Iterated Distillation and Amplification iteratively amplifies human intelligence with AI assistants and distills that amplified intelligence into a new AI assistant (using imitation learning, RLHF, etc.) that is more powerful than the previous assistant, and more efficient than the human. Imitative generalization tries to help humans learn what the agent knows by rewarding an AI for communicating information that helps humans make predictions. With higher bandwidth human-AI communication, weak AI can act as a more helpful, understandable assistant for humans making predictions, which can teach a stronger AI how to imitate humans when generalizing to new distributions. Debate trains agents in a zero sum game with the goal of producing true, useful information, as judged by humans, based on the hypothesis that this will result in aligned agents far beyond the capabilities of humans. A less powerful, but aligned, AI assistant can help humans understand and validate more complex debates.

Increasing human-AI communication bandwidth might also enable us to improve the shape of human feedback reward functions. Consider an RLHF setup where humans provide a sparse reward and explain why. The reward model learns to predict both the dense human feedback and the sparse reward. Then, when encountering a novel situation, it can take the intermediate step of predicting human feedback before predicting reward. If the human feedback contains information that is useful for predicting the reward (e.g. sentiment), then the human is effectively attaching their dense feedback to a sparse, explicit preference to communicate a higher bandwidth alignment signal.
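A minimal sketch of that two-stage reward model, assuming the dense human feedback has already been encoded into a fixed-size embedding; the module names, dimensions, and MSE losses are illustrative assumptions rather than a specific published setup.

```python
# Sketch of a reward model that predicts dense human feedback as an intermediate
# step before predicting the sparse reward. Dimensions and names are assumptions.
import torch
import torch.nn as nn

class FeedbackConditionedRewardModel(nn.Module):
    def __init__(self, d_obs: int = 768, d_feedback: int = 128):
        super().__init__()
        # Predicts an embedding of the human's free-form explanation.
        self.feedback_head = nn.Linear(d_obs, d_feedback)
        # Predicts the sparse reward, conditioned on the predicted feedback.
        self.reward_head = nn.Linear(d_obs + d_feedback, 1)

    def forward(self, obs_encoding: torch.Tensor):
        pred_feedback = self.feedback_head(obs_encoding)
        reward = self.reward_head(torch.cat([obs_encoding, pred_feedback], dim=-1))
        return pred_feedback, reward

model = FeedbackConditionedRewardModel()
obs = torch.randn(4, 768)              # e.g. pooled LM encodings of trajectories
target_feedback = torch.randn(4, 128)  # encoding of the human's explanation (dense signal)
target_reward = torch.randn(4, 1)      # sparse explicit preference

pred_feedback, pred_reward = model(obs)
loss = (nn.functional.mse_loss(pred_feedback, target_feedback)
        + nn.functional.mse_loss(pred_reward, target_reward))
loss.backward()
```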

Generalizing this to real time interactions with agents: Richer perception of humans during real time interactions may enable the reward model to better represent the shape of the reward function between sparse signals of explicit preference (e.g. the agent learns that a positive facial expression likely indicates progress towards an explicit reward), and a richer action space may enable the agent to learn more complex behavior to better communicate its intermediate progress towards a goal (e.g. a subtle confused facial expression prompts the human to clarify the goal). A potential downside to this approach is that the increased complexity of the environment makes it easier for the agent to “game” the reward model by finding actions that the reward model values but humans do not. Optimistically, the higher bandwidth feedback from humans makes the reward model robust against these scenarios.

How close are we to the collapse of the networked memory hierarchy overhang?

In our example, the AI streamed its perception to a data center, approximately simulated future perception and actions, and sent a representation back to the personal computer to speed up inference and increase accuracy.

Can a Transformer do this?[3]

Arguably, in-context learning is exactly what’s needed (i.e. using early context to improve the prediction of later tokens), but its inference time parallelism has not been exploited. That is, if a Transformer can use early context to improve prediction of later tokens, can it start computing over early context as soon as it’s available and use that output to apply less computation to later context while still achieving high accuracy?

Similar to split computing (splitting a model into head and tail partitions (head size < tail size) and executing them on the mobile device and edge server, respectively), consider an experiment dividing a pre-trained 6- or 12-layer GPT-2 into one large and one small model. The large model (4 or 10 layers) encodes most of a contextual window (112 tokens), and the small model (2 layers) uses this representation and the remainder of the contextual window (16 tokens) to predict next tokens in the remainder block (16 tokens). After training on WikiText-103, the 4-2 and 10-2 architectures achieve perplexities of 28.80 and 28.75, respectively. If we completely remove the large model and train the last two layers of a pre-trained 6-layer GPT-2 to predict next tokens in the remainder block using the full context (i.e. a 0-2 setup), we reach a perplexity of 30.76[4]. A pre-trained 6-layer GPT-2 where all layers have access to the full context reaches a perplexity of 20.37 after training to predict next tokens in the remainder block on WikiText-103. And as a final baseline, a pre-trained 6-layer GPT-2 where all layers have access to the full context reaches a perplexity of 20.82 after training to predict all next tokens in the full block on WikiText-103.
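For concreteness, here is a minimal, inference-only sketch of the 4-2 style split using off-the-shelf HuggingFace GPT-2 weights; it mirrors the 112/16 token split described above but is not the training code behind the reported perplexities.

```python
# A minimal sketch of the 4-2 style split described above, using HuggingFace
# GPT-2 weights. SPLIT, CTX, and REMAINDER mirror the numbers in the post, but
# this is illustrative inference-only code, not the reported training setup.
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("distilgpt2").eval()  # 6 transformer blocks
SPLIT, CTX, REMAINDER = 4, 112, 16  # "large" = blocks 0-3, "small" = blocks 4-5

ids = torch.randint(0, model.config.vocab_size, (1, CTX + REMAINDER))  # stand-in token ids
wte, wpe, blocks, ln_f = (model.transformer.wte, model.transformer.wpe,
                          model.transformer.h, model.transformer.ln_f)
hidden = wte(ids) + wpe(torch.arange(ids.size(1)))

with torch.no_grad():
    # "Data center": the large partition encodes the first 112 tokens.
    head_hidden = hidden[:, :CTX]
    for block in blocks[:SPLIT]:
        head_hidden = block(head_hidden)[0]

    # "Edge device": the small partition sees that representation plus the raw
    # embeddings of the 16 newest tokens and predicts the remainder block.
    tail_hidden = torch.cat([head_hidden, hidden[:, CTX:]], dim=1)
    for block in blocks[SPLIT:]:
        tail_hidden = block(tail_hidden)[0]

    logits = model.lm_head(ln_f(tail_hidden))[:, CTX - 1 : -1]  # predictions for the remainder tokens
```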

So, the pre-trained large model seems to be adapting to the task of providing useful information for the final two layers and we see a very slight improvement as we scale this large model independently of the small model, but we’re still pretty far from the performance of using all layers to predict next tokens in the remainder block.

However, given that in-context learning scales with model and context size, what happens when we scale this 82/124M-parameter model (DistilGPT-2/GPT-2) to 175B parameters (GPT-3), this 128-token context to 32K (GPT-4), and the 1GB dataset (WikiText-103) to 800GB (the Pile)? What happens if recurrent Transformers scale to support context sizes well beyond current limits, with layer re-use across multiple nodes of the heterogeneous network? This experiment is meant to establish a lower bound and communicate what saturating the networked memory hierarchy might look like in practice.

If this scales, it would be possible to finish execution on an edge device, combining the larger model’s output representation with the latest available tokens that the data center could not process due to latency. Since sequence trajectories diverge exponentially, these latest tokens available on the mobile/edge device are exponentially more valuable than older tokens, and the tail model can effectively correct the large model’s mispredictions.

But what is the large model actually doing to help the last two layers? Is it really simulating?

Beyond the traditional sense of in-context learning generalizing from clear cut in-context examples (e.g. few-shot learning, induction), could the loss of later tokens also be lower because the model is using space in early tokens’ residual streams to approximately represent features of multiple potential sequence trajectories in superposition[5] and simulate their abstracted dynamics forward a number of steps for later copying/​induction (limited by storage of exponentially diverging futures and network depth)?

From this Google Colab, we can see that HuggingFace’s GPT-2 small contains heads in later layers whose effects on the residual stream increase the logits of the token after the attention query (relative to the rest of the tokens in the sequence)[6] while exhibiting copying behavior via their OV circuits (positive eigenvalues). Moreover, when we replace the activation inputs to the highest-scoring head with the initial token embeddings, the head becomes the lowest-scoring head on this criterion. Thus, the head seems to be copying information that was written by a head from a previous layer, as opposed to information that existed in the embeddings from the beginning. And from a few attention pattern visualizations, we can see examples where the model could have stored multiple potential next tokens (or general abstractions) in a token’s residual stream and then copied them later in the sequence (i.e. an early head could potentially be saying, “let me use some extra space in this token’s residual stream, which won’t interfere with the unembedding, to write down information that I’m predicting can be copied/slightly altered by another head to predict a later token”).
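This is not the Colab’s exact code, but a rough TransformerLens sketch of the kind of replacement described above: it patches only the chosen head’s value inputs with the initial token embeddings (ignoring LayerNorm and the head’s query/key inputs for simplicity), and LAYER and HEAD are placeholder indices rather than the actual highest-scoring head.

```python
# Rough sketch (not the Colab's code) of replacing one head's inputs with the
# initial token embeddings, using TransformerLens. Only the value inputs are
# patched and LayerNorm is ignored, so this is a simplification; LAYER and HEAD
# are placeholder indices.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
LAYER, HEAD = 10, 7                                  # hypothetical "highest scoring" head
tokens = model.to_tokens("The quick brown fox jumps over the lazy dog")

_, cache = model.run_with_cache(tokens)
embed = cache["hook_embed"]                          # [batch, pos, d_model] token embeddings

def patch_value(v, hook):
    # v: [batch, pos, n_heads, d_head]; recompute this head's values from the
    # raw embeddings instead of the residual stream it normally reads from.
    v[:, :, HEAD, :] = embed @ model.W_V[LAYER, HEAD] + model.b_V[LAYER, HEAD]
    return v

patched_logits = model.run_with_hooks(
    tokens, fwd_hooks=[(f"blocks.{LAYER}.attn.hook_v", patch_value)]
)
clean_logits = model(tokens)
print((patched_logits - clean_logits).abs().max())   # how much the patch shifts the output
```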

If a Transformer can do this, can the same capability be used to compute an activation vector in the data center, based on slightly less context due to network latency, that, when fed to additional layers on the edge device along with the latest context, reduces the amount of computation needed to accurately predict next tokens?

So what can we do about overhang collapses?

Interpretability can help us identify overhangs and predict characteristics of collapses. The worst case is a large overhang collapse that we did not predict and prepare for ahead of time, while the best case is something akin to identifying unstable snow and setting off small explosives to trigger a controlled avalanche where no one gets hurt. In an ideal world, using interpretability techniques to identify capability overhangs and reasoning about their implications is easier than the research/​engineering work required to actually collapse the overhang. Then, responsible AI researchers who’ve identified overhangs will hopefully have bought themselves time to build infrastructure that prepares society for when they collapse the overhang (or release information that leads to the collapse of the overhang).

Building infrastructure can shape incentives to direct the application of new capabilities, like placing rocks in a dry riverbed to direct the water flow when a dam breaks. Consider the space of all possible capability applications and a surface defining their cost/​value ratios; infrastructure creates grooves that define the path of least resistance along this surface (i.e. which applications can create the most value for the least cost). Collaborating with AI to build this infrastructure will be a highly leveraged use of existing capabilities ahead of collapses, and those who build infrastructure for new capabilities can have significant influence on how those capabilities are applied (e.g. OpenAI’s ChatGPT plugins infrastructure with safety as a core principle).

Well, what kind of infrastructure can steer the applications of increased communication bandwidth?

Edge compute infrastructure could influence the shape of the consumer hardware landscape, and by extension, human-AI interaction. If we assume that it is exponentially harder to simulate one unit of time further into the future, then edge compute is exponentially more valuable than data center compute, where network performance is the exponential factor. A unit increase of network performance exponentially decreases edge compute’s value relative to data center compute value. e.g. let’s assume predicting t + 100ms into the future is twice as hard as predicting t ms into the future: 2**(roundTripLatency /​ 100ms) = edge compute value /​ data center compute value. With a 400ms round trip, it’s 16 times harder for the data center to achieve the same accuracy as the edge device; if we can halve to 200ms, it’s only 4 times as hard.

But networks have constraining physical limits and a GPU won’t fit in every edge device, so how can edge hardware adapt to optimize performance?

Consider a WiFi router-like device, but with a GPU, that wirelessly connects to and accelerates edge devices. Assuming that 1) predicting t + 100ms into the future is twice as hard as predicting t ms into the future, 2) the round trip between this device and all of the consumer’s edge devices is 10ms, and 3) the data center round trip is 110ms, then our new device’s computation is only 2x more valuable than the data center computation (2**((110 − 10) / 100)). But let’s say we want to simulate the future at a finer granularity, and it turns out it’s now twice as hard to predict each 10ms further into the future: our new device’s computation is now 1024x more valuable than the data center computation (2**((110 − 10) / 10)).
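The ratios above (and the 16x/4x figures from the earlier example) all fall out of the same one-line formula; a quick sanity check, with the function name chosen here for illustration:

```python
# Reproducing the back-of-the-envelope ratios above, under the stated assumption
# that predicting each `halving_horizon_ms` further into the future is twice as hard.
def edge_to_datacenter_value(extra_latency_ms: float, halving_horizon_ms: float) -> float:
    return 2 ** (extra_latency_ms / halving_horizon_ms)

print(edge_to_datacenter_value(400, 100))        # 16.0   (400ms round trip vs. local)
print(edge_to_datacenter_value(200, 100))        # 4.0
print(edge_to_datacenter_value(110 - 10, 100))   # 2.0    (GPU router vs. data center)
print(edge_to_datacenter_value(110 - 10, 10))    # 1024.0 (finer-grained simulation)
```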

If this is true, then the company that builds this infrastructure will have significant influence on integrations with consumer devices (or build vertically integrated consumer devices), and thus significant influence on human-AI interaction.

Infrastructure can increase end user alignment experimentation velocity (“alignment games”[7]).

As human-AI communication bandwidth increases, human-AI feedback will increasingly resemble human-human interactions, with more in-context interactions, demonstrations, and evaluations (as opposed to ranking model outputs). This high-dimensional alignment signal, coming from a small group or even a single human, can more sharply align AIs in specific directions; thus, as the cost per unit of value from human feedback decreases, an increasing share of alignment may be performed by decentralized groups of end users themselves. Infrastructure for both decentralized alignment (e.g. alignment consensus protocol, multi-user interface framework, service APIs) and centralized alignment (e.g. Scale AI) will influence how this increasing communication bandwidth is applied, with centralized alignment serving as a valuable base layer for diverse human feedback.

As a concrete example, consider end users aligning their own social feeds with natural feedback instead of companies updating recommendations in unintuitive ways based on sparse feedback such as view time, likes, comments, etc. Before ChatGPT’s demonstration of RLHF at scale, reinforcement learning was likely difficult to leverage in these recommendation stacks because it relied solely on this sparse, somewhat unconscious feedback. But with pre-trained language models, and future models that saturate the networked memory hierarchy to support larger observation/​action spaces, we could see applications that enable groups of end users to naturally and consciously interact with each other and AI to train a reward model that aligns their feed.

Infrastructure can increase the iteration speed and efficiency of human debate. This is possible to a certain extent with existing capabilities, e.g. dumping writings (with authors) into a vector DB and instructing an LLM to search over writings/authors and simulate debate. The AI can be aligned as a “critical reviewer/debate simulator” that is aware of all related arguments/evidence and helps humans understand each other’s perspectives. Add this together with an increasingly powerful alignment signal from a single human and you get something interesting: as someone with public writings, podcasts, etc., you can align your AI to interact with humans who want to talk to you and to help you find humans to interact with (e.g. your AI interacts with someone who it finds deeply understands your perspective and recommends that you consider interacting with them; you engage with this person and provide feedback to your AI on why this was a good recommendation).
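A rough sketch of the existing-capabilities version: the embedding model is a real sentence-transformers model, but call_llm() is a placeholder for whichever chat model is available, and the corpus and prompt are purely illustrative.

```python
# Rough sketch of the "vector DB + debate simulator" idea above. The embedding
# model is real (sentence-transformers); call_llm() is a placeholder for any
# chat model, and the corpus/prompt are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

corpus = [
    ("Author A", "Prediction markets aggregate information better than committees..."),
    ("Author B", "Committees allow deliberation that markets cannot price in..."),
]
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = embedder.encode([text for _, text in corpus], normalize_embeddings=True)

def call_llm(prompt: str) -> str:       # placeholder: swap in any chat model API
    raise NotImplementedError

def simulate_debate(question: str, k: int = 2) -> str:
    q_vec = embedder.encode([question], normalize_embeddings=True)[0]
    top = np.argsort(doc_vecs @ q_vec)[::-1][:k]          # cosine-similarity retrieval
    context = "\n\n".join(f"{corpus[i][0]}: {corpus[i][1]}" for i in top)
    prompt = (f"Using only the excerpts below, simulate a debate between the authors "
              f"on the question: {question}\n\n{context}\n\n"
              f"Act as a critical reviewer and summarize where they actually disagree.")
    return call_llm(prompt)
```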

But why does this matter?

Orthogonal to the problem of individual human intelligence amplification, there is the problem of intelligence discovery–i.e., society needs to actually discover and utilize the increased intelligence. People with large amounts of capital and social influence do not have the cognitive capacity to evaluate everyone’s potential value. Fundamentally, this is a communication problem leading to a sub-optimal utilization of human intelligence. So if we can direct increasing human-AI communication bandwidth toward increasing the efficiency of human communication and intelligence/​capital allocation, we can improve humans’ collective ability to solve alignment (and maybe even use the increased collective human intelligence to reduce the number of failures from short-term capitalism directing resources toward dangerous applications).

Stepping back from the specific networked memory hierarchy overhang collapse, how can we think about overhang collapses in general?

As the sizes of AI capability overhangs increase, so does the potential damage from their collapses: a relatively small change to an existing system floods the market with an increasingly large amount of new capability. The amount of resources and coordination required to collapse overhangs (e.g. building a system to saturate the networked memory hierarchy with a slightly modified pre-trained transformer) might not scale proportionally with their size, and AI-amplified individual intelligence could even lead to an inverse scaling. Consequently, the probability mass of who triggers an overhang collapse, eventually resulting in rapidly self-improving AGI, may shift towards smaller groups with fewer resources. If this is true, what else can we do to increase the likelihood of graceful overhang collapses, characterized by slow, less disruptive advancements in AI capabilities?

  1. ^

    However, the success of human in the loop systems could give humans a false sense of security such that they accelerate the use of AI for dangerous tasks, where the expected value of a negative outcome is higher (i.e. misalignment probability * task danger).

  2. ^

    To be clear, none of these techniques are guaranteed to align the strongest AIs, but they can help humans align weak AIs to assist with alignment research.

  3. ^

    I am not claiming that the following evidence proves a Transformer can do this. Rather, I am aiming to build a case of mechanistic plausibility that, when looked at in conjunction with potential effects, motivates further investigation into the potential of Transformers to accelerate AI’s saturation of the memory hierarchy.

  4. ^

    The same setup, but with the first two layers of the pre-trained GPT-2, achieves a perplexity of 30.84.

  5. ^

    Superposition is a phenomenon that allows models to represent more features than they have dimensions, scaling with feature sparsity (e.g. feature sparsity is natural for language because “most tokens don’t refer to Martin Luther King or aren’t part of a clause describing music”). This could make it much easier for the model to store/​simulate a large number of mutually exclusive possible futures in a smaller space.

  6. ^

Slightly different from Anthropic’s copying head evaluator: “Does the head’s direct effect on the residual stream increase the logits of the same token as the one being attended to?”

  7. ^

    This section is closely related to, and inspired by, Matthew Ball’s essay on cloud gaming and Massive Interactive Live Events (e.g. Twitch Plays Pokemon, Reddit Place).
