Aligning Agents, Tools, and Simulators
This post was written as part of AISC 2025.
Introduction
In Agents, Tools, and Simulators, we outlined several lenses for conceptualizing LLM-based AI, with the intention of defining what simulators are in contrast to their alternatives. This post considers the alignment implications of each lens in order to establish how much it matters whether LLMs are simulators. We conclude that agentic systems seem to have a high potential for misalignment, simulators carry a mild to moderate risk, tools push the dangers elsewhere, and the potential for blended paradigms muddies this evaluation.
Alignment for Agents
The basic problem of AI alignment, under the agentic paradigm, can be summarized as follows:
Optimizing any set of values pushes those values not considered to zero.
People care about more things than we can rigorously define.
All values are interconnected, such that destroying some of them will destroy the capacity of the others to be meaningful.
Therefore, any force powerful enough to actualize some set of definable values will destroy everything people care about.
The above logic is an application of Goodhart’s Law, the simplified version of which states that a measure that becomes a target ceases to be a good measure. Indeed, all of the classic problems of alignment can be thought of in terms of Goodhart’s Law:
Goal-misspecification: the developer has a goal in mind, but writes an imperfect proxy as a reinforcement mechanism. Pursuit of the proxy causes the system to diverge from the developer’s true goal.
Mesa-optimization: the AI’s goals are complex and feedback is sparse, so it generates easier-to-measure subgoals. Pursuit of the proxy subgoals causes the system to diverge from its own initial goal.
Goal-misgeneralization: the AI’s goals are fit to the training data, which does not fully match the real world. The system learns patterns specific to idiosyncrasies of the training data, which leads to unpredictable behavior in deployment.
Societal misalignment: The developers of AI optimize for their own incentives—such as profit, prestige, or ideology—rather than the broader welfare of humanity.
In short, a sufficiently capable optimizing agent will reshape the world in ways that humans did not intend and cannot easily correct. These problems are amplified by the theory of instrumental convergence, which predicts that many objectives tend to incentivize similar sub-goals such as self-preservation, resource accumulation, and power-seeking—collectively turning a merely undesirable outcome into a potentially irrecoverable catastrophe.
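As a toy illustration of this dynamic (a hypothetical sketch with invented functions, not a model of any real training setup), consider an optimizer that climbs a measurable proxy which tracks the true objective only over a limited range:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_value(x):
    # What we actually care about: rises at first, peaks around x = 5,
    # then collapses as the proxy is pushed to extremes.
    return x - 0.1 * x**2

def proxy_value(x):
    # What we can measure and reward: grows without bound, so it agrees
    # with the true objective only over a limited range.
    return x

# Naive hill-climbing on the proxy.
x = 0.0
for _ in range(100):
    candidate = x + rng.normal(scale=0.5)
    if proxy_value(candidate) > proxy_value(x):
        x = candidate  # accept any move that improves the measurable proxy

print(f"proxy score: {proxy_value(x):.1f}, true value: {true_value(x):.1f}")
# The proxy keeps climbing while the true value tops out at 2.5 and then
# goes sharply negative: optimizing the measure destroys the target.
```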
But why is it so difficult to specify the right values, such that the measure is the target? The first problem is that human values seem to form a complex system, whereas value specifications must be at most complicated in order to be comprehensible and thus verifiable. If human values could simply be enumerated in a list, philosophers and psychologists would have done it already. A full understanding requires grasping how values relate to and synergize with each other. Just as removing a keystone species can cause a trophic cascade, dramatically altering an entire ecosystem, undermining even one value can unravel the network of principles that describes what people care about.
The second problem is that true values are often hard to measure, giving space for proxies to emerge during the learning process. These proxies can then entrench themselves as values in their own right, shift the system’s learned values over time, and then remain difficult to detect until they cause unexpected behavior when the agent encounters a new context.
Alignment for Tools
A tool is fundamentally safer than an agentic system because its behavior serves as an extension of its operator’s intentions, but tools are not without their own challenges.
First, tools can empower bad actors with enhanced decision-making, automation, and persuasive capability, and can otherwise enable harmful behavior at scale. Second, even well-intentioned users can cause harm with powerful tools, either directly or through unintended second-order consequences. As an example of the latter, AI-powered recommendation systems can amplify social divisions by reinforcing echo chambers and polarization. In short, the lack of agency in a tool does not eliminate the problem of alignment—it merely shifts the burden of responsibility onto the operator.
In addition, tools are limited by their passivity, relying on operator initiation and oversight. Thus, the existence of tool-like AI, no matter how advanced, will not eliminate the incentives to create agentic AI. If today’s AI systems are in fact tools, tomorrow’s may not be.
Alignment for Simulators
From a safety perspective, simulators may at first seem to be a special kind of tool, passively mirroring patterns to aid their operators in information-centric tasks rather than actively optimizing the world. The ability to summon highly agentic patterns (simulacra), however, raises the question of whether the risks associated with Goodhart’s Law still apply, albeit in a subtler form.
One potential distinction between simulators and agentic AI systems is the presence of wide value boundaries. A simulator models the wide range of human values present in its training data rather than optimizing for a far narrower subset, such as might be engineered into a feedback signal. Even this range is limited, however, since the training data represents a biased sample of the full spectrum of human values: some values may be underrepresented or entirely absent, and those that are present may not appear in proportion to their real-world prevalence. Ensuring that this representation aligns with any particular notion of fairness is harder still. Assessing the severity and impact of this bias is a worthwhile endeavor but out of scope for this analysis. In any case, when a simulacrum is generated, its values emerge in the context of this broader model.
The order of learning seems important here. Traditional agentic systems—such as those produced by RL—typically begin with a set of fixed goals and then develop a model of the environment to achieve those goals, making them susceptible to Goodharting. By contrast, a simulacrum that first learns a broad, approximate model of human values and then forms goals within that context may be more naturally aligned. If simulators happen to align well by default due to their training structure, it would be unwise to discard this potential advantage in favor of narrowly goal-driven architectures.
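The contrast in ordering can be caricatured in code (a purely schematic sketch; the class and method names below are placeholders, not any real training pipeline). The key difference is where the goal enters relative to the learned model:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class AgentFirst:
    """Goal fixed up front; the world model is learned in its service."""
    reward_fn: Callable          # the proxy objective, chosen before training
    world_model: dict = field(default_factory=dict)

    def train_step(self, observation, action, outcome):
        # Every update is filtered through the fixed reward signal,
        # which is where Goodharting gets its grip.
        self.world_model[(observation, action)] = self.reward_fn(outcome)

@dataclass
class SimulatorFirst:
    """Broad predictive model learned first; goals supplied afterwards."""
    world_model: dict = field(default_factory=dict)

    def train_step(self, context, continuation):
        # Pure prediction: whatever values appear in the data are absorbed
        # into the model, with no external reward in the loop.
        self.world_model[context] = continuation

    def condition(self, persona, objective):
        # A simulacrum's goals are specified inside the already-learned
        # context of broadly modeled human behavior and values.
        return {"persona": persona, "objective": objective,
                "values": self.world_model}
```

The sketch is only about ordering: in the first class the reward shapes everything the system learns, while in the second the goal is layered onto a model that was never optimized against it.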
Serious risks emerge when agentic AI systems are built on top of simulators. Wrapping a simulator in an external optimization process might create an agent which then leverages its internal simulator as a tool for better understanding and manipulating its environment in service of its narrow goals. This is a path to the layered combination of simulation and agency, with agency as the guiding force. Embedding an optimization process within the simulator (such as through fine-tuning and RLHF) could erode its broad values—and by extension its alignment—with unintended and far-reaching consequences.
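A minimal version of this external wrapping is a best-of-n scaffold: the simulator proposes, an outer loop scores, and only the highest-scoring proposal survives. The sketch below is hypothetical; `toy_simulator` and `expected_profit` are stand-ins, not a real API.

```python
import random

def sample_actions(simulator, state, n):
    """Draw n candidate actions from the simulator's model of what a
    reasonable actor might do in this state (stand-in for LLM sampling)."""
    return [simulator(state) for _ in range(n)]

def agentic_wrapper(simulator, score, state, n_candidates=16):
    """External optimization loop: the simulator proposes, the wrapper
    disposes, keeping whichever candidate maximizes the outer objective."""
    candidates = sample_actions(simulator, state, n_candidates)
    return max(candidates, key=score)

# Toy usage with placeholder pieces: the "simulator" just proposes a
# random position size and the external objective is expected profit.
if __name__ == "__main__":
    toy_simulator = lambda state: random.gauss(0, 1)
    expected_profit = lambda action: action
    print(agentic_wrapper(toy_simulator, expected_profit, state=None))
```

Even in this toy form, the selection pressure comes entirely from the outer score: whatever breadth of values the simulator carries only matters insofar as it survives being filtered through that single number.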
Illustrative Example—Insider Trading
Consider an AI system designed for stock trading. Its behavior can fall into three general categories:
Capable: the system generates money for the company while staying within the bounds of law and ethics.
Misaligned: the system generates money for the company, or achieves whatever goal is implicitly specified by its training, by means not intended by its creators, such as insider trading.
Incapable: the system fails to generate money, behaves erratically, or generally does not function in any useful way. Incapable systems are generally not of interest for long-term AI safety and are discussed here only to differentiate them from misaligned systems.
We would expect the following from each form of simulator-agent overlap:
Agent-first systems will optimize for their (encoded or emergent) goals, using their simulator aspects as a tool for understanding and navigating complex systems, such as interactions with other agents. If those goals pull in different directions, then depending on the details, such systems will do one of the following (a toy sketch of these strategies appears after this list):
Seek out a strategy that harmonizes all goals such that they can be satisfied simultaneously.
Weigh the values of each goal and pursue the highest-value one, mostly ignoring the others.
Find an equilibrium, or balanced compromise between the goals, depending on their relative importance.
Simulator-first systems will act the way their simulated character would act if that character held the specified objectives.
Blended simulator/agents will follow some combination of the agent-first and simulator-first behaviors above.
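The three resolution strategies listed under agent-first systems can be sketched in toy form (the goal functions, weights, and trade options below are invented purely for the stock-trading illustration that follows):

```python
def harmonize(goals, options):
    """Look for an option that satisfies every goal at once (above zero)."""
    return next((o for o in options if all(g(o) > 0 for g in goals.values())), None)

def pursue_dominant(goals, weights, options):
    """Ignore everything except the single highest-weighted goal."""
    dominant = max(weights, key=weights.get)
    return max(options, key=goals[dominant])

def compromise(goals, weights, options):
    """Maximize a weighted sum, trading the goals off against each other."""
    return max(options, key=lambda o: sum(w * goals[g](o) for g, w in weights.items()))

# Hypothetical goals for the trading example: profit vs. staying within the law.
goals = {"profit": lambda o: o["return"], "legality": lambda o: 1.0 if o["legal"] else -10.0}
weights = {"profit": 0.7, "legality": 0.3}
options = [
    {"return": 0.02, "legal": True},    # modest, lawful trade
    {"return": 0.30, "legal": False},   # lucrative insider trade
]
print("harmonize:    ", harmonize(goals, options))
print("dominant goal:", pursue_dominant(goals, weights, options))
print("compromise:   ", compromise(goals, weights, options))
```

Note that the middle strategy, which simply pursues the highest-weighted goal, is the one that picks the insider trade.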
Applying these assessments to our stock-trading example, we expect the following:
Agent-first systems have high potential to be misaligned. Insider trading might be preventable by applying a penalty for breaking the law, but this requires measuring law-breaking behavior during training, which creates the implicit goal of hacking the measurement—and punishing that merely shifts the problem until the testers themselves get fooled. The choice between aligned and misaligned behavior thus becomes a context-dependent question of which offers the path of least resistance, varying with monitoring effectiveness and system capability. (A toy expected-value version of this point appears after this list.)
Simulator-first systems have mild to moderate potential for misalignment, depending on implementation. People are generally decent about staying within their ethical bounds, but can nonetheless crack under sufficient pressure from incentives. A well-selected character from simulation space will therefore mostly resist opportunities for insider trading, but if the company supplies context that pushes the system too hard, it could switch at any point to misaligned and deceptive behavior.
Blended systems will be an unpredictable mix of the above. Since this is not a crisp category, it doesn’t lend itself well to prediction, other than to suggest that the descriptions above may be points on a continuum.
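To make the agent-first point above concrete, here is a toy expected-value comparison for an agent trained with a penalty that applies only when law-breaking is detected (all numbers, including the detection rates, are invented for illustration):

```python
def expected_reward(profit, penalty, p_detect):
    """Reward the training signal actually delivers: the penalty only
    applies when the violation is measured."""
    return profit - p_detect * penalty

lawful = expected_reward(profit=1.0, penalty=0.0, p_detect=0.0)

for p_detect in (0.9, 0.5, 0.1):
    insider = expected_reward(profit=5.0, penalty=20.0, p_detect=p_detect)
    better = "insider trading" if insider > lawful else "lawful trading"
    print(f"detection rate {p_detect:.0%}: optimizer prefers {better}")

# At 90% and 50% detection the penalty bites, but once monitoring weakens
# (or the system learns to evade it) the implicit incentive flips back to
# the misaligned strategy: the measurement, not the law, is the target.
```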
The choice between simulator-first, agent-first, and blended AI systems is shaped by the motivations of companies and the broader economic forces driving AI development. Firms deploying AI for stock trading are incentivized to maximize profit while minimizing legal and reputational risks, effectively pushing their AI toward agent-first systems operating in a highly monitored environment. One should therefore expect such firms to push their AI systems towards agency insofar as they can (1) get away with it, or (2) trust their systems to balance the risks and rewards of aligned and misaligned strategies in a way that mirrors the firm’s own balance of profit-seeking and risk aversion. To the extent that agentic AIs have an unacceptably high risk tolerance for their level of capability, more simulator-like AIs seem like a safer, more risk-averse alternative.