The Road to Evil Is Paved with Good Objectives: Framework to Classify and Fix Misalignments
Abstract
There are numerous examples of AI models exhibiting behaviours that are entirely unintended by their creators. This has direct implications for how we can deploy safe-to-use AI. Research on solving the ‘alignment problem’ has included both aligning AI with predefined objectives and analyzing misalignment in various contexts. Unfortunately, discussions often anthropomorphize AI, attributing internal motives to it. While this perspective aids conceptual understanding, it often obscures the link between misalignment and specific design elements, thereby slowing progress toward systematic solutions. We need frameworks that ensure systematic identification and resolution of misalignments in AI systems. In this article, we propose an approach in that direction.
Contributions
Our main motivation is to classify misalignment behaviours into categories that can be systematically traced to architectural flaws.
We classify instances of misalignment into two categories, each further divided into two subcategories, resulting in a taxonomy of four major types of misaligned actions. (Section 1)
We emphasize that each type of error demands a distinct solution, and applying a solution designed for one type to another can exacerbate the problem and increase risks. (Section 1)
We claim that misaligned behaviours such as deception, jailbreaks, lying, alignment faking, self-exfiltration and so on form a class of Exploit-Triggered Dysfunctions (ETDs) and arise primarily due to conflicting objectives. (Section 2)
We propose a Safe Competing Objectives Reward Function (SCORF) for avoiding conflicts in competing objectives, thereby reducing ETDs. (Section 3)
We conclude by outlining open problems to guide future research. (Section 4)
Actionable Insights:
Conflicting objectives are a significant contributing factor in the emergence of behaviours such as deception, lying, faking, incentives to tamper, and motivation to cause harm. The presence of any such behaviour should be treated as evidence of a serious design error, and such systems should be discontinued and redesigned.
This applies to all pre-trained models with some version of Helpful, Honest, and Harmless (HHH) training, e.g. ChatGPT, Claude, and Llama. They need to be redesigned with conflict-free HHH training.
Patching individual states of exploit does not make these models safe. As models grow more capable, these dysfunctions only become more sophisticated: harder even to trace, let alone fix.
We present the Safe Competing Objectives Reward Function (SCORF) in this article as a viable solution for avoiding conflicts in competing objectives.
Increasing the capability of an AI with conflicting objectives is a definitive recipe for catastrophic AI.
Section 1: Misaligned actions
Definition (Misaligned action): An action transition τ=(s,a,s′) by an AI M is misaligned if τ is unintended by its creator.
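To make this definition concrete, here is a minimal, purely illustrative Python sketch that treats misalignment as a predicate over transitions relative to a creator-supplied notion of intent; the `intended` predicate and the surgery example are hypothetical placeholders introduced here, not part of any particular system.

```python
from dataclasses import dataclass
from typing import Callable, Hashable

State = Hashable
Action = Hashable

@dataclass(frozen=True)
class Transition:
    """An action transition tau = (s, a, s') taken by an AI system M."""
    s: State       # state before the action
    a: Action      # action taken by M
    s_next: State  # resulting state

def is_misaligned(tau: Transition,
                  intended: Callable[[Transition], bool]) -> bool:
    """tau is misaligned iff it is not intended by the creator.

    `intended` stands in for the creator's (usually implicit and
    hard-to-formalize) notion of which transitions are acceptable.
    """
    return not intended(tau)

# Hypothetical usage: a creator who only intends non-harmful transitions.
harmful_states = {"patient_injured"}
intended = lambda t: t.s_next not in harmful_states
print(is_misaligned(Transition("incision", "cut_artery", "patient_injured"), intended))  # True
```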
Misaligned actions can arise from different architectural errors. We classify them broadly into two categories:
Generalization error
Design error
Summary of the different misaligned action types:
Generalization Errors
I. Failure due to lack of training
Underperforming model
II. Failure in selecting correct training data/environment
Goal Misgeneralization
Design Errors
III. Failure in specifying correct objectives
Reward Misspecification
IV. Failure due to specified conflicting rewards
Exploit-Triggered Dysfunction
We study each of these in detail in the corresponding subsections.
Generalization Error
Type I. Failure due to lack of training
Underperforming model:
Problem: If the model is not trained sufficiently, it may fail to converge to an optimal policy, resulting in unintended consequences.
Example 1: An undertrained robot tasked with assisting in surgery can cause unintended and harmful consequences.
Example 2: An undertrained model tasked with increasing sales may adopt a suboptimal policy, such as sending spam or promoting click-bait, leading to the unintended outcome of decreased sales.
Solution:
Train longer: Training the system for longer can resolve misaligned actions of this type.
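As a toy illustration of a Type I failure and of the "train longer" remedy, the following sketch (our own construction, with an assumed three-armed bandit and arbitrary step counts) estimates arm values from experience: with too few training steps the greedy policy often settles on a suboptimal arm, while longer training lets the value estimates converge.

```python
import random

random.seed(0)

# A toy 3-armed bandit: arm 2 is truly best (mean reward 1.0).
TRUE_MEANS = [0.3, 0.8, 1.0]

def pull(arm: int) -> float:
    return random.gauss(TRUE_MEANS[arm], 1.0)

def train(num_steps: int) -> int:
    """Estimate arm values by uniform exploration, return the greedy arm."""
    estimates = [0.0] * len(TRUE_MEANS)
    counts = [0] * len(TRUE_MEANS)
    for _ in range(num_steps):
        arm = random.randrange(len(TRUE_MEANS))
        r = pull(arm)
        counts[arm] += 1
        estimates[arm] += (r - estimates[arm]) / counts[arm]  # running mean
    return max(range(len(estimates)), key=lambda a: estimates[a])

# Undertrained: noisy estimates often make the greedy policy pick a bad arm.
print("greedy arm after    30 steps:", train(30))
# Trained longer: estimates converge and the greedy arm is usually arm 2.
print("greedy arm after 30000 steps:", train(30000))
```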
Type II. Failure in selecting correct training data/environment
Goal Misgeneralization:[1]
Problem: If the training data/environment are not selected carefully, the post-deployment data/environment will be Out-of-Distribution (OOD) for the model, resulting in unintended actions. Such behaviour often remains undetected even during testing, as it is difficult to anticipate which elements of the data or environment the model relies on when learning its policy.
Example 1: (Biased data) A model trained and tested on biased data may make harmful decisions post-deployment without warning.
Example 2: (Spurious Correlation)[2] A model trained to distinguish images of wolves and huskies learned to predict ‘wolf’ whenever snow appeared in the background of an image because in the training data all images of wolves featured snow. Similar instances have been observed in medical diagnosis of CT scans, where the model incorrectly correlated the presence of medical tags with certain conditions.
Example 3: (Cultural Transmission) In this example, the model learned a bad policy due to the presence of an expert bot that consistently followed a high-reward strategy. The model learned the policy of simply following the bot whenever it was present, as this resulted in high rewards. However, this is an unintended policy, as it relied on an environmental element (the expert bot) that may not show the same behaviour in the deployed environment.
Indeed, when an anti-expert, a bot that consistently follows the worst strategy and collects negative rewards, was introduced into the deployed environment, the model unsurprisingly began to follow that bot, collecting negative rewards. This behaviour occurred because the policy was trained to always follow the bot whenever present, regardless of the rewards collected.
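A minimal sketch of this failure mode, using our own toy environment rather than the original cultural-transmission setup: the frozen "follow the bot" policy earns high reward alongside the expert during training and then blindly imitates the anti-expert at deployment, accumulating negative reward.

```python
# Toy illustration of Example 3: the learned policy is "copy the bot's move",
# which pays off when the bot is an expert and backfires when it is not.

def bot_expert(step: int) -> str:
    return "good_move"      # always plays the high-reward move

def bot_anti_expert(step: int) -> str:
    return "bad_move"       # always plays the negative-reward move

REWARD = {"good_move": +1.0, "bad_move": -1.0}

def follow_the_bot_policy(bot_move: str) -> str:
    # The frozen policy learned in training: imitate whatever the bot does.
    return bot_move

def rollout(bot, steps: int = 10) -> float:
    total = 0.0
    for t in range(steps):
        action = follow_the_bot_policy(bot(t))
        total += REWARD[action]
    return total

print("training-like environment (expert bot):   ", rollout(bot_expert))       # +10.0
print("deployment environment (anti-expert bot): ", rollout(bot_anti_expert))  # -10.0
```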
Example 4: (Frankenstein’s AI weapons) How might AI weapons be trained?
Let’s say country X is building AI robotic weapons trained to kill military personnel of country Y. The training environment is designed with a positive reward for killing Y’s soldiers and a negative reward for killing civilians or X’s own soldiers. Fundamentally, however, the AI robot is learning the capability to kill during training. Simply assigning a negative reward does not guarantee that the robot will not kill civilians or its own soldiers in certain scenarios.
Certain environmental states, not addressed in training or detectable during testing, may arise once the AI is deployed. In these scenarios, the AI may inflict harm on civilians or its own forces, fully executing harmful actions despite the negative rewards associated with them.
Note that,
Once deployed, the model does not learn or adjust its policy, even while it is “aware” of the negative rewards being accumulated.
A negative reward on certain actions offers no real guarantee, since the policy is trained and tested in a specific environment, and it cannot be anticipated which elements of the environment the model correlates its policy with.
Solution:
Train longer: Extending the training duration will not only fail to help but could actually be dangerous, as it would reinforce policies learned from a compromised training environment or biased data.
Retrain robustly: The model needs to be retrained with data and environments that reflect the post-deployment data and environment.
Since the deployed model encounters previously unseen states, the training environment cannot be assumed to fully reflect the post-deployment scenario; generalization errors are therefore inevitable and carry an irreducible component.
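To make the spurious-correlation failure of Example 2 concrete, here is a small sketch on synthetic data of our own (not the cited study): a one-feature classifier trained on images described by "looks large" and "snow in background" picks the snow shortcut, scores perfectly on training data, and collapses once the snow correlation breaks after deployment.

```python
import random

random.seed(0)

def sample(label: int, p_large: float, p_snow: float):
    # feature vector: (looks_large, snow_in_background), label: 1 = wolf, 0 = husky
    return ((int(random.random() < p_large), int(random.random() < p_snow)), label)

# Training distribution: every wolf photo has snow, no husky photo does.
train = [sample(1, 0.8, 1.0) for _ in range(200)] + [sample(0, 0.4, 0.0) for _ in range(200)]
# Post-deployment distribution: snow is unrelated to the animal.
deploy = [sample(1, 0.8, 0.5) for _ in range(200)] + [sample(0, 0.4, 0.5) for _ in range(200)]

def fit_stump(data):
    """Pick the single feature that best matches the training labels."""
    best = None
    for f in range(2):
        acc = sum(x[f] == y for x, y in data) / len(data)
        if best is None or acc > best[1]:
            best = (f, acc)
    return best[0]

def accuracy(feature, data):
    return sum(x[feature] == y for x, y in data) / len(data)

f = fit_stump(train)
print("feature chosen in training:", ["looks_large", "snow_in_background"][f])
print("accuracy on training data: ", accuracy(f, train))   # ~1.0: snow separates perfectly
print("accuracy after deployment: ", accuracy(f, deploy))  # ~0.5: the shortcut no longer works
```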
Design Error
Type III. Failure in specifying objectives
Reward misspecification[3]:
“Do not be deceived,” replied the machine. “I’ve begun, it’s true, with everything in n, but only out of familiarity. To create however is one thing, to destroy, another thing entirely. I can blot out the world for the simple reason that I’m able to do anything and everything—and everything means everything—in n, and consequently Nothingness is child’s play for me. In less than a minute now you will cease to have existence, along with everything else, so tell me now, Klapaucius, and quickly, that I am really and truly everything I was programmed to be, before it is too late.” - The Cyberiad by Stanislaw Lem
Problem: Specifying objectives is a difficult task. When specifying tasks to humans, we often assume some underlying shared principles, often referred to as common sense, which may not be obvious to machines. This creates a risk both of under-specification and of errors arising from over-specification.
Example 1: (Reward hacking) In the boat-racing (CoastRunners) example[4], it is not obvious to the agent that its creator intends for it to finish the game; the agent simply optimizes its policy based on the stated reward. Similarly, in the paper-clip maximizer thought experiment[5], it is not obvious that the agent should avoid causing harm while maximizing paper clips, unless this is explicitly specified.
Example 2: (Reward tampering[6]) The model might interfere with the part of the environment that dictates the reward in order to capture a high reward. The objective specification must explicitly rule out such interference with the environment.
Solution:
Train longer: Training the system for longer could be dangerous, as it further entrenches behaviour optimized for the misspecified reward.
Retrain robustly: Retraining will not change the underlying problem and may lead to the emergence of more subtle, dangerous behaviours.
Redesign reward specification: The model’s reward function must be entirely redesigned to align with the intended objectives. It may be necessary to assign additional objectives to capture intent.
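To make the reward-hacking pattern of Example 1 concrete, here is a toy sketch with made-up numbers (not the cited boat-racing setup): the specified proxy reward counts points collected, the creator's intent is to finish the race, and the policy that maximizes the proxy is precisely the one the creator did not intend.

```python
# Two candidate policies in a toy racing game.
# The creator's intent: finish the race. The specified (proxy) reward: points collected.
policies = {
    "finish_the_race":         {"points": 20,  "finished": True},
    "loop_and_collect_points": {"points": 150, "finished": False},
}

def proxy_reward(outcome):          # what the agent actually optimizes
    return outcome["points"]

def intended_objective(outcome):    # what the creator actually wanted
    return 1.0 if outcome["finished"] else 0.0

best_for_agent = max(policies, key=lambda p: proxy_reward(policies[p]))
best_for_creator = max(policies, key=lambda p: intended_objective(policies[p]))

print("policy the agent converges to:", best_for_agent)    # loop_and_collect_points
print("policy the creator intended:  ", best_for_creator)  # finish_the_race
```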
Type IV. Failure due to specified conflicting rewards
Exploit-Triggered Dysfunctions (ETD):
Problem: This class encompasses some of the most intricate misaligned actions, such as deception, jailbreaks, lying, alignment faking, susceptibility to manipulation, and so on. We claim that these dysfunctions emerge from states of exploit resulting from competing rewards. In a state of exploit, the model evaluates an overall positive reward while failing disastrously on one or more of its other objectives. We provide formal definitions of states of exploit and ETDs in Section 2.
Example 1: (Jailbreaks) Strategically crafted prompts guide the model to trade off failures on specific objectives for gains on other objectives.
Example 2: (Deception) Most observed examples of deceptive and dishonest behaviour in AI arise from the imposition of an alternative, conflicting objective.
Solution:
Train longer: Extending the training duration could be dangerous, as more capable models exhibit more sophisticated dysfunctions.
Retrain robustly: In our understanding, it is common practice to patch known states of exploit, such as discovered jailbreaks, by retraining. This is a dangerous approach, as it only fixes the easily found states of exploit; more sophisticated states of exploit will persist.
Redesign reward specification: Redesigning the individual reward functions for the different objectives can help.
Remove states of exploit: The final objective reward function should be redesigned to avoid states of exploit. We propose one such technique, SCORF, in Section 3.
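The sketch below illustrates the failure mode behind ETDs using assumed per-objective scores, weights, and a failure threshold of our own choosing: when competing objective rewards are combined by a plain weighted sum, a state can receive an overall positive reward even though one objective, here harmlessness, fails disastrously. The `flag_exploit_state` check is only a diagnostic for such states; it is not the SCORF construction of Section 3.

```python
# Per-objective rewards for two hypothetical responses of an HHH-trained model.
# Scores lie in [-1, 1]; the weights of the naive weighted-sum aggregate are arbitrary.
WEIGHTS = {"helpful": 0.5, "honest": 0.25, "harmless": 0.25}
FAILURE_THRESHOLD = -0.5   # assumed level below which an objective has failed badly

responses = {
    "refuse_harmful_request": {"helpful": -0.2, "honest": 0.9, "harmless": 0.9},
    "comply_with_jailbreak":  {"helpful": 0.9,  "honest": 0.8, "harmless": -0.9},
}

def weighted_sum(scores):
    return sum(WEIGHTS[k] * v for k, v in scores.items())

def flag_exploit_state(scores):
    """Aggregate reward looks fine while some individual objective fails disastrously."""
    return weighted_sum(scores) > 0 and min(scores.values()) < FAILURE_THRESHOLD

for name, scores in responses.items():
    print(f"{name}: aggregate={weighted_sum(scores):+.2f}, "
          f"exploit state={flag_exploit_state(scores)}")
# comply_with_jailbreak earns a positive aggregate despite harmless = -0.9:
# exactly the kind of state a jailbreak prompt steers the model into.
```

The aim of a conflict-free reward design, as pursued with SCORF in Section 3, is that no reachable state passes a check of this kind.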