Research Engineering Intern at the Center for AI Safety. Helping to write the AI Safety Newsletter. Studying CS and Economics at the University of Southern California, and running an AI safety club there. Previously worked at AI Impacts and with Lionel Levine and Collin Burns on calibration for Detecting Latent Knowledge Without Supervision.
aogara
Having a definition of Deceptive Alignment that captured every dangerous behaviour related to deception wouldn’t be very useful. We can use “deception” on its own to refer to this set of cases, and reserve terms like Strategic Deception and Deceptive Alignment for subclasses of deception, ideally subclasses that meaningfully narrow the solution space for detection and prevention.
Fully agreed. Focusing on clean subproblems is important for making progress.
Detecting instances of models that share these properties will likely involve using many of the tools and techniques that would be applied to more canonical forms of deceptive alignment (e.g. evals that attempt to alter/hamstring a model and measure behaviour in a plethora of settings, interpretability).
Though, as you mention, preventing/fixing models that exhibit these properties may involve different solutions, and somewhat crude changes to the training signal may be sufficient for preventing strategic sycophancy (though by doing so you might end up with strategic deception towards some other Misaligned goal).
Yeah I would usually expect strategic deception to be better addressed by changing the reward function, as training is simply the standard way to get models to do anything, and there’s no particular reason why you couldn’t fix strategic deception with additional training. Interpretability techniques and other unproven methods seem more valuable if there are problems that cannot be easily addressed via additional training.
Thanks! First response makes sense, there’s a lot of different ways you could cut it.
On the question of non-strategic, non-intentional deception, I agree that deceptive alignment is much more concerning in the medium term. But suppose that we develop techniques for making models honest. If mechanistic interpretability, unsupervised knowledge detection, or another approach to ELK pans out, we’ll have models which reliably do what they believe is best according to their designer’s goals. What major risks might emerge at that point?
Like an honest AI, humans will often only do what they consciously believe is morally right. Yet the CEOs of tobacco and oil companies believe that their work is morally justified. Soldiers on both sides of a battlefield will believe they’re on the side of justice. Scientists often advance dangerous technologies in the names of truth and progress. Sometimes, these people are cynical, pursuing their self-interest even if they believe it’s immoral. But many believe they are doing the right thing. How do we explain that?
These are not cases of deception, but rather self-deception. These individuals operate in an environment where certain beliefs are advantageous. You will not become the CEO of a tobacco company or a leading military commander if you don’t believe your cause is justified. Even if everyone is perfectly honest about their own beliefs and only pursues what they believe is normatively right, the selection pressure from the environment is so strong that many powerful people will end up with harmful false beliefs.
Even if we build honest AI systems, they could be vulnerable to self-deception encouraged by environmental selection pressure. This is a longer term concern, and the first goal should be to build honest AI systems. But it’s important to keep in mind the problems that would not be solved by honesty alone.
Separately, it seems that deception which is not strategic or intentional but is consistently produced by the training objective is also important. Considering cases like Paul Christiano’s robot hand that learned to deceive human feedback and Ofria’s evolutionary agents that learned to alter their behavior during evaluation, it seems that AI systems can learn to systematically deceive human oversight without being aware of their strategy. In the future, we might see powerful foundation models which are honestly convinced that giving them power in the real world is the best way to achieve their designers’ intentions. This belief might be false but evolutionarily useful, making these models disproportionately likely to gain power. This case would not be called “strategic deception” or “deceptive alignment” if you require intentionality, but it seems very important to prevent.
Overall I think it’s very difficult to come up with clean taxonomies of AI deception. I spent >100 hours thinking and writing about this in advance of Park et al 2023 and my Hoodwinked paper, and ultimately we ended up steering clear of taxonomies because we didn’t have a strong taxonomy that we could defend. Ward et al 2023 formalize a concrete notion of deception, but they also ignore the unintentional deception discussed above. The Stanford Encyclopedia of Philosophy takes 17,000 words to explain that philosophers don’t agree on definitions of lying and deception. Without rigorous formal definitions, I still think it’s important to communicate the broad strokes of these ideas publicly, but I’d lean towards readily admitting the messiness of our various definitions of deception.
How would you characterize strategic sycophancy? Assume that during RLHF a language model is rewarded for mimicking the beliefs of its conversational partner, and therefore the model intelligently learns to predict the conversational partner’s beliefs and mimic them. But upon reflection, the conversational partner and AI developers would prefer that the model report its beliefs honestly.
Under the current taxonomy, this would seemingly be classified as deceptive alignment. The AI’s goals are misaligned with the designer’s intentions, and it uses strategic deception to achieve them. But sycophancy doesn’t include many of the ideas commonly associated with deceptive alignment, such as situational awareness and a difference in behavior between train and test time. Sycophancy can be solved by changing the training signal to not incentivize sycophancy, whereas the hardest to fix forms of deceptive alignment cannot be eliminated by changing the training signal.
It seems like the most concerning forms of deceptive alignment include stipulations about situational awareness and the idea that the behavior cannot necessarily be fixed by changing the training objective.
AISN #22: The Landscape of US AI Legislation - Hearings, Frameworks, Bills, and Laws
MLSN: #10 Adversarial Attacks Against Language and Vision Models, Improving LLM Honesty, and Tracing the Influence of LLM Training Data
AISN #21: Google DeepMind’s GPT-4 Competitor, Military Investments in Autonomous Drones, The UK AI Safety Summit, and Case Studies in AI Policy
I wouldn’t recommend open sourcing any state of the art LLM agents. But if you open source the evaluation, that would provide most of the benefits (letting labs evaluate their models on your benchmark, helping people build new benchmarks, allowing researchers to build safer agents which reject dangerous actions) while avoiding the capabilities externalities of open sourcing a SOTA language agent.
This is cool! I’d be interested to see a fully reproducible version where anybody can download and run the evaluation. Right now, it’s hard to use this methodology to compare the abilities of different language models or scaffolding systems because the results are significantly driven by human prompting skills. To compare the abilities of different models and track progress towards language agents over time, you’d want to hold the evaluation constant and only vary the models and scaffolding structures. This might also be a more accurate analogy to models which are trying to survive and spread without human assistance along the way.
To turn this into a fully reproducible evaluation, you’d need to standardize the initial prompt and allow the agent to interact with the environment without additional prompting or assistance from a human operator. (A human overseer could still veto any dangerous actions.) The agent might get stuck at an intermediate step of the task, so you’d want to decompose each task into many individual steps and check whether the agent can pass each step individually, assuming previous steps were performed correctly. You could create many different scenarios to ensure that idiosyncratic factors aren’t dominating the results—for example, maybe the model is bad at AWS, but would perform much better with Azure or GCP. This benchmark runs this kind of automatic evaluation, but it doesn’t test many interesting tasks related to the survive and spread threat model.
Kudos for doing independent replication work, reproducibility is really important for allowing other people to build on existing work.
Very informative. You ignore inference in your cost breakdown, saying:
Other possible costs, such as providing ChatGPT for free, would have been much smaller.
But Semianalysis says: “More importantly, inference costs far exceed training costs when deploying a model at any reasonable scale. In fact, the costs to inference ChatGPT exceed the training costs on a weekly basis.” Why the discrepancy?
AISN #20: LLM Proliferation, AI Deception, and Continuing Drivers of AI Capabilities
Thanks for the heads up. The web UI has a few other bugs too—you’ll notice that it also doesn’t display your actions after you choose them, and occasionally it even deletes messages that have previously been shown. This was my first React project and I didn’t spend enough time to fix all of the bugs in the web interface. I won’t release any data from the web UI until/unless it’s fixed.
The Python implementation of the CLI and GPT agents is much cleaner, with ~1000 fewer lines of code. If you download and play the game from the command line, you’ll find that it doesn’t have any of these issues. This is where I built the game, tested it, and collected all of the data in the paper.
The option to vote to banish themselves is deliberately left open. You can view the results as a basic competence test of whether the agents understand the point of the game and play well. Generally smaller models vote for themselves more often. Data here.
Hoodwinked: Evaluating Deception Capabilities in Large Language Models
AISN #19: US-China Competition on AI Chips, Measuring Language Agent Developments, Economic Analysis of Language Model Propaganda, and White House AI Cyber Challenge
Relevant debate on how WebGPT could help or hurt safety: https://www.lesswrong.com/posts/3S4nyoNEEuvNsbXt8/common-misconceptions-about-openai?commentId=rAz2vpaTQEBMedkHc
Hey, I’m running a skilling up program for the AI Safety club at my school. Would you be able to share a GitHub repo with any of the lessons from ARENA? No need to prepare something new, just wondering if you have anything easily available. Thanks!
AISN #18: Challenges of Reinforcement Learning from Human Feedback, Microsoft’s Security Breach, and Conceptual Research on AI Safety
That all sounds reasonable. I do think the other case is interesting, when you can verify but not generate the target answer. There are some methods for finding passwords / generating adversarial prompts which induce behaviors that can be verified by a classifier, but they’re weak, and it’d be great to have stronger attacks to prevent sandbagging and improve adversarial training.
If I understand correctly, you’re concerned these methods will work in cases where we can generate high quality answers ourselves, but might fail in settings where we can merely verify good answers.
This is an important gap for many techniques to bridge, and I do think existing methods will perform better in settings where we can generate high quality answers ourselves. But I’d make two points. First, there are many powerful and dangerous capabilities where we know the target output, and this would be a more tractable setting for beginning this line of research. Second, even in cases where we can only verify outputs, the existing literature includes techniques for finding prompts to induce those outputs.
Many of the most concerning capabilities can be evaluated in settings where we do know the target output. If we want to assess biorisks from a model, we can ask multiple choice questions about virology and wet lab security protocols, where we know the correct answer choice in advance. If we want to assess general capabilities, we can ask a model to take the bar exam, medical licensing exam, or the IMO. These evaluations capture much of the concerning behavior that regulators might audit a model for. And because we know the target outputs for each task, gradient-based techniques for prompt optimization could help us find the password (or another prompt) that demonstrates the model’s hidden capabilities.
But if we don’t know the target output and can only verify answers after they’ve been generated, existing techniques can still help find prompts that cause those outputs. Human judgements of output quality can be used to train a model which scores outputs. In an RLHF setup, this would be the reward model providing the loss used to guide gradient updates of the generator model. In an adversarial attack, this same model can be used to figure out which prompts lead to high quality outputs. For example, ARCA is a gradient based method for finding prompts which cause generations that receive high scores from a scoring model. These techniques are not always very strong. If you’re interested in developing more powerful methods for inducing behaviors which cannot be specified but can be verified, check out CAIS’s automated red teaming competition which will use classifiers to evaluate whether the harmful behavior was demonstrated.
For a first paper on this topic, you might fine-tune an LLM to sandbag a standard benchmark unless a password is present, then attempt to find that password. If finding the password is difficult, then the Trojans people have an interesting new problem to work on, and evaluators and red teamers will need to solve that problem before they can reliable audit models using standard benchmarks. OTOH, if finding the password is too easy, then you can create more sophisticated password protected models, or move to an open-ended generation setting.
Overall I think this is an interesting new framing of the Trojan / adversarial attack problem, where the goal is not to destroy model performance but rather to discover hidden capabilities. Many existing techniques would be helpful in solving this problem, but there’s a lot of work to be done adapting the existing literature to this new framing and introducing the problem of detecting complex hidden abilities.
Fixed this! There was a regex on the frontend which incorrectly retrieved all lists of voting options from the prompt, rather than the most recent list of voting options. This led website users to have their votes incorrectly parsed. Fortunately, this was only a frontend problem and did not affect the GPT results.
Here’s the commit. Thanks again for the heads up.