Instant classic. Putting it in our university group syllabus next to What Failure Looks Like. Sadly could get lost in the recent LW tidal wave, someone should promote it to the Alignment Forum.
I’d love to see the most important types of work for each failure mode. Here’s my very quick version, any disagreements or additions are welcome:
Predictive model misuse—People use AI to do terrible things.
Adversarial Robustness: Train ChatGPT to withstand jailbreaking. When people try to trick it into doing something bad using a creative prompt, it shouldn’t work.
Anomaly Detection: A related but different approach. Instead of training the base model to give good responses to adversarial prompts, you can train a separate classifier to detect adversarial prompts, and refuse to answer them / give a stock response from a separate model.
Cybersecurity at AI companies: If someone steals the weights, they can misuse the model!
Regulation: Make it illegal to release models that help people do bad things.
Strict Liability: If someone does a bad thing with OpenAI’s model, make OpenAI legally responsible.
Communications: The open source community is often gleeful about projects like ChaosGPT. Is it possible to shift public opinion? Be careful, scolding can embolden them.
Predictive models playing dangerous characters
Haven’t thought about this enough. Maybe start with the new wave of prompting techniques for GPT agents, is there a way to make those agents less likely to go off the rails? Simulators and Waluigi Effect might be useful frames to think through.
Scalable oversight failure without deceptive alignment
Scalable Oversight, of course! But I’m concerned about capabilities externalities here. Some of the most successful applications of IDA and AI driven feedback only make models more capable of pursuing goals, without telling them any more about the content of our goals. This research should seek to scale feedback about human values, not about generic capabilities.
Machine Ethics / understanding human values: The ETHICS benchmark evaluates whether models have the microfoundations of our moral decision making. Ideally, this will generalize better than heuristics that are specific to a particular situation, such as RLHF feedback on helpful chatbots. I’ve heard the objection that “AI will know what you want, it just won’t care”, but that doesn’t apply to this failure mode—we’re specifically concerned about poorly specifying our goals to the AI.
Regulation to ensure slow, responsible deployment. This scenario is much more dangerous if AI is autonomously making military, political, financial, and scientific decisions. More human oversight at critical junctures and a slower transformation into a strange, unknowable world means more opportunities to correct AI and remove it from high stakes deployment scenarios. The EU AI Act identifies eight high risk areas requiring further scrutiny, including biometrics, law enforcement, and management of critical infrastructure. What else should be on this list?
What else? Does ELK count? If descendants of Collin Burns could tell us a model’s beliefs, could we train on not for what would earn the highest reward, but what a human would really want? Or is ELK only for spot checking, and training on its answers would negate its effectiveness?
Deceptive Alignment
Interpretability, ELK, and Trojans seem tractable and useful by my own inside view.
Some things I’m less familiar with that might be relevant: John Wentworth-style ontology identification, agent foundations, and shard theory. Correct me if I’m wrong here!
Recursive Self-Improvement --> hard takeoff
Slowing down. The FLI letter, ARC Evals, Yo Shavit’s compute governance paper, Ethan Cabellero and others predicting emergent capabilities. Maybe we can stop at the edge of the cliff and wait there while we figure something out. I’m sure some would argue this is impossible.
P(Doom) for each scenario would also be useful, as well as further scenarios not discussed here.
I’d love to see the most important types of work for each failure mode. Here’s my very quick version, any disagreements or additions are welcome:
Appreciate you doing a quick version. I’m excited for more attempts at this and would like to write something similar myself, though I might structure it the other way round if I do a high effort version (take an agenda, work out how/if it maps onto the different parts of this). Will try to do a low-effort set of quick responses to yours soon.
P(Doom) for each scenario would also be useful.
Also in the (very long) pipeline, and a key motivation! Not just for each scenario in isolation, but also for various conditionals like: - P(scenario B leads to doom | scenario A turns out not to be an issue by default) - P(scenario B leads to doom | scenario A turns out to be an issue that we then fully solve) - P(meaningful AI-powered alignment progress is possible before doom | scenario C is solved)
Instant classic. Putting it in our university group syllabus next to What Failure Looks Like. Sadly could get lost in the recent LW tidal wave, someone should promote it to the Alignment Forum.
I’d love to see the most important types of work for each failure mode. Here’s my very quick version, any disagreements or additions are welcome:
Predictive model misuse—People use AI to do terrible things.
Adversarial Robustness: Train ChatGPT to withstand jailbreaking. When people try to trick it into doing something bad using a creative prompt, it shouldn’t work.
Anomaly Detection: A related but different approach. Instead of training the base model to give good responses to adversarial prompts, you can train a separate classifier to detect adversarial prompts, and refuse to answer them / give a stock response from a separate model.
Cybersecurity at AI companies: If someone steals the weights, they can misuse the model!
Regulation: Make it illegal to release models that help people do bad things.
Strict Liability: If someone does a bad thing with OpenAI’s model, make OpenAI legally responsible.
Communications: The open source community is often gleeful about projects like ChaosGPT. Is it possible to shift public opinion? Be careful, scolding can embolden them.
Predictive models playing dangerous characters
Haven’t thought about this enough. Maybe start with the new wave of prompting techniques for GPT agents, is there a way to make those agents less likely to go off the rails? Simulators and Waluigi Effect might be useful frames to think through.
Scalable oversight failure without deceptive alignment
Scalable Oversight, of course! But I’m concerned about capabilities externalities here. Some of the most successful applications of IDA and AI driven feedback only make models more capable of pursuing goals, without telling them any more about the content of our goals. This research should seek to scale feedback about human values, not about generic capabilities.
Machine Ethics / understanding human values: The ETHICS benchmark evaluates whether models have the microfoundations of our moral decision making. Ideally, this will generalize better than heuristics that are specific to a particular situation, such as RLHF feedback on helpful chatbots. I’ve heard the objection that “AI will know what you want, it just won’t care”, but that doesn’t apply to this failure mode—we’re specifically concerned about poorly specifying our goals to the AI.
Regulation to ensure slow, responsible deployment. This scenario is much more dangerous if AI is autonomously making military, political, financial, and scientific decisions. More human oversight at critical junctures and a slower transformation into a strange, unknowable world means more opportunities to correct AI and remove it from high stakes deployment scenarios. The EU AI Act identifies eight high risk areas requiring further scrutiny, including biometrics, law enforcement, and management of critical infrastructure. What else should be on this list?
What else? Does ELK count? If descendants of Collin Burns could tell us a model’s beliefs, could we train on not for what would earn the highest reward, but what a human would really want? Or is ELK only for spot checking, and training on its answers would negate its effectiveness?
Deceptive Alignment
Interpretability, ELK, and Trojans seem tractable and useful by my own inside view.
Some things I’m less familiar with that might be relevant: John Wentworth-style ontology identification, agent foundations, and shard theory. Correct me if I’m wrong here!
Recursive Self-Improvement --> hard takeoff
Slowing down. The FLI letter, ARC Evals, Yo Shavit’s compute governance paper, Ethan Cabellero and others predicting emergent capabilities. Maybe we can stop at the edge of the cliff and wait there while we figure something out. I’m sure some would argue this is impossible.
P(Doom) for each scenario would also be useful, as well as further scenarios not discussed here.
Thanks, both for the thoughts and encouragement!
Appreciate you doing a quick version. I’m excited for more attempts at this and would like to write something similar myself, though I might structure it the other way round if I do a high effort version (take an agenda, work out how/if it maps onto the different parts of this). Will try to do a low-effort set of quick responses to yours soon.
Also in the (very long) pipeline, and a key motivation! Not just for each scenario in isolation, but also for various conditionals like:
- P(scenario B leads to doom | scenario A turns out not to be an issue by default)
- P(scenario B leads to doom | scenario A turns out to be an issue that we then fully solve)
- P(meaningful AI-powered alignment progress is possible before doom | scenario C is solved)
etc.
Great response! I would imitative generalization to the “Scalable oversight failure without deceptive alignment” section