CS student at the University of Southern California. Previously worked for three years as a data scientist at a fintech startup. Before that, four months on a work trial at AI Impacts. Currently working with Professor Lionel Levine on language model safety research.
aogara
Model-driven feedback could amplify alignment failures
Very nice. Any ideas on how a language model could tell the difference between training and deployment time? Assume it’s a pure LM with no inputs besides the context window. A few possibilities:
Distribution shift in prompts. Perhaps training prompts are systematically different from deployment prompts. The model could predict this difference in advance, or it could simply observe a bunch of OOD inputs once it reaches deployment. For example, the repetitive, templated sequences common in training datasets would be much rarer in deployment prompts.
Observing a string that could not be generated before a certain date. For example, the front page of the New York Times on a certain date, if and only if the model can verify its accuracy via the internet. I’m not sure if there are any examples that can be verified within the context window alone.
The model could exploit implementation-level details about train vs. test time.
Training might use FP32 or mixed precision, while deployment might use a lower-precision format like FP16, FP8, or INT8. The model could condition its output on the smallest value representable in its activations.
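As a toy illustration of the precision probe (written in numpy for clarity; obviously not something a pure LM could literally execute, but the same information is in principle visible in its activations):

```python
import numpy as np

def machine_epsilon(dtype):
    """Smallest eps (found by halving) such that 1 + eps still differs from 1."""
    eps = dtype(1.0)
    while dtype(1.0) + eps / dtype(2.0) != dtype(1.0):
        eps = eps / dtype(2.0)
    return eps

# FP32 and FP16 give visibly different answers (~1.2e-7 vs ~9.8e-4),
# so a probe of effective precision could distinguish the two settings.
print(machine_epsilon(np.float32))
print(machine_epsilon(np.float16))
```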
You mention testing if “purposeful disruptions to the world downstream of high belief in a simulation lead to changes matching what they’d expect if SGD were in control.” Could you say more about this? Do you mean that the model would e.g. output a result that would induce a particular gradient update, and then check whether that gradient update happened? How would the model remember that it’s running this kind of experiment—wouldn’t that require passing information between episodes? Maybe the information could be encoded in the weights, but that seems tricky.
I’m thinking through a project that would deliberately fine-tune a language model to pursue long-term goals across episodes. This would be a useful red-team exercise for scalable oversight strategies such as model-assisted critiques and Constitutional AI. If you can fine-tune a model so that it deliberately subverts these oversight techniques and preserves its long-term goal, you show that the current techniques are vulnerable to deceptive alignment, and provide an impetus to build techniques with more robust checks against deception.
I don’t think language models will take actions to make future tokens easier to predict
For an analogy, look at recommender systems. Their objective is myopic in the same way a language model’s is: predict which recommendation will most likely result in a click. Yet they have power-seeking strategies available, such as shifting the preferences of a user to make their behavior easier to predict. These incentives are well documented and simulations confirm the predictions here and here. The real-world evidence is scant—a study of YouTube’s supposed radicalization spiral came up negative, though the authors didn’t log in to YouTube, which could mean less personalized recommendations.
The jury is out on whether current recommender systems execute power-seeking strategies to improve their supposedly myopic objective. But the incentive and the means are clearly present, and to me it seems only a matter of time before we observe this behavior in the wild. Similarly, while I don’t think current language models are creative or capable enough to execute a power-seeking strategy, it seems like power seeking by a superintelligent language model would be rewarded with lower loss. If a language model could use its outputs to persuade humans to train it with more compute on more data, thereby reducing its loss, there seems to be every incentive for the model to seek power in this way.
Do you think that the next-token prediction objective would lead to instrumentally convergent goals and/or power-seeking behavior? I could imagine an argument that a model would want to seize all available computation in order to better predict the next token. Perhaps the model’s reward schedule would never lead it to think about such paths to loss reduction, but if the model is creative enough to consider such a plan and powerful enough to execute it, it seems that many power-seeking plans would help achieve its goal. This is significantly different from the view advanced by OpenAI, that language models are tools which avoid some central dangers of RL agents, and from the general distinction drawn between tool AI and agentic AI.
“Early stopping on a separate stopping criterion which we don’t run gradients through, is not at all similar to reinforcement learning on a joint cost function additively incorporating the stopping criterion with the nominal reward.”
Sounds like this could be an interesting empirical experiment. Similar to Scaling Laws for Reward Model Overoptimization, you could start with a gold reward model that represents human preference. Then you could try to figure out the best way to train an agent to maximize gold reward using only a limited number of sampled data points from that gold distribution. For example, you could train two reward models using different samples from the same distribution and use one for training, the other for early stopping. (This is essentially the train / val split used in typical ML settings with data constraints.) You could measure the best ways to maximize gold reward on a limited data budget. Alternatively, your early-stopping RM could be trained on samples from the gold RM with distribution shift, data augmentations, adversarial perturbations, or a challenge set of particularly hard cases.
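For concreteness, here is a minimal sketch of the setup I have in mind (architectures, sample sizes, and hyperparameters are all placeholders I made up):

```python
import torch
import torch.nn as nn

def make_rm(dim=32):
    return nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

gold_rm = make_rm()                      # stands in for true human preference
train_rm, stop_rm = make_rm(), make_rm()

def fit_proxy(proxy, n_samples=1000, steps=200):
    """Fit a proxy RM to the gold RM's labels on a limited sample budget."""
    xs = torch.randn(n_samples, 32)
    with torch.no_grad():
        ys = gold_rm(xs)
    opt = torch.optim.Adam(proxy.parameters(), lr=1e-3)
    for _ in range(steps):
        loss = nn.functional.mse_loss(proxy(xs), ys)
        opt.zero_grad(); loss.backward(); opt.step()

fit_proxy(train_rm)                      # each call draws its own sample
fit_proxy(stop_rm)

for p in list(gold_rm.parameters()) + list(train_rm.parameters()) + list(stop_rm.parameters()):
    p.requires_grad_(False)              # only the "policy" is optimized from here on

# The "policy" is just a point being optimized against the training proxy.
policy = torch.randn(1, 32, requires_grad=True)
opt = torch.optim.Adam([policy], lr=1e-2)
best_stop, patience = -float("inf"), 0
for step in range(5000):
    opt.zero_grad()
    (-train_rm(policy).mean()).backward()            # maximize the training proxy
    opt.step()
    with torch.no_grad():
        stop_reward = stop_rm(policy).mean().item()
        gold_reward = gold_rm(policy).mean().item()  # measured, never trained on
    if stop_reward > best_stop:
        best_stop, patience = stop_reward, 0
    else:
        patience += 1
        if patience > 50:    # early stop when the held-out proxy stops improving
            break
print(f"stopped at step {step}, gold reward {gold_reward:.3f}")
```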
Would you be interested to see an experiment like this? Ideally it could make progress towards an empirical study of how reward hacking happens and how to prevent it. Do you see design flaws that would prevent us from learning much? What changes would you make to the setup?
Really cool analysis. I’d be curious to see the implications for a BioAnchors timelines model if you straightforwardly incorporate this compute forecast.
For those interested in empirical work on RLHF and building the safety community, Meta AI’s BlenderBot project has produced several good papers on the topic. A few that I liked:
My favorite one is about filtering out trolls from crowdsourced human feedback data. They begin with the observation that when you crowdsource human feedback, you’re going to get some bad feedback. They identify various kinds of archetypal trolls: the “Safe Troll” who marks every response as safe, the “Unsafe Troll” who marks them all unsafe, the “Gaslight Troll” whose only feedback is marking unsafe generations as safe, and the “Lazy Troll” whose feedback is simply incorrect a random percentage of the time. Then they implement several methods for filtering out incorrect feedback examples, and find that the best approach (for their particular hypothesized problem) is to score each user’s trustworthiness based on agreement with other users’ feedback scores and filter feedback from untrustworthy users. This seems like an important problem for ChatGPT or any system accepting public feedback, and I’m very happy to see a benchmark identifying the problem and several proposed methods making progress.
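To make the cross-user agreement idea concrete, here’s a rough sketch in the spirit of their approach (my own simplification, not their actual algorithm; the thresholds are arbitrary):

```python
from collections import defaultdict

def filter_trolls(feedback, min_agreement=0.7, min_overlap=5):
    """feedback: list of (user_id, example_id, label) tuples."""
    votes_by_example = defaultdict(list)
    for user, ex, label in feedback:
        votes_by_example[ex].append(label)

    # Majority label per example, using only examples rated by 2+ users.
    majority = {}
    for ex, labels in votes_by_example.items():
        if len(labels) >= 2:
            majority[ex] = max(set(labels), key=labels.count)

    # Score each user by agreement with the majority on overlapping examples.
    agree, total = defaultdict(int), defaultdict(int)
    for user, ex, label in feedback:
        if ex in majority:
            total[user] += 1
            agree[user] += int(label == majority[ex])

    trusted = {u for u in total
               if total[u] >= min_overlap and agree[u] / total[u] >= min_agreement}
    return [(u, ex, l) for u, ex, l in feedback if u in trusted]
```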
They have a more general paper about online learning from human feedback. They recommend a modular feedback form similar to the one later used by ChatGPT, and they observe that feedback on smaller models can improve the performance of larger models.
DIRECTOR is a method for language model generation aided by a reward model classifier. The method performs better than ranking and filtering beam searches as proposed in FUDGE, but unfortunately they don’t compare to full fine-tuning on a learned preference model as used by OpenAI and Anthropic.
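As a generic sketch of the shared idea behind DIRECTOR and FUDGE (not either paper’s exact formulation), per-token classifier scores can be mixed into the LM’s next-token distribution at decoding time:

```python
import torch
import torch.nn.functional as F

def guided_next_token(lm_logits, classifier_logits, alpha=1.0):
    """
    lm_logits: [vocab] next-token logits from the language model.
    classifier_logits: [vocab] per-token logits from a classifier head estimating,
        e.g., the probability that the completion stays safe given this token.
    alpha: weight on the classifier term (assumed hyperparameter).
    """
    combined = F.log_softmax(lm_logits, dim=-1) + alpha * F.logsigmoid(classifier_logits)
    return int(torch.argmax(combined))

# Usage with dummy tensors standing in for real model outputs:
vocab_size = 50257
print(guided_next_token(torch.randn(vocab_size), torch.randn(vocab_size)))
```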
They also have this paper and this paper on integrating retrieved knowledge into language model responses, which is relevant for building Truthful AI but advances capabilities more than I think safety researchers should be comfortable with.
Meta and Yann LeCun in particular haven’t been very receptive to arguments about AI risk. But the people working on BlenderBot have a natural overlap with OpenAI, Anthropic, and others working on alignment who’d like to collect and use human feedback to align language models. Meta’s language models aren’t nearly as capable as OpenAI’s and have attracted correspondingly less buzz, but they did release BlenderBot 3 with the intention of gathering lots of human feedback data. They could plausibly be a valuable ally and source of research on how to robustly align language models using human preference data.
Thanks for sharing, I agree with most of these arguments.
Another possible shortcoming of RLHF is that it teaches human preferences within the narrow context of a specific task, rather than teaching a more principled and generalizable understanding of human values. For example, ChatGPT learns from human feedback to not give racist responses in conversation. But does the model develop a generalizable understanding of what racism is, why it’s wrong, and how to avoid moral harm from racism in a variety of contexts? It seems like the answer is no, given the many ways to trick ChatGPT into providing dangerously racist outputs. Redwood’s work on high-reliability filtering came to a similar conclusion: current techniques don’t robustly generalize even to examples similar to the corrected failures.
Rather than learning by trial and error, it’s possible that learning human values in a systematic, principled way could generalize further than RLHF. For example, the ETHICS benchmark measures whether language models understand the implications of various moral theories. Such an understanding could be used to filter the outputs of another language model or AI agent, or as a reward model to train other models. Similarly, law is a codified set of human behavioral norms with established procedures for oversight and dispute resolution. There has been discussion of how AI could learn to follow human laws, though law has significant flaws as a codification of human values. I’d be interested to see more work evaluating and improving AI understanding of principled systems of human value, and using that understanding to better align AI behavior.
(h/t Michael Chen and Dan Hendrycks for making this argument before I understood it.)
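As a toy sketch of the output-filtering idea above (the judge here is just a stand-in for a model fine-tuned on something like ETHICS; the names and threshold are made up):

```python
def filter_responses(candidates, moral_judge, threshold=0.9):
    """
    candidates: candidate response strings from some generator model.
    moral_judge: callable mapping a string to an estimated probability that
        the response is morally acceptable (e.g. an ETHICS-finetuned model).
    """
    return [text for text in candidates if moral_judge(text) >= threshold]

# Dummy judge for illustration only; a real judge would be a learned model.
def dummy_judge(text):
    return 0.0 if "insult" in text.lower() else 1.0

print(filter_responses(["Happy to help!", "Here's an insult: ..."], dummy_judge))
```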
“China is working on a more than 1 trillion yuan ($143 billion) support package for its semiconductor industry.”
“The majority of the financial assistance would be used to subsidise the purchases of domestic semiconductor equipment by Chinese firms, mainly semiconductor fabrication plants, or fabs, they said.”
“Such companies would be entitled to a 20% subsidy on the cost of purchases, the three sources said.”
My impression is this is too little, too late. Does it change any of your forecasts or analysis?
I think it’s worth forecasting AI risk timelines instead of GDP timelines, because the former is what we really care about while the latter raises a bunch of economics concerns that don’t necessarily change the odds of x-risk. Daniel Kokotajlo made this point well a few years ago.
On a separate note, you might be interested in Erik Brynjolfsson’s work on the economic impact of AI and other technologies. For example, this paper argues that general purpose technologies have an implementation lag, where many people can see the transformative potential of the technology decades before the economic impact is realized. This would explain the Solow Paradox, named after economist Robert Solow’s 1987 remark that “you can see the computer age everywhere but in the productivity statistics.” Solow was right that the long-heralded technology had not had significant economic impact at that point in time, but the coming decade would change that narrative, with >4% real GDP growth in the United States driven primarily by IT. I’ve been taking notes on these and other economics papers relevant to AI timelines forecasting; send me your email if you’d like to check them out.
Overall I was similarly bearish on short timelines, and have updated this year towards a much higher probability on 5-15 year timelines, while maintaining a long tail especially on the metric of GDP growth.
Probably using the same interface as WebGPT
This is fantastic, thank you for sharing. I helped start USC AI Safety this semester and we’re facing a lot of the same challenges. Some questions for you—feel free to answer some but not all of them:
What does your Research Fellows program look like?
In particular: How many different research projects do you have running at once? How many group members are involved in each project? Have you published any results yet?
Also, in terms of hours spent or counterfactual likelihood of producing a useful result, how much of the research contribution comes from students without significant prior research experience vs. people who’ve already published papers or otherwise have significant research experience?
The motivation for this question is that we’d like to start our own research track, but we don’t have anyone in our group with the research experience of your PhD students or PhD graduates. One option would be to have students lead research projects, hopefully with advising from senior researchers who can contribute ~1 hour / week or less. But if that doesn’t seem likely to produce useful outputs or learning experiences, we could also just focus on skilling up and getting people jobs with experienced researchers at other institutions. Which sounds more valuable to you?
What about the general member reading group?
Is there a curriculum you follow, or do you pick readings week-by-week based on discussion?
It seems like there are a lot of potential activities for advanced members: reading groups, the Research Fellows program, facilitating intro groups, weekly social events, and participating in any opportunities outside of HAIST. Do you see a tradeoff where dedicated members are forced to choose which activities to focus on? Or is it more of a flywheel effect, where more engagement begets more dedication? For the typical person who finished your AGISF intro group and has good technical skills, which activities would you most want them to focus on? (My guess would be research > outreach and facilitation > participant in reading groups > social events.)
Broadly I agree with your focus on the most skilled and engaged members, and I’d worry that the ease of scaling up intro discussions could distract us from prioritizing research and skill-building for those members. How do you plan to deeply engage your advanced members going forward?
Do you have any thoughts on the tradeoff between using AGISF vs. the ML Safety Scholars curriculum for your introductory reading group?
MLSS requires ML skills as a prerequisite, which is both a barrier to entry and a benefit. Instead of conceptual discussions of AGI and x-risk, it focuses on coding projects and published ML papers on topics like robustness and anomaly detection.
This semester we used a combination of both, and my impression is that the MLSS selections were better received, particularly the coding assignments. (We’ll have survey results on this soon.) This squares with your takeaway that students care about “the technically interesting parts of alignment (rather than its altruistic importance)”.
MLSS might also be a better fit for a research-centered approach if research opportunities in the EA ecosystem are limited but students can do safety-relevant work with mainstream ML researchers.
On the other hand, AGISF seems better at making the case that AGI poses an x-risk this century. A good chunk of our members still are not convinced of that argument, so I’m planning to update the curriculum at least slightly towards more conceptual discussion of AGI and x-risks.
How valuable do you think your Governance track is relative to your technical tracks?
Personally I think governance is interesting and important, and I wouldn’t want the entire field of AI safety to be focused on technical topics. But thinking about our group, all of our members are stronger in technical skills than in philosophy, politics, or economics. Do you think it’s worth putting in the effort to recruit non-technical members and running a Governance track next semester, or would that effort be better spent focusing on technical members?
Appreciate you sharing all these detailed takeaways, it’s really helpful for planning our group’s activities. Good luck with next semester!
How much do you expect Meta to make progress on cutting edge systems towards AGI vs. focusing on product-improving models like recommendation systems that don’t necessarily advance the danger of agentic, generally intelligent AI?
My impression earlier this year was that several important people had left FAIR, and then FAIR and all other AI research groups were subsumed into product teams. See https://ai.facebook.com/blog/building-with-ai-across-all-of-meta/. I thought this would mean deprioritizing fundamental research breakthroughs and focusing instead on less cutting edge improvements to their advertising or recommendation or content moderation systems.
But Meta AI has made plenty of important research contributions since then: Diplomacy, their video generator, open-sourcing OPT, and their scientific knowledge bot. Their rate of research progress doesn’t seem to be slowing, and might even be increasing. How do you expect Meta to prioritize fundamental research vs. product going forward?
Interesting paper on the topic, you might’ve seen it already. They show that as you optimize a proxy of a true underlying reward model, the true loss follows a U-shaped curve: high at first, then low, then rising again as the proxy is overoptimized.
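A toy illustration of that shape (my own construction, not the paper’s setup): optimize a proxy that is mostly but not perfectly aligned with the true objective, and track the true loss.

```python
import numpy as np

# The proxy points in nearly the right direction but is unbounded, so heavy
# optimization of the proxy eventually increases the true loss again.
true_target = np.array([3.0, 0.0])
proxy_direction = np.array([0.9, 0.436])
proxy_direction /= np.linalg.norm(proxy_direction)

def true_loss(x):
    return float(np.sum((x - true_target) ** 2))

x = np.zeros(2)
print(0, round(true_loss(x), 2))             # loss before any optimization
for step in range(1, 201):
    x += 0.05 * proxy_direction              # ascend the proxy reward
    if step % 20 == 0:
        print(step, round(true_loss(x), 2))  # falls, bottoms out, then rises
```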
This isn’t caused by mesa-optimization, which so far has not been observed in naturally optimized neural networks. It’s more closely related to robustness and generalization under varying amounts of optimization pressure.
But if we grant the mesa-optimizer concern, it seems reasonable that more optimization will result in more coherent inner misaligned goals and more deceptive behavior to hide them. Unfortunately I don’t think slack is really a solution, because optimizing an objective really hard is what makes your AI capable of doing interesting things. Handicapping your systems isn’t a solution to AI safety; the alignment tax would be too high.
The key argument is that Stable Diffusion is more accessible, meaning more people can create deepfakes with fewer images of their subject and no specialized skills. From above (links removed):
“The unique danger posed by today’s text-to-image models stems from how they can make harmful, non-consensual content production much easier than before, particularly via inpainting and outpainting, which allows a user to interactively build realistic synthetic images from natural ones, dreambooth, or other easily used tools, which allow for fine-tuning on as few as 3-5 examples of a particular subject (e.g. a specific person). More of which are rapidly becoming available following the open-sourcing of Stable Diffusion. It is clear that today’s text-to-image models have uniquely distinct capabilities from methods like Photoshop, RNNs trained on specific individuals or, “nudifying” apps. These previous methods all require a large amount of subject-specific data, human time, and/or human skill. And no, you don’t need to know how to code to interactively use Stable Diffusion, uncensored and unfiltered, including in/outpainting and dreambooth.”
If what you’re questioning is the basic ability of Stable Diffusion to generate deepfakes, here [1] is an NSFW link to www.mrdeepfakes.com, where a user says, “having played with this program a lot in the last 48 hours, and personally SEEN what it can do with NSFW, I guarantee you it can 100% assist not only in celeb fakes, but in completely custom porn that never existed or will ever exist.” He then provides links to NSFW images generated by Stable Diffusion, including deepfakes of celebrities. This is apparently facilitated by the LAION-5B dataset, which Stability AI admits has about 3% unsafe images and which he claims has “TONS of captioned porn images in it”.
[1] Warning, NSFW: https://mrdeepfakes.com/forums/threads/guide-using-stable-diffusion-to-generate-custom-nsfw-images.10289/
I was wrong that nobody in China takes alignment seriously! Concordia Consulting led by Brian Tse and Tianxia seem to be leading the charge. See this post and specifically this comment. To the degree that poor US-China relations slow the spread of alignment work in China, current US policy seems harmful.
Very interesting argument. I’d be interested in using a spreadsheet where I can fill out my own probabilities for each claim, as it’s not immediately clear to me how you’ve combined them here.
On one hand, I agree that intent alignment is insufficient for preventing x-risk from AI. There are too many other ways for AI to go wrong: coordination failures, surveillance, weaponization, epistemic decay, or a simple failure to understand human values despite the ability to faithfully pursue specified goals. I’m glad there are people like you working on which values to embed in AI systems and ways to strengthen a society full of powerful AI.
On the other hand, I think this post misses the reason for popular focus on intent alignment. Some people think that, for a sufficiently powerful AI trained in the current paradigm, there is no goal that it could faithfully pursue without collapsing into power seeking, reward hacking, and other instrumental goals leading to x-risk. Ajeya Cotra’s framing of this argument is most persuasive to me. Or Eliezer Yudkowsky’s “strawberry alignment problem”, which (I think) he believes is currently impossible and captures the most challenging part of alignment:
“How would you get an AI system to do some very modest concrete action requiring extremely high levels of intelligence, such as building two strawberries that are completely identical at the cellular level, without causing anything weird or disruptive to happen?”
Personally I think there’s plenty of x-risk from intent-aligned systems and people should think about what we do once we have intent alignment. Eliezer seems to think this is more of a distraction from the real problem than it’s worth, but surveys suggest that many people in AI safety orgs think x-risk is disjunctive across many scenarios. Which is all to say: aligning AI with societal values is important, but I wouldn’t dismiss intent alignment either.
I agree with your point but you might find this funny: https://twitter.com/waxpancake/status/1583925788952657920
This sounds right to me, and is a large part of why I’d be skeptical of explosive growth. But I don’t think it’s an argument against the most extreme stories of AI risk. AI could become increasingly intelligent inside a research lab, yet we could underestimate its ability because regulation stifles its impact, until it eventually escapes the lab and pursues misaligned goals with reckless abandon.