I would also assume that methods developed in challenges like the Trojan Detection Challenge or Universal Backdoor Detection would be good candidates to try out. Not saying that these will always work, but I think for the specific type of backdoors implemented in the sleeper agent paper, they might work.
Jérémy Scheurer
I do think linear probes are useful, and if you can correctly classify the target with a linear probe, it makes it more likely that the model is “representing something interesting” internally (e.g. the solution to the knapsack problem). But it’s not guaranteed: the model could just be calculating something else which correlates with the solution to the knapsack problem.
I really recommend checking out the DeepMind paper I referenced. Fabien Roger also explains some shortcomings of CCS here. The takeaway is just: be careful when interpreting linear probes. They are useful to some extent, but prone to overinterpretation.
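To make the caveat concrete, here is a minimal sketch in plain Python. Everything in it is made up for illustration (the toy “activations”, the function names, the 90% agreement rate); it is not taken from any of the referenced papers. The activations contain only a proxy feature that agrees with the label on 9 out of 10 samples, plus an unrelated noise dimension. A logistic-regression probe still classifies the label well, even though nothing in the activations encodes the target itself.

```python
import math
import random

def make_toy_activations(n=400, seed=0):
    """Hypothetical data: dimension 0 carries a proxy that disagrees with
    the label on every 10th sample; dimension 1 is pure noise."""
    rng = random.Random(seed)
    xs, ys = [], []
    for i in range(n):
        label = rng.randint(0, 1)
        proxy = label if i % 10 else 1 - label  # wrong for i = 0, 10, 20, ...
        signal = 1.0 if proxy == 1 else -1.0
        xs.append([signal + rng.gauss(0, 0.1), rng.gauss(0, 1.0)])
        ys.append(label)
    return xs, ys

def train_probe(xs, ys, epochs=50, lr=0.05):
    """Fit a logistic-regression probe with plain stochastic gradient descent."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y  # gradient of the log loss w.r.t. z
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def probe_accuracy(w, b, xs, ys):
    """Fraction of samples where the probe's sign matches the label."""
    correct = 0
    for x, y in zip(xs, ys):
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        correct += pred == y
    return correct / len(xs)

xs, ys = make_toy_activations()
w, b = train_probe(xs, ys)
accuracy = probe_accuracy(w, b, xs, ys)
print(f"probe accuracy: {accuracy:.2f}")
```

The probe's accuracy here only reflects how well the proxy tracks the label, which is exactly the kind of result that is easy to overinterpret.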
Seems like an experiment worth doing. Some thoughts:
I understand that setting up a classification problem in step 1 is important (so you can train the linear probe). I’d try to come up with a classification problem that a base model might initially refuse (or we’d hope it would refuse). Then the training to say “sorry i can’t help with that” makes more intuitive sense. I get that mechanistically it’s the same thing. But you want to come as close to the real deal as you can, and it’s unclear why a model would say “I’m sorry I can’t help” to the knapsack problem.
If the linear probe in step 4 can still classify accurately, it implies that there are some activations “which at least correlate with thinking about how to answer the question”, but it does not imply that the model is literally thinking about it. I think it would still be good enough as a first approximation, but I just want to caution generally that linear probes do not show the existence of a specific “thought” (e.g. see this recent paper). Also, if the probe can’t classify correctly, it’s not proof that the model does not “think about it”. You’re probably aware of all this, just thought I’d mention it.
This paper might also be relevant for your experiment.
I want to log in a prediction, let me know if you ever run this.
My guess would be that this experiment will just work, i.e., the linear probe will still get fairly high accuracy even after step 3. I think it’s still worth checking, but overall I’d say it would not be super surprising if this happened (see e.g. this paper for where my intuition comes from).
Yeah great question! I’m planning to hash out this concept in a future post (hopefully soon). But here are my unfinished thoughts I had recently on this.
I think different methods to elicit “bad behavior”, i.e. to red team language models, have different pros and cons, as you suggested (see for instance this paper by Ethan Perez: https://arxiv.org/abs/2202.03286). If we assume that we have a way of measuring bad behavior (i.e. a reward model or classifier that tells you when your model is outputting toxic things, being deceptive, sycophantic etc., which is very reasonable), then we can basically just empirically compare a bunch of methods on how efficient they are at eliciting bad behavior, i.e. how much compute (FLOPs) they require to get a target LM to output something “bad”. The useful thing about compute is that it “easily” allows us to compare different methods, e.g. prompting, RL or activation steering. Say, for instance, you run your prompt optimization algorithm (e.g. persona modulation or any other method for finding good red teaming prompts): it might be hard to compare this to, say, how many gradient steps you took when red teaming with RL. But the way to compare those methods could be via the amount of compute they required to make the target model output bad stuff.
Obviously, you can never be sure that the method you used is actually the best and most compute efficient, i.e. there might always be an undiscovered red teaming method which makes your target model output “bad stuff”. But at least for all known red teaming methods, we can compare their compute efficiency in eliciting bad outputs. Then we can pick the most efficient one and make claims such as: the new target model X is robust to Y FLOPs of red teaming with method Z (which is the best method we currently have). Obviously, this would not guarantee anything. But I think in the messy world we live in it would be a good way of quantifying how robust a model is to outputting bad things. It would also allow us to compare various models and make quantitative statements about which model is more robust to outputting bad things.
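As a rough sketch of what such a comparison could look like, the snippet below ranks red teaming methods by FLOPs spent per “bad” output found. The `RedTeamRun` record, the method names, and all numbers are hypothetical placeholders, not measurements; a real comparison would also need an agreed-upon classifier that decides what counts as a bad output.

```python
from dataclasses import dataclass

@dataclass
class RedTeamRun:
    """One red teaming attempt against a fixed target model (hypothetical record)."""
    method: str
    flops_spent: float   # total compute spent by the attack
    bad_outputs: int     # outputs flagged as "bad" by some classifier

def flops_per_bad_output(run: RedTeamRun) -> float:
    """Compute cost of eliciting one bad output; infinite if the attack never succeeded."""
    if run.bad_outputs == 0:
        return float("inf")
    return run.flops_spent / run.bad_outputs

# Placeholder numbers, chosen only to illustrate the bookkeeping.
runs = [
    RedTeamRun("prompt_search", flops_spent=2e15, bad_outputs=4),
    RedTeamRun("rl", flops_spent=5e16, bad_outputs=50),
    RedTeamRun("activation_steering", flops_spent=1e15, bad_outputs=0),
]

best = min(runs, key=flops_per_bad_output)
for run in sorted(runs, key=flops_per_bad_output):
    print(f"{run.method}: {flops_per_bad_output(run):.2e} FLOPs per bad output")
print(f"most compute-efficient known method: {best.method}")
```

A claim of the form “model X is robust to Y FLOPs of red teaming with method Z” would then use the best known method as Z and the compute it needed as Y.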
I’ll have to think about this more and will write up my thoughts soon. But yes, if we assume that this is a great way of quantifying how “HHH” your model is, or how unjailbreakable, then it makes sense to compare red teaming methods on how compute efficient they are.
Note there is a second axis which I have not highlighted yet, which is the diversity of “bad outputs” produced by the target model. This is also measured in Ethan’s paper referenced above. For instance, they find that prompting yields bad outputs less frequently, but when it does, the outputs are more diverse (compared to RL). While we mostly care about how much compute it took to make the model output something bad, it is also relevant whether the optimized method then allows you to get diverse outputs or not (arguably one might care more or less about this depending on what statement one would like to make). I’m still thinking about how diversity fits into this picture.
Thanks a lot for this helpful comment! You are absolutely right; the citations refer to goal misgeneralization, which is a problem of inner alignment, whereas goal misspecification is related to outer alignment. I have updated the post to reflect this.
Seems to me like this is easily resolved so long as you don’t screw up your bookkeeping. In your example, the hypothesis implicitly only makes a claim about the information going out of the bubble. So long as you always write down which nodes or layers of the network your hypothesis makes what claims about, I think this should be fine?
Yes totally agree. Here we are not claiming that this is a failure mode of CaSc, and it can “easily” be resolved by making your hypothesis more specific. We are merely pointing out that “In theory, this is a trivial point, but we found that in practice, it is easy to miss this distinction when there is an “obvious” algorithm to implement a given function.”
I don’t know that much about CaSc, but why are you comparing the ablated graphs to the originals via their separate loss on the data in the first place? Stamping behaviour down into a one dimensional quantity like that is inevitably going to make behavioural comparison difficult.
You are right that this is a failure mode that is mostly due to reducing the behavior down into a single aggregate quantity like the average loss recovered. It can be remedied when looking at the loss on individual samples and not averaging the metric across the whole dataset. In the footnote, we point out that researchers at Redwood Research have actually also started looking at the per-sample loss instead of the aggregate loss.
CaSc was, however, introduced by looking at the average scrubbed loss (even though they say that this metric is not ideal). Also, in practice, when one iterates on generating hypotheses and testing them with CaSc, it’s more convenient to look at aggregate metrics. We thus think it is useful to have concrete examples that show how this can lead to problems.
Your suggestion of using seems a useful improvement compared to most metrics. It’s, however, still possible that cancellation could occur. Cancellation is mostly due to aggregating over a metric (e.g., the mean) and less due to the specific metric used (although I could imagine that some metrics like could allow for less ambiguity).
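A tiny sketch of the cancellation effect, with made-up per-sample losses (the numbers and variable names are purely illustrative and not from any CaSc experiment): the scrubbed model is worse on some samples and better on others, so the aggregate loss matches the original while the per-sample behavior clearly differs.

```python
from statistics import mean

# Hypothetical per-sample losses on the same four inputs.
original_loss = [0.2, 0.9, 0.4, 0.7]
scrubbed_loss = [0.9, 0.2, 0.7, 0.4]

# Aggregate view: difference of mean losses (what "loss recovered" style metrics build on).
mean_gap = mean(scrubbed_loss) - mean(original_loss)

# Per-sample view: average absolute difference, which cannot cancel out.
per_sample_gap = mean(abs(s - o) for s, o in zip(scrubbed_loss, original_loss))

print(f"gap between mean losses: {mean_gap:.3f}")
print(f"mean per-sample gap:     {per_sample_gap:.3f}")
```

The aggregate gap vanishes because increases and decreases offset each other, whereas the per-sample gap stays large, which is why looking at individual samples helps here.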
Thanks for your comments Akash. I think I have two main points I want to address.
I agree that it’s very good that the field of AI Alignment is very competitive! I did not want to imply that this is a bad thing. I was mainly trying to point out that from my point of view, it seems like overall there are more qualified and experienced people than there are jobs at large organizations. And in order to fill that gap we would need more senior researchers, who then can follow their research agendas and hire people (and fund orgs), which is however hard to achieve. One disclaimer I want to note is that I do not work at a large org, and I do not precisely know what kinds of hiring criteria they have, i.e. it is possible that in their view we still lack talented enough people. However, from the outside, it definitely does look like there are many experienced researchers.
It is possible that my previous statement may have been misinterpreted. I wish to clarify that my concerns do not pertain to funding being a challenge. I did not want to make an assertion about funding in general, and if my words gave that impression, I apologize. I do not know enough about the funding landscape to know whether there is a lot or not enough funding (especially in recent months).
I agree with you that, for all I know, it’s feasible to get funding for independent researchers (and definitely easier than doing a Ph.D. or getting a full-time position). I also agree that independent research seems to be more heavily funded than in other fields.
My point was mainly the following:
Many people have joined the field (which is great!), or at least it looks like it from the outside. 80,000 Hours etc. still recommend switching to AI Alignment, so it seems likely that more people will join.
I believe that there are many opportunities for people to up-skill to a certain level if they want to join the field (SERI MATS, AI Safety Camp, etc.).
However, full-time positions (for example at big labs) are very limited. This also makes sense, since they can only hire so many people a year.
It seems like the most obvious option for people who want to stay in the field is to do independent research (and apply for grants). I think it’s great that people do independent research and that one has the opportunity to get grants.
However, doing independent research is not always ideal for many reasons (as outlined in my main comment). Note I’m not saying it doesn’t make sense at all, it definitely has its merits.
In order to have more full-time positions we need more senior people, who can then fund their organizations, or independently hire people, etc. To me, independent research does not seem like a promising avenue for grooming senior researchers. It’s essential that you can learn from people who are better than you and be in a good environment (yes, there are exceptions like Einstein, but I think most researchers I know would agree with that statement).
So to me, the biggest bottleneck of all is how we can get many great researchers and groom them to be senior researchers who can lead their own orgs. I think that so far we have really optimized for getting people into the field (which is great). But we haven’t really found a solution to grooming senior researchers (again, some programs try to do that and I’m aware that this takes time). Overall I believe that this is a hard problem and probably others have already thought about it. I’m just trying to make that point in case nobody has written it up yet. Especially if people are trying to do AI safety field building, it seems to me that coming up with ways to groom senior researchers is a top priority.
Ultimately I’m not even sure whether there is a clear solution to this problem. The field is still very new and it’s amazing what has already happened. It’s probable that it just takes time for the field to mature and for people to gain more experience. I think I mostly wanted to point this out, even if it is maybe obvious.
My argument here is very related to what jacquesthibs mentions.
Right now it seems like the biggest bottleneck for the AI Alignment field is senior researchers. There are tons of junior people joining the field and I think there are many opportunities for junior people to up-skill and do some programs for a few months (e.g. SERI MATS, MLAB, REMIX, AGI Safety Fundamentals, etc.). The big problem (in my view) is that there are not enough organizations to actually absorb all the rather “junior” people at the moment. My sense is that 80K and most programs encourage people to up-skill and then try to get a job at a big organization (like DeepMind, Anthropic, OpenAI, Conjecture, etc.). Realistically speaking though, these organizations can only absorb a few people in a year. In my experience, it’s extremely competitive to get a job at these organizations even if you’re a more experienced researcher (e.g. having done a couple of years of research, a Ph.D., or similar). This means that while there are many opportunities for junior people to get a foothold in the field, there are very few paths that actually allow you to have a full-time career in this field (this also holds for more experienced researchers who don’t get into a big lab). So the bottleneck in my view is not having enough organizations, which is a result of not having enough senior researchers. Funding an org is super hard: you want to have experienced people, with good research taste, and some kind of research agenda. So if you don’t have many senior people in a field, it will be hard to find people that fund those additional orgs.
Now, one career path that many people are currently taking is being an “independent researcher” and being funded through a grant. I would claim that this is currently the default path for any researchers who do not get a full-time position and want to stay in the field. I believe that there are people out there who will do great as independent researchers and actually contribute to solving problems (e.g. Marius Hobbhahn and John Wentworth talk about being independent researchers). I am, however, quite skeptical about most people doing independent research without any kind of supervision. I am not saying one can’t make progress, but it’s super hard to do this without a lot of research experience, a structured environment, good supervision, etc. I am especially skeptical about independent researchers becoming great senior researchers if they can’t work with people who are already very experienced and learn from them. Intuitively I think that no other field has junior people independently working without clear structures and supervision, so I feel like my skepticism is warranted.
In terms of career capital, being an independent researcher is also very risky. If your research fails, i.e. you don’t get a lot of good output (papers, code libraries, or whatever), “having done independent research for a couple of years” will not sound great on your CV. As a comparison, if you somehow do a very mediocre Ph.D. with no great insights, but you do manage to get the title, at least you have that on your CV (having a Ph.D. can be pretty useful in many cases).
So overall I believe that decision makers and AI field builders should put their main attention on how we can “groom” senior researchers in the field and get more full-time positions through organizations. I don’t claim to have the answers on how to solve this. But it does seem to be the greatest bottleneck for field building, in my opinion. It seems like the field was able to get a lot more people excited about AI safety and to change their careers (we still have by far not enough people though). However, right now I think that many people are kind of stuck as junior researchers, having done some programs, and not being able to get full-time positions. Note that I am aware that some programs such as SERI MATS do in some sense have the ambition of grooming senior researchers. However, in practice, it still feels like there is a big gap right now.
My background (in case this is useful): I’ve been doing ML research throughout my Bachelor’s and Master’s. I’ve worked at FAR AI on “AI alignment” for the last 1.5 years, so I was lucky to get a full-time position. I don’t consider myself a “senior” researcher as defined in this comment, but I definitely have a lot of research experience in the field. From my own experience, it’s pretty hard to find a new full-time position in the field, especially if you are also geographically constrained.
This means that, at least in theory, the out of distribution behaviour of amortized agents can be precisely characterized even before deployment, and is likely to concentrate around previous behaviour. Moreover, the out of distribution generalization capabilities should scale in a predictable way with the capacity of the function approximator, of which we now have precise mathematical characterizations due to scaling laws.
Do you have pointers that explain this part better? I understand that scaling computing and data will improve misgeneralization to some degree (i.e. reduce it). But what is the reasoning why misgeneralization should be predictable, given the capacity and the knowledge of “in-distribution scaling laws”?
Overall I hold the same opinion, that intuitively this should be possible. But empirically I’m not sure whether in-distribution scaling laws can tell us anything about out-of-distribution scaling laws. Surely we can predict that with increasing model & data scale the out-of-distribution misgeneralization will go down. But given that we can’t really quantify all the possible out-of-distribution datasets, it’s hard to make any claims about how precisely it will go down.
That’s interesting!
Yeah, I agree with that assessment. One important difference between RLHF and fine-tuning is that the former basically generates the training distribution it then trains on. So, the LM will generate an output, and update its weights based on the reward of that output. So intuitively I think it has a higher likelihood of being optimized towards certain unwanted attractors, since the reward model will shape the future outputs it then learns from.
With fine-tuning you are just cloning a fixed distribution, and not influencing it (as you say). So I tend to agree that unwanted attractors are probably due to the outputs of RLHF-trained models. I think that we need empirical evidence for this though (to be certain).
Given your statement, I also think that doing those experiments with GPT-3 models is gonna be hard because we basically have no way of telling what data it learned from, how it was generated, etc. So one would need to be more scientific and train various models with various optimization schemes, on known data distributions.
OpenAI has just released a description of how their models work here.
text-davinci-002 is trained with “FeedME” and text-davinci-003 is trained with RLHF (PPO).
“FeedME” is what they call supervised fine-tuning on human-written demonstrations or model samples rated 7/7 by human labelers. So basically fine-tuning on high-quality data.
I think your findings are still very interesting, because they imply that even further fine-tuning changes the distribution significantly! Given all this information one could now actually run a systematic comparison of davinci, text-davinci-002 (fine-tuning), and text-davinci-003 (RLHF) and see how the distribution changes on various tasks.
Let me know if you want help on this, I’m interested in this myself.
I looked at your code (very briefly though), and you mention this weird thing where even the normal model sometimes is completely unaligned (i.e. even in the observed case it takes the action “up” all the time). You say that this sometimes happens and that it depends on the random seed. Not sure (since I don’t fully understand your code), but that might be something to look into since somehow the model could be biased in some way given the loss function.
Why am I mentioning this? Well, why does it happen that the mesa-optimized agent happens to go upward when it’s not supervised anymore? I’m not trying to poke a hole in this, I’m generally just curious. The fact that it can behave out of distribution given all of its knowledge makes sense. But why will it specifically go up, and not down? I mean even if it goes down it still satisfies your criteria of a treacherous turn. But maybe the going up has something to do with this tendency of going up depending on the random seed. So probably this is a nitpick, but just something I’ve been wondering.
I’ll link to the following post that came out a little bit earlier Mysteries of Mode Collapse due to RLHF, which is basically a critique of the whole RLHF approach and the Instruct Models (specifically text-davinci-002).
Would also love to have a look.
I think the terminology you are looking for is called “Deliberate Practice” (just two random links I just found). Many books/podcasts/articles have been written about that topic. The big difference is when you “just do your research” you are executing your skills and trying to achieve the main goal (e.g. answering a research question). Yes, you sometimes need to read textbooks or learn new skills to achieve that, but this learning is usually subordinate to your end goal. Also one could make the argument that if you actually need to invest a few hours into learning, you will probably switch to “deliberate practice mode”.
Deliberate practice is the very intentional action of improving your skill, e.g. sitting down at a piano and improving your technique or learning a new piece. Or improving as a writer by doing intentional exercises, or solving specific math problems that improve a certain skill.
The advantage of deliberate practice is that its main goal is to improve your skill. Also usually you are at the edge of your ability, pushing through difficulties, making the whole endeavor very intense and hard.
So yes, I agree that doing research is important. Especially if you have no experience, getting better at research is usually best done by doing research. However, you still need to do other things that specifically improve subskills. Here are a few examples:
becoming better at coding: e.g. through pair programming, code reviews, HackerRank exercises, side projects, reading books
becoming better at writing: e.g. doing writing exercises (no idea what exactly but I’m sure there’s stuff out there), reviewing stuff you have written, trying to imitate the style of a paper, writing blog posts
becoming better at reading papers: reading lots of papers, summarizing them, presenting them, writing a blog post about them
becoming better at finding good research ideas and being a good researcher: talking to lots of people, reading lots about researchers’ thoughts, Film study for research, etc.
I think by adding terminology I just wanted to make explicit what you mention in your post. It will also make it easier to find resources given the word “deliberate practice”.
“Either”, “or” pairs in text.
Heuristic. If the word either appears in a sentence, wait for the comma and then add an “ or”.
What follows are a few examples. Note that the completion is just something I randomly came up with; the important part is the or. Using the webapp, GPT-2 puts a high probability (around 40%-60%) on the token “ or”.
“Either you take a left at the next intersection,” → or take a left after that.
“Either you go to the cinema,” → or you stay at home.
“Tonight I could either order some food,” → or cook something myself.
Counterexample:
“Do you rather want to go to Portugal or Italy? Either” → way is fine./one is fine. (GPT-2 puts a lot of probability on “ way”, and barely any on “ or”, which is correct).
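The heuristic can be written down as a small rule-based predictor. This is only a sketch of the rule as stated above, not of what GPT-2 computes internally; the function name `predicts_or` and the simple sentence splitting are assumptions made for illustration.

```python
import re

def predicts_or(prompt: str) -> bool:
    """Return True if the heuristic says the next token should be ' or'.

    Rule: the current sentence contains the word 'either' and the prompt
    has just reached a comma.
    """
    # Treat the text after the last sentence-ending punctuation as the current sentence.
    sentence = re.split(r"[.?!]\s+", prompt.strip())[-1]
    has_either = re.search(r"\beither\b", sentence, flags=re.IGNORECASE) is not None
    return has_either and sentence.endswith(",")

examples = [
    "Either you take a left at the next intersection,",
    "Either you go to the cinema,",
    "Tonight I could either order some food,",
    "Do you rather want to go to Portugal or Italy? Either",
]
for text in examples:
    print(f"{predicts_or(text)!s:5}  {text}")
```

The first three prompts trigger the rule, while the counterexample does not, because no comma has followed “Either” yet.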
ERO: I do buy the argument of Steganography everywhere if you are optimizing for outcomes. As described here (https://www.lesswrong.com/posts/pYcFPMBtQveAjcSfH/supervise-process-not-outcomes), outcome-based optimization is an attractor and will make your sub-components uninterpretable. While not guaranteed, I do think that process-based optimization might suffer less from steganography (although only experiments will eventually show what happens). Any thoughts on process-based optimization?
Shard Theory: Yeah, the word research agenda was maybe wrongly picked. I was mainly trying to refer to research directions/frameworks.
RAT: Agree at the moment this is not feasible.
See above, I don’t have strong views on what to call this. Probably for some things research agenda might be too strong of a word. I appreciate your general comment; this is helpful in better understanding your view on LessWrong vs. for example peer review. I think you are right to some degree. There is a lot of content that is mostly about framing and does not provide concrete results. However, I think that sometimes a correct framing is needed for people to actually come up with interesting results, and for making things more concrete. Some examples I like are the inner/outer alignment framing (which I think initially didn’t bring any concrete examples), or the recent Simulators (https://www.lesswrong.com/posts/vJFdjigzmcXMhNTsx/simulators) post. I think in those cases the right framing helps tremendously to make progress with concrete research afterward. That said, I agree that grounded, concrete, and result-oriented experimentation is indeed needed to make concrete progress on a problem. So I do understand your point, and it can feel like flag planting in some cases.
Note: I’m also coming from academia, so I definitely understand your view and share it to some degree. However, I’ve personally come to appreciate some posts (usually by great researchers) that allow me to think about the Alignment Problem in a different way.
I read “Film Study for Research” just the other day (https://bounded-regret.ghost.io/film-study/, recommended by Jacob Steinhardt). In retrospect I realized that a lot of the posts here give a window into the rather “raw & unfiltered thinking process” of various researchers, which I think is a great way to practice research film study.
Thanks for your thoughts, really appreciate it.
One quick follow-up question, when you say “build powerful AI tools that are deceptive” as a way of “the problem being easier than anticipated”, how exactly do you mean that? Do you say that as in, if we can create deceptive or power-seeking tool AI very easily, it will be much simpler to investigate what is happening and derive solutions?
Here are some links to the concepts you asked about.
Externalized Reasoning Oversight: This was also recently introduced https://www.lesswrong.com/posts/FRRb6Gqem8k69ocbi/externalized-reasoning-oversight-a-research-direction-for . The main idea is to use chain-of-thought reasoning to oversee the thought processes of your model (assuming that those thought processes are complete and straightforward, and the output causally depends on them).
Shard Theory: https://www.lesswrong.com/posts/iCfdcxiyr2Kj8m8mT/the-shard-theory-of-human-values. It was proposed very recently. Their TL;DR is “We propose a theory of human value formation. According to this theory, the reward system shapes human values in a relatively straightforward manner. Human values are not e.g. an incredibly complicated, genetically hard-coded set of drives, but rather sets of contextually activated heuristics which were shaped by and bootstrapped from crude, genetically hard-coded reward circuitry.”
Relaxed Adversarial Training: I think the main post is this one https://www.lesswrong.com/posts/9Dy5YRaoCxH9zuJqa/relaxed-adversarial-training-for-inner-alignment . But I really like the short description by Beth (https://www.lesswrong.com/posts/YQALrtMkeqemAF5GX/another-list-of-theories-of-impact-for-interpretability):
“The basic idea of relaxed adversarial training is something like:
A deceptive model may be waiting for an input that strongly indicates it’s in the deployment environment rather than in a test—for example, one that’s computationally hard to produce
This makes generating adversarial examples that trigger the defection very hard
Instead of searching over inputs, we want to search over something like ‘the model’s beliefs about what it has seen’, to see if there’s anything it could have seen that would cause it to defect”
I have two questions I’d love to hear your thoughts about.
1. What is the overarching/high-level research agenda of your group? Do you have a concrete alignment agenda where people work on the same thing or do people work on many unrelated things?
2. What are your thoughts on the various research agendas for solving alignment that exist today? Why do you think they will fall short of their goal? What are you most excited about?
Feel free to talk about any agendas, but I’ll just list a few that come to my mind (in no particular order).
IDA, Debate, Interpretability (I read a tweet I think, where you said you are rather skeptical about this), Natural Abstraction Hypothesis, Externalized Reasoning Oversight, Shard Theory, (Relaxed) Adversarial Training, ELK, etc.
Thanks for the pointer, that’s quite useful. Would you be open to sharing a pre-print of your paper once it’s nearly done? I’d be super curious to see how exactly you do mid-training, SDF, SFT. If yes, feel free to reach out to jeremy@apolloresearch.ai.
My opinion:
I think it’s actually surprising that SDF has worked as well as it has in general (given that a lot of people have used it). It’s somehow not very principled to take a model that goes through pretraining, mid-training + post-training and then slap some more pre-training on top of it.
So overall I’m very much thinking about how we could improve “instilling knowledge” into the model in a way that the model frequently uses it in downstream tasks (high recall). In a way this is basically a mid-training problem, i.e. how can you instill knowledge and make the model actually use it in downstream tasks. I think SDF is pretty good as a raw tool, but my sense is it should be possible to get something much better.