Thanks a lot for this helpful comment! You are absolutely right: the citations refer to goal misgeneralization, which is a problem of inner alignment, whereas goal misspecification is related to outer alignment. I have updated the post to reflect this.
Seems to me like this is easily resolved so long as you don’t screw up your book keeping. In your example, the hypothesis implicitly only makes a claim about the information going out of the bubble. So long as you always write down which nodes or layers of the network your hypothesis makes what claims about, I think this should be fine?
Yes, totally agree. We are not claiming that this is a failure mode of CaSc, and it can “easily” be resolved by making your hypothesis more specific. We are merely pointing out that “In theory, this is a trivial point, but we found that in practice, it is easy to miss this distinction when there is an “obvious” algorithm to implement a given function.”
I don’t know that much about CaSc, but why are you comparing the ablated graphs to the originals via their separate loss on the data in the first place? Stamping behaviour down into a one dimensional quantity like that is inevitably going to make behavioural comparison difficult.
You are right that this failure mode is mostly due to reducing the behavior down to a single aggregate quantity like the average loss recovered. It can be remedied by looking at the loss on individual samples instead of averaging the metric across the whole dataset. In the footnote, we point out that researchers at Redwood Research have also started looking at the per-sample loss instead of the aggregate loss.
CaSc was, however, introduced by looking at the average scrubbed loss (even though the authors say that this metric is not ideal). Also, in practice, when one iterates on generating hypotheses and testing them with CaSc, it’s more convenient to look at aggregate metrics. We thus think it is useful to have concrete examples that show how this can lead to problems.
Your suggestion of using DKL seems a useful improvement over most metrics. It is, however, still possible that cancellation could occur. Cancellation is mostly due to aggregating a metric over the dataset (e.g., taking the mean) and less due to the specific metric used (although I could imagine that some metrics like DKL allow for less ambiguity).
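To make the cancellation point concrete, here is a minimal sketch with made-up numbers. The per-sample losses and the `loss_recovered` helper are illustrative assumptions (a simplified version of a "fraction of loss recovered" metric), not values or code from the post or from Redwood's implementation.

```python
from statistics import mean

def loss_recovered(scrubbed, original, baseline):
    """Fraction of loss recovered, computed from dataset-level mean losses.

    Assumed, simplified definition: 1.0 means the scrubbed model matches the
    original model's mean loss, 0.0 means it is as bad as the baseline.
    """
    return (mean(baseline) - mean(scrubbed)) / (mean(baseline) - mean(original))

# Hypothetical per-sample losses for four inputs.
original = [1.0, 1.0, 1.0, 1.0]   # unmodified model
scrubbed = [0.5, 1.5, 0.5, 1.5]   # better on some samples, worse on others
baseline = [3.0, 3.0, 3.0, 3.0]   # e.g. a fully randomized model

# The aggregate metric looks perfect because the errors cancel in the mean.
aggregate = loss_recovered(scrubbed, original, baseline)

# The per-sample view shows that every single sample changed.
per_sample_gap = [abs(s - o) for s, o in zip(scrubbed, original)]

print(aggregate)        # aggregate score from mean losses
print(per_sample_gap)   # absolute per-sample differences
```

In this toy setup the aggregate score says the hypothesis fully explains the behavior, while the per-sample gaps show that the scrubbed model behaves differently on every input.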
Thanks for your comments Akash. I think I have two main points I want to address.
I agree that it’s very good that the field of AI Alignment is very competitive! I did not want to imply that this is a bad thing. I was mainly trying to point out that from my point of view, it seems like overall there are more qualified and experienced people than there are jobs at large organizations. And in order to fill that gap we would need more senior researchers, who then can follow their research agendas and hire people (and fund orgs), which is however hard to achieve. One disclaimer I want to note is that I do not work at a large org, and I do not precisely know what kinds of hiring criteria they have, i.e. it is possible that in their view we still lack talented enough people. However, from the outside, it definitely does look like there are many experienced researchers.
It is possible that my previous statement may have been misinterpreted. I wish to clarify that my concerns do not pertain to funding being a challenge. I did not want to make an assertion about funding in general, and if my words gave that impression, I apologize. I do not know enough about the funding landscape to know whether there is a lot or not enough funding (especially in recent months). I agree with you that, for all I know, it’s feasible to get funding for independent researchers (and definitely easier than doing a Ph.D. or getting a full-time position). I also agree that independent research seems to be more heavily funded than in other fields.
My point was mainly the following:
Many people have joined the field (which is great!), or at least it looks like it from the outside. 80,000 Hours etc. still recommend switching to AI Alignment, so it seems likely that more people will join.
I believe that there are many opportunities for people to up-skill to a certain level if they want to join the field (SERI MATS, AI Safety Camp, etc.).
However, full-time positions (for example at big labs) are very limited. This also makes sense, since they can only hire so many people a year.
It seems like the most obvious option for people who want to stay in the field is to do independent research (and apply for grants). I think it’s great that people do independent research and that one has the opportunity to get grants.
However, doing independent research is not always ideal for many reasons (as outlined in my main comment). Note I’m not saying it doesn’t make sense at all, it definitely has its merits.
In order to have more full-time positions we need more senior people, who can then fund their organizations, or independently hire people, etc. To me, independent research does not seem like a promising avenue for grooming senior researchers. It’s essential that you can learn from people who are better than you and be in a good environment (yes, there are exceptions like Einstein, but I think most researchers I know would agree with that statement).
So to me, the biggest bottleneck of all is how we can get many great researchers and groom them to be senior researchers who can lead their own orgs. I think that so far we have really optimized for getting people into the field (which is great). But we haven’t really found a solution to grooming senior researchers (again, some programs try to do that and I’m aware that this takes time). Overall I believe that this is a hard problem and probably others have already thought about it. I’m just trying to make that point in case nobody has written it up yet. Especially if people are trying to do AI safety field building, it seems to me that coming up with ways to groom senior researchers is a top priority.
Ultimately I’m not even sure whether there is a clear solution to this problem. The field is still very new and it’s amazing what has already happened. It’s probable that it just takes time for the field to mature and for people to gain more experience. I think I mostly wanted to point this out, even if it is maybe obvious.
My argument here is closely related to what jacquesthibs mentions.
Right now it seems like the biggest bottleneck for the AI Alignment field is senior researchers. There are tons of junior people joining the field and I think there are many opportunities for junior people to up-skill and do some programs for a few months (e.g. SERI MATS, MLAB, REMIX, AGI Safety Fundamentals, etc.). The big problem (in my view) is that there are not enough organizations to actually absorb all the rather “junior” people at the moment. My sense is that 80K and most programs encourage people to up-skill and then try to get a job at a big organization (like DeepMind, Anthropic, OpenAI, Conjecture, etc.). Realistically speaking though, these organizations can only absorb a few people in a year. In my experience, it’s extremely competitive to get a job at these organizations even if you’re a more experienced researcher (e.g. having done a couple of years of research, a Ph.D., or similar). This means that while there are many opportunities for junior people to get a foothold in the field, there are very few paths that actually allow you to have a full-time career in this field (this also applies to more experienced researchers who don’t get into a big lab). So the bottleneck in my view is not having enough organizations, which is a result of not having enough senior researchers. Founding an org is super hard: you want to have experienced people with good research taste and some kind of research agenda. So if you don’t have many senior people in a field, it will be hard to find people who can found those additional orgs.
Now, one career path that many people are currently taking is being an “independent researcher” funded through a grant. I would claim that this is currently the default path for any researcher who does not get a full-time position and wants to stay in the field.
I believe that there are people out there who will do great as independent researchers and actually contribute to solving problems (e.g. Marius Hobbhahn and John Wentworth talk about being independent researchers). I am, however, quite skeptical about most people doing independent research without any kind of supervision. I am not saying one can’t make progress, but it’s super hard to do this without a lot of research experience, a structured environment, good supervision, etc. I am especially skeptical about independent researchers becoming great senior researchers if they can’t work with people who are already very experienced and learn from them. Intuitively I think that no other field has junior people independently working without clear structures and supervision, so I feel like my skepticism is warranted.
In terms of career capital, being an independent researcher is also very risky. If your research fails, i.e. you don’t get a lot of good output (papers, code libraries, or whatever), “having done independent research for a couple of years” will not sound great in your CV. As a comparison, if you somehow do a very mediocre Ph.D. with no great insights, but you do manage to get the title, at least you have that in your CV (having a Ph.D. can be pretty useful in many cases).
So overall I believe that decision makers and AI field builders should put their main attention on how we can “groom” senior researchers in the field and get more full-time positions through organizations. I don’t claim to have the answers on how to solve this. But it does seem to be the greatest bottleneck for field building in my opinion. It seems like the field was able to get a lot more people excited about AI safety and to change their careers (we still have by far not enough people though). However, right now I think that many people are kind of stuck as junior researchers, having done some programs, and not being able to get full-time positions. Note that I am aware that some programs such as SERI MATS do in some sense have the ambition of grooming senior researchers. However, in practice, it still feels like there is a big gap right now.
My background (in case this is useful): I’ve been doing ML research throughout my Bachelor’s and Master’s. I’ve worked at FAR AI on “AI alignment” for the last 1.5 years, so I was lucky to get a full-time position. I don’t consider myself a “senior” researcher as defined in this comment, but I definitely have a lot of research experience in the field. From my own experience, it’s pretty hard to find a new full-time position in the field, especially if you are also geographically constrained.
This means that, at least in theory, the out of distribution behaviour of amortized agents can be precisely characterized even before deployment, and is likely to concentrate around previous behaviour. Moreover, the out of distribution generalization capabilities should scale in a predictable way with the capacity of the function approximator, of which we now have precise mathematical characterizations due to scaling laws.
Do you have pointers that explain this part better? I understand that scaling compute and data will reduce misgeneralization to some degree. But what is the reasoning for why misgeneralization should be predictable, given the capacity and the knowledge of “in-distribution scaling laws”?
Overall I hold the same opinion, that intuitively this should be possible. But empirically I’m not sure whether in-distribution scaling laws can tell us anything about out-of-distribution scaling laws. Surely we can predict that with increasing model and data scale the out-of-distribution misgeneralization will go down. But given that we can’t really quantify all the possible out-of-distribution datasets, it’s hard to make any claims about how precisely it will go down.
That’s interesting!
Yeah, I agree with that assessment. One important difference between RLHF and fine-tuning is that the former basically generates the training distribution it then trains on. So, the LM will generate an output and update its weights based on the reward of that output. So intuitively I think it has a higher likelihood of being optimized towards certain unwanted attractors, since the reward model will shape the future outputs it then learns from.
With fine-tuning you are just cloning a fixed distribution, and not influencing it (as you say). So I tend to agree that unwanted attractors are probably due to the outputs of RLHF-trained models. I think that we need empirical evidence for this though (to be certain).
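A toy sketch of this difference, under strong simplifying assumptions: the “model” is just a softmax over three possible outputs, the target distribution and the rewards are made-up numbers, and the RL update is the exact expected policy gradient without the KL penalty that real RLHF setups typically add. It is only meant to illustrate the intuition, not how any actual model was trained.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def supervised_finetune(target, steps=2000, lr=1.0):
    """Gradient descent on cross-entropy against a FIXED target distribution."""
    logits = [0.0] * len(target)
    for _ in range(steps):
        probs = softmax(logits)
        # Gradient of cross-entropy w.r.t. the logits is (probs - target).
        logits = [l - lr * (p - q) for l, p, q in zip(logits, probs, target)]
    return softmax(logits)

def rl_finetune(rewards, steps=2000, lr=1.0):
    """Expected policy-gradient ascent: the policy is trained on its OWN outputs."""
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(logits)
        expected_reward = sum(p * r for p, r in zip(probs, rewards))
        # Gradient of expected reward w.r.t. logit i is p_i * (r_i - E[r]).
        logits = [l + lr * p * (r - expected_reward)
                  for l, p, r in zip(logits, probs, rewards)]
    return softmax(logits)

# Hypothetical data distribution and reward values.
sft_probs = supervised_finetune([0.5, 0.3, 0.2])
rl_probs = rl_finetune([1.0, 0.5, 0.0])

print([round(p, 3) for p in sft_probs])  # stays close to the fixed target
print([round(p, 3) for p in rl_probs])   # mass concentrates on the top-reward output
```

In this toy, supervised fine-tuning converges to the fixed target distribution, whereas the RL policy keeps shifting probability mass toward the highest-reward output, which is the kind of attractor behavior discussed above.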
Given your statement, I also think that doing those experiments with GPT-3 models is gonna be hard because we basically have no way of telling what data it learned from, how it was generated, etc. So one would need to be more scientific and train various models with various optimization schemes, on known data distributions.
OpenAI has just released a description of how their models work here. text-davinci-002 is trained with “FeedME” and text-davinci-003 is trained with RLHF (PPO). “FeedME” is what they call supervised fine-tuning on human-written demonstrations or model samples rated 7/7 by human labelers. So basically fine-tuning on high-quality data.
I think your findings are still very interesting, because they imply that even further fine-tuning changes the distribution significantly! Given all this information, one could now actually run a systematic comparison of davinci, text-davinci-002 (fine-tuning), and text-davinci-003 (RLHF) and see how the distribution changes on various tasks. Let me know if you want help with this, I’m interested in this myself.
I looked at your code (very briefly though), and you mention this weird thing where even the normal model sometimes is completely unaligned (i.e. even in the observed case it takes the action “up” all the time). You say that this sometimes happens and that it depends on the random seed. Not sure (since I don’t fully understand your code), but that might be something to look into since somehow the model could be biased in some way given the loss function.
Why am I mentioning this? Well, why does it happen that the mesa-optimized agent happens to go upward when it’s not supervised anymore? I’m not trying to poke a hole in this, I’m generally just curious. The fact that it can behave out of distribution given all of its knowledge makes sense. But why will it specifically go up, and not down? I mean even if it goes down it still satisfies your criteria of a treacherous turn. But maybe the going up has something to do with this tendency of going up depending on the random seed. So probably this is a nitpick, but just something I’ve been wondering.
I’ll link to the following post that came out a little bit earlier, “Mysteries of Mode Collapse due to RLHF”, which is basically a critique of the whole RLHF approach and the Instruct models (specifically text-davinci-002).
Would also love to have a look.
I think the term you are looking for is “Deliberate Practice” (just two random links I just found). Many books/podcasts/articles have been written about that topic. The big difference is that when you “just do your research” you are executing your skills and trying to achieve the main goal (e.g. answering a research question). Yes, you sometimes need to read textbooks or learn new skills to achieve that, but this learning is usually subordinate to your end goal. Also, one could make the argument that if you actually need to invest a few hours into learning, you will probably switch to “deliberate practice mode”. Deliberate practice is the very intentional action of improving your skill, e.g. sitting down at a piano and improving your technique or learning a new piece. Or improving as a writer by doing intentional exercises, or solving specific math problems that improve a certain skill. The advantage of deliberate practice is that its main goal is to improve your skill. Also, you are usually at the edge of your ability, pushing through difficulties, which makes the whole endeavor very intense and hard.
So yes, I agree that doing research is important. Especially if you have no experience, getting better at research is usually best done by doing research. However, you still need to do other things that specifically improve subskills. Here are a few examples:
becoming better at coding: e.g. through pair programming, code reviews, HackerRank exercises, side projects, reading books
becoming better at writing: e.g. doing writing exercises (no idea what exactly but I’m sure there’s stuff out there), reviewing stuff you have written, trying to imitate the style of a paper, writing blog posts
becoming better at reading papers: reading lots of papers, summarizing them, presenting them, writing a blog post about them
becoming better at finding good research ideas and being a good researcher: talking to lots of people, reading lots about researchers’ thoughts, Film study for research, etc.
I think by adding terminology I just wanted to make explicit what you mention in your post. It will also make it easier to find resources given the word “deliberate practice”.
“Either”, “or” pairs in text. Heuristic: if the word “either” appears in a sentence, wait for the comma and then add “ or”.
What follows are a few examples. Note that the completions are just something I randomly came up with; the important part is the “ or”. Using the webapp, GPT-2 puts a high probability (around 40%-60%) on the token “ or”.
“Either you take a left at the next intersection,” → or take a left after that.
“Either you go to the cinema,” → or you stay at home.
“Tonight I could either order some food,” → or cook something myself.
Counterexample: “Do you rather want to go to Portugal or Italy? Either” → way is fine. / one is fine. (GPT-2 puts a lot of probability on “ way”, and barely any on “ or”, which is correct.)
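The heuristic can be written down as a small rule-based predictor. This is only a sketch of the rule as I stated it; the function name `predicts_or` and the sentence-splitting regex are my own illustrative choices, and it does not query GPT-2 at all.

```python
import re

def predicts_or(prefix: str) -> bool:
    """Return True if the heuristic predicts " or" as the next token.

    Rule: the current (last) sentence contains the word "either"
    and the text so far ends with a comma.
    """
    # Only look at the sentence currently being written.
    sentence = re.split(r"[.!?]\s+", prefix)[-1].rstrip()
    has_either = re.search(r"\beither\b", sentence, re.IGNORECASE) is not None
    return has_either and sentence.endswith(",")

examples = [
    "Either you take a left at the next intersection,",
    "Either you go to the cinema,",
    "Tonight I could either order some food,",
    "Do you rather want to go to Portugal or Italy? Either",
]
for text in examples:
    print(predicts_or(text), "|", text)
```

The first three prompts trigger the rule, while the counterexample does not, because no comma has appeared yet after “Either”. The rule is obviously much cruder than whatever GPT-2 actually does; for instance, it would also fire on a sentence that already contains the matching “or”.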
ERO: I do buy the argument of steganography everywhere if you are optimizing for outcomes. As described here (https://www.lesswrong.com/posts/pYcFPMBtQveAjcSfH/supervise-process-not-outcomes), outcome-based optimization is an attractor and will make your sub-components uninterpretable. While not guaranteed, I do think that process-based optimization might suffer less from steganography (although only experiments will eventually show what happens). Any thoughts on process-based optimization?
Shard Theory: Yeah, the term “research agenda” was maybe poorly chosen. I was mainly trying to refer to research directions/frameworks.
RAT: Agree, at the moment this is not feasible.
See above, I don’t have strong views on what to call this. Probably for some things “research agenda” might be too strong a term.
I appreciate your general comment, this is helpful in better understanding your view on LessWrong vs., for example, peer review. I think you are right to some degree. There is a lot of content that is mostly about framing and does not provide concrete results. However, I think that sometimes a correct framing is needed for people to actually come up with interesting results, and for making things more concrete. Some examples I like are the inner/outer alignment framing (which I think initially didn’t bring any concrete examples), or the recent Simulators (https://www.lesswrong.com/posts/vJFdjigzmcXMhNTsx/simulators) post. I think in those cases the right framing helps tremendously to make progress with concrete research afterward. That said, I agree that grounded, concrete, and result-oriented experimentation is indeed needed to make concrete progress on a problem. So I do understand your point, and it can feel like flag planting in some cases.
Note: I’m also coming from academia, so I definitely understand your view and share it to some degree. However, I’ve personally come to appreciate some posts (usually by great researchers) that allow me to think about the Alignment Problem in a different way.
I read “Film Study for Research” just the other day (https://bounded-regret.ghost.io/film-study/, recommended by Jacob Steinhardt). In retrospect I realized that a lot of the posts here give a window into the rather “raw & unfiltered thinking process” of various researchers, which I think is a great way to practice research film study.
Thanks for your thoughts, really appreciate it.
One quick follow-up question: when you say “build powerful AI tools that are deceptive” as a way of “the problem being easier than anticipated”, how exactly do you mean that? Do you mean that, if we can create deceptive or power-seeking tool AI very easily, it will be much simpler to investigate what is happening and derive solutions?
Here are some links to the concepts you asked about.
Externalized Reasoning Oversight: This was also recently introduced https://www.lesswrong.com/posts/FRRb6Gqem8k69ocbi/externalized-reasoning-oversight-a-research-direction-for . The main idea is to use chain-of-thought reasoning to oversee the thought processes of your model (assuming that those thought processes are complete and straightforward, and the output causally depends on them).
Shard Theory: https://www.lesswrong.com/posts/iCfdcxiyr2Kj8m8mT/the-shard-theory-of-human-values. It was proposed very recently. Their TL;DR is “We propose a theory of human value formation. According to this theory, the reward system shapes human values in a relatively straightforward manner. Human values are not e.g. an incredibly complicated, genetically hard-coded set of drives, but rather sets of contextually activated heuristics which were shaped by and bootstrapped from crude, genetically hard-coded reward circuitry.”
Relaxed Adversarial Training: I think the main post is this one https://www.lesswrong.com/posts/9Dy5YRaoCxH9zuJqa/relaxed-adversarial-training-for-inner-alignment . But I really like the short description by Beth (https://www.lesswrong.com/posts/YQALrtMkeqemAF5GX/another-list-of-theories-of-impact-for-interpretability):
“The basic idea of relaxed adversarial training is something like:
A deceptive model may be waiting for an input that strongly indicates it’s in the deployment environment rather than in a test—for example, one that’s computationally hard to produce
This makes generating adversarial examples that trigger the defection very hard
Instead of searching over inputs, we want to search over something like ‘the model’s beliefs about what it has seen’, to see if there’s anything it could have seen that would cause it to defect”
I have two questions I’d love to hear your thoughts about.
1. What is the overarching/high-level research agenda of your group? Do you have a concrete alignment agenda where people work on the same thing or do people work on many unrelated things?
2. What are your thoughts on the various research agendas that exist today for solving the alignment problem? Why do you think they will fall short of their goal? What are you most excited about?
Feel free to talk about any agendas, but I’ll just list a few that come to my mind (in no particular order).
IDA, Debate, Interpretability (I read a tweet I think, where you said you are rather skeptical about this), Natural Abstraction Hypothesis, Externalized Reasoning Oversight, Shard Theory, (Relaxed) Adversarial Training, ELK, etc.
Finetuning LLMs with RL seems to make them more agentic. We will look at the changes RL makes to LLMs’ weights; we can see how localized the changes are, get information about what sorts of computations make something agentic, and make conjectures about selected systems, giving us a better understanding of agency.
Could you elaborate on how you measure the “agenticness” of a model in this experiment? In case you don’t want to talk about it until you finish the project that’s also fine, just thought I’d ask.