(EDIT: I’m already seeing downvotes of the post, it was originally at 58 AF karma. This wasn’t my intention: I think this is a failure of the community as a whole, not of the author.)
Okay, this has gotten enough karma and has been curated and has influenced another post, so I suppose I should engage, especially since I’m not planning to put this in the Alignment Newsletter.
The linked paper studies bandit problems, where each episode of RL is a new bandit problem where the agent doesn’t know which arm gives maximal reward. Unsurprisingly, the agent learns to first explore, and then exploit the best arm. This is a simple consequence of the fact that you have to look at observations to figure out what to do. Basic POMDP theory will tell you that when you have partial observability your policy needs to depend on history, i.e. it needs to learn.
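To see why history-dependence alone produces explore-then-exploit behavior, here is a deterministic toy of my own construction (not the paper's actual setup, which uses stochastic rewards): any memoryless policy can only earn the good arm's reward by luck, while a policy that conditions on past (action, reward) pairs identifies the good arm after one pull of each.

```python
import random

def run_episode(policy, n_steps=100, seed=None):
    """One 'episode' is a fresh two-armed bandit: which arm pays is unknown."""
    rng = random.Random(seed)
    good_arm = rng.randrange(2)   # hidden state, resampled every episode
    history = []                  # (action, reward) pairs the policy may use
    total = 0
    for _ in range(n_steps):
        a = policy(history)
        r = 1 if a == good_arm else 0
        history.append((a, r))
        total += r
    return total

def memoryless(history):
    return 0   # any fixed arm: right only half the time across episodes

def explore_then_exploit(history):
    if len(history) < 2:
        return len(history)   # try each arm once
    # exploit whichever arm has paid off so far
    return max((0, 1), key=lambda a: sum(r for (act, r) in history if act == a))

n_episodes = 200
fixed = sum(run_episode(memoryless, seed=i) for i in range(n_episodes)) / n_episodes
adaptive = sum(run_episode(explore_then_exploit, seed=i) for i in range(n_episodes)) / n_episodes
print(fixed, adaptive)   # the history-dependent policy wins
```

Nothing here "learns" in any exotic sense; the adaptive policy is simply the obvious history-conditioned strategy, which is the point.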
However, because bandit problems have been studied in the AI literature, and “learning algorithms” have been proposed to solve bandit problems, this very normal fact of a policy depending on observation history is now trotted out as “learning algorithms spontaneously emerge”. I don’t understand why this was surprising to the original researchers, it seems like if you just thought about what the optimal policy would be given the observable information, you would make exactly this prediction. Perhaps it’s because it’s primarily a neuroscience paper, and they weren’t very familiar with AI.
More broadly, I don’t understand what people are talking about when they speak of the “likelihood” of mesa optimization. If you mean the chance that the weights of a neural network are going to encode some search algorithm, then this paper should be ~zero evidence in favor of it. If you mean the chance that a policy trained by RL will “learn” without gradient descent, I can’t imagine a way that could fail to be true for an intelligent system trained by deep RL—presumably a system that is intelligent is capable of learning quickly, and when we talk about deep RL leading to an intelligent AI system, presumably we are talking about the policy being intelligent (what else?), therefore the policy must “learn” as it is being executed.
Gwern notes here that we’ve seen this elsewhere. This is because it’s exactly what you’d expect; in the other cases we just call conditioning on observations “adaptation” rather than “learning”.
----
Meta: I’m disappointed that I had to be the one to point this out. (Though to be fair, Gwern clearly understands this point.) There’s clearly been a lot of engagement with this post, and yet this seemingly obvious point hasn’t been said. When I saw this post first come up, my immediate reaction was “oh I’m sure this is a typical LW example of a case where the optimal policy is interpreted as learning, I’m not even going to bother clicking on the link”. Do we really have so few people who understand machine learning that, of the many, many views this post must have had, not one person could figure this out? It’s really no surprise that ML researchers ignore us if this is the level of ML understanding we as a community have.
EDIT: I should give credit to Nevan for pointing out that this paper is not much evidence in favor of the hypothesis that the neural network weights encode some search algorithm (before I wrote this comment).
I note that this doesn’t feel like a problem to me, mostly because of reasons related to Explainers Shoot High. Aim Low!. Even among ML experts, many of them haven’t touched much RL, because they’re focused on another field. Why expect them to know basic RL theory, or to have connected that to all the other things that they know?
More broadly, I don’t understand what people are talking about when they speak of the “likelihood” of mesa optimization.
I don’t think I have a fully crisp view of this, but here’s my frame on it so far:
One view is that we design algorithms to do things, and those algorithms have properties that we can reason about. Another is that we design loss functions, and then search through random options for things that perform well on those loss functions. In the second view, often which options we search through doesn’t matter very much, because there’s something like the “optimal solution” that all things we actually find will be trying to approximate in one way or another.
Mesa-optimization is something like, “when we search through the options, will we find something that itself searches through a different set of options?”. Some of those searches are probably benign—the bandit algorithm updating its internal value function in response to evidence, for example—and some of those searches are probably malign (or, at least, dangerous). In particular, we might think we have restrictions on the behavior of the base-level optimizer that turn out to not apply to any subprocesses it manages to generate, and so those properties don’t actually hold overall.
But it seems to me like overall we’re somewhat confused about this. For example, the way I normally use the word “search”, it doesn’t apply to the bandit algorithm updating its internal value function. But does Abram’s distinction between mesa-search and mesa-control actually mean much? There are lots of problems that you can solve exactly with calculus, and solve approximately with well-tuned simple linear estimators, and thus saying “oh, it can’t do calculus, it can only do linear estimates” won’t rule out it having a really good solution; presumably a similar thing could be true with “search” vs. “control,” where in fact you might be able to build a pretty good search-approximator out of elements that only do control.
So, what would it mean to talk about the “likelihood” of mesa optimization? Well, I remember a few years back when there was a lot of buzz about hierarchical RL. That is, you would have something like a policy for which ‘tactic’ (or ‘sub-policy’ or whatever you want to call it) to deploy, and then each ‘tactic’ is itself a policy for what action to take. In 2015, it would have been sensible to talk about the ‘likelihood’ of RL models in 2020 being organized that way. (Even now, we can talk about the likelihood that models in 2025 will be organized that way!) But, empirically, this seems to have mostly not helped (at least as we’ve tried it so far).
As we imagine deploying more complicated models, it feels like there are two broad classes of things that can happen during runtime:
‘Task location’, where they know what to do in a wide range of environments, and all they’re learning is which environment they’re in. The multi-armed bandit is definitely in this case; GPT-3 seems like it’s mostly doing this.
‘Task learning’, where they are running some sort of online learning process that gives them ‘new capabilities’ as they encounter new bits of the world.
The two blur into each other; you can imagine training a model to deal with a range of situations, and yet it also performs well on situations not seen in training (that are interpolations between situations it has seen, or where the old abstractions apply correctly, and thus aren’t “entirely new” situations). Just like some people argue that anything we know how to do isn’t “artificial intelligence”, you might get into a situation where anything we know how to do is task ‘location’ instead of task ‘learning.’
But to the extent that our safety guarantees rely on the lack of capability in an AI system, any ability for the AI system to do learning instead of location means that it may gain capabilities we didn’t expect it to have. That said, merely restricting it to ‘location’ may not help us very much, because if we misunderstand the abstractions that govern the system’s generalizability, we may underestimate what capabilities it will or won’t have.
There’s clearly been a lot of engagement with this post, and yet this seemingly obvious point hasn’t been said.
I think people often underestimate the degree to which, if they want to see their opinions in a public forum, they will have to be the one to post them. This is both because some points are less widely understood than you might think, and because even if someone understands the point, that doesn’t mean it connects to their interests in a way that would make them say anything about it.
I note that this doesn’t feel like a problem to me, mostly because of reasons related to Explainers Shoot High. Aim Low!. Even among ML experts, many of them haven’t touched much RL, because they’re focused on another field. Why expect them to know basic RL theory, or to have connected that to all the other things that they know?
I’m perfectly happy with good explanations that don’t assume background knowledge. The flaw I am pointing to has nothing to do with explanations. It is that despite this result being a clear consequence of basic RL theory, readers are for some reason treating it as important evidence. Clearly I should update negatively on things-AF-considers-important. At a more gears level, presumably I should update towards some combination of:
AF readers don’t know RL.
AF readers upvote anything that’s cheering for their team.
AF readers automatically believe anything written in a post without checking it.
Any of these would be a pretty damning critique of the forum. And the update should be fairly strong, given that this was (prior to my comment) the highest-upvoted post ever by AF karma.
I think people often underestimate the degree to which, if they want to see their opinions in a public forum, they will have to be the one to post them.
If you saw a post that ran an experiment where they put their hand in boiling water, and the conclusion was “boiling water is dangerous”, and you saw it get to be the most upvoted post ever on LessWrong, with future posts citing it as evidence for boiling water being dangerous, would your reaction be “huh, I guess I need to state my opinion that this is obvious”?
There’s a difference between “I’m surprised no one has made this connection to this other thing” and “I’m surprised that readers are updating on facts that I expected them to already know”.
I don’t usually expect my opinions to show up on a public forum. For example, I am continually sad but not surprised about the fact that AF focuses on mesa optimizers as separate from capability generalization without objective generalization.
I guess I should explain why I upvoted this post despite agreeing with you that it’s not new evidence in favor of mesa-optimization. I actually had a conversation about this post with Adam Shimi prior to you commenting on it where I explained to him that I thought that not only was none of it new but also that it wasn’t evidence about the internal structure of models and therefore wasn’t really evidence about mesa-optimization. Nevertheless, I chose to upvote the post and not comment my thoughts on it. Some reasons why I did that:
I generally upvote most attempts on LW/AF to engage with the academic literature—I think that LW/AF would generally benefit from engaging with academia more and I like to do what I can to encourage that when I see it.
I didn’t feel like any comment I would have made would have anything more to say than things I’ve said in the past. In fact, in “Risks from Learned Optimization” itself, we talk about both a) why we chose to be agnostic about whether current systems exhibit mesa-optimization due to the difficulty of determining whether a system is actually implementing search or not (link) and b) examples of current work that we thought did seem to come closest to being evidence of mesa-optimization such as RL^2 (and I think RL^2 is a better example than the work linked here) (link).
(Flagging that I curated the post, but was mostly relying on Ben and Habryka’s judgment, in part since I didn’t see much disagreement. Since this discussion I’ve become more agnostic about how important this post is.)
One thing this comment makes me want is more nuanced reacts, so that people have an affordance to communicate how they feel about a post in a way that’s easier to aggregate.
Though I also notice that with this particular post it’s a bit unclear which react would be appropriate, since it sounds like it’s not “disagree” so much as “this post seems confused” or something.
FWIW, I appreciated that your curation notice explicitly includes the desire for more commentary on the results, and that curating it seems to have been a contributor to there being more commentary.
I didn’t feel like any comment I would have made would have anything more to say than things I’ve said in the past.
FWIW, I say: don’t let that stop you! (Don’t be afraid to repeat yourself, especially if there’s evidence that the point has not been widely appreciated.)
Unfortunately, I also only have so much time, and I don’t generally think that repeating myself regularly in AF/LW comments is a super great use of it.
The solution is clear: someone needs to create an Evan bot that will comment on every post on the AF related to mesa-optimization, by providing the right pointers to the paper.
Fair enough, those are sensible reasons. I don’t like the fact that the incentive gradient points away from making intellectual progress, but it’s not an obvious choice.
And the update should be fairly strong, given that this was (prior to my comment) the highest-upvoted post ever by AF karma.
Given karma inflation (as users gain more karma, their votes are worth more, but this doesn’t propagate backwards to earlier votes they cast, and more people become AF voters than lose AF voter status), I think the karma differences between this post and these other four 50+ karma posts [1] [2] [3] [4] are basically noise. So I think the actual question is “is this post really in that tier?”, to which “probably not” seems like a fair answer.
[I am thinking more about other points you’ve made, but it seemed worth writing a short reply on that point.]
it feels like there are two broad classes of things that can happen during runtime:
I agree this sort of thing is something you could mean by the “likelihood of mesa optimization”. As I said in the grandparent:
this paper should be ~zero evidence in favor of [mesa optimization in the task learning sense].
In practice, when people say they “updated in favor of mesa optimization”, they refer to evidence that says approximately nothing about what is “happening at runtime”; therefore I infer that they cannot be talking about mesa optimization in the sense you mean.
I appreciate you writing this, Rohin. I don’t work in ML, or do safety research, and it’s certainly possible I misunderstand how this meta-RL architecture works, or that I misunderstand what’s normal.
That said, I feel confused by a number of your arguments, so I’m working on a reply. Before I post it, I’d be grateful if you could help me make sure I understand your objections, so as to avoid accidentally publishing a long post in response to a position nobody holds.
I currently understand you to be making four main claims:
The system is just doing the totally normal thing “conditioning on observations,” rather than something it makes sense to describe as “giving rise to a separate learning algorithm.”
It is probably not the case that in this system, “learning is implemented in neural activation changes rather than neural weight changes.”
The system does not encode a search algorithm, so it provides “~zero evidence” about e.g. the hypothesis that mesa-optimization is convergently useful, or likely to be a common feature of future systems.
The above facts should be obvious to people familiar with ML.
Does this summary feel like it reasonably characterizes your objections?
I appreciate you writing this, Rohin. I don’t work in ML, or do safety research, and it’s certainly possible I misunderstand how this meta-RL architecture works, or that I misunderstand what’s normal.
Thanks. I know I came off pretty confrontational, sorry about that. I didn’t mean to target you specifically; I really do see this as bad at the community level but fine at the individual level.
I don’t think you’ve exactly captured what I meant, some comments below.
The system is just doing the totally normal thing “conditioning on observations,” rather than something it makes sense to describe as “giving rise to a separate learning algorithm.”
I think it is reasonable to describe it both as “conditioning on observations” and as “giving rise to a separate learning algorithm”.
It is probably not the case that in this system, “learning is implemented in neural activation changes rather than neural weight changes.”
On my interpretation of “learning” in this context, I would agree with that claim (i.e. I agree that learning is implemented in activation changes rather than weight changes via gradient descent). Idk what other people mean by “learning” though.
The system does not encode a search algorithm, so it provides “~zero evidence” about e.g. the hypothesis that mesa-optimization is convergently useful, or likely to be a common feature of future systems.
This sounds roughly right if you use the words as I mean them, but I suspect you aren’t using the words as I mean them.
There’s this thing where the mesa-optimization paper talks about a neural net that performs “search” via activation changes. When I read the paper, I took this to be an illustrative example, that was meant to stand in for “learning” more broadly, but that made more concrete and easier to reason about. (I didn’t think this consciously.) However, whenever I talk to people about this paper, they have different understandings of what is meant by “search”, and varying opinions on how much mesa optimization should be tied to “search”. But I think the typical opinion is that whether or not mesa optimization is happening depends on what algorithm the neural net weights encode, and you can’t deduce whether mesa optimization is happening just by looking at the behavior in the training environment, as it may just have “memorized” what good behavior is rather than “performing search”.
If you use this meaning of “search algorithm”, then you can’t tell whether a good policy is a “search algorithm” or not just by looking at behavior. Since this paper only talked about behavior of a good policy, it can’t be evidence in favor of “mesa-optimization-via-search-algorithm”.
The above facts should be obvious to people familiar with ML.
Oh definitely not those, most people in ML have never heard of “mesa optimization”.
----
I think my response to Vaniver better illustrates my concerns, but let me take a stab at making a simple list of claims.
1. The optimal policy in the bandit environment considered in the paper requires keeping track of the rewards you have gotten in the past, and basing your future decisions on this information.
2. You shouldn’t be surprised when applying an RL algorithm to a problem leads to a near-optimal policy for that problem. (This has many caveats, but they aren’t particularly relevant.)
3. Therefore, you shouldn’t be surprised by the results in this paper.
4. Therefore, you shouldn’t be updating based on this paper.
5. Claims 1 and 2 require only basic knowledge about RL.
I feel confused about why, on this model, the researchers were surprised that this occurred, and seem to think it was a novel finding that it will inevitably occur given the three conditions described. Above, you mentioned the hypothesis that maybe they just weren’t very familiar with AI. But looking at the author list, and their publications (e.g. 1, 2, 3, 4, 5, 6, 7, 8), this seems implausible to me. Most of the co-authors are neuroscientists by training, but a few have CS degrees, and all but one have co-authored previous ML papers. It’s hard for me to imagine their surprise was due to them lacking basic knowledge about RL?
Also, this OpenAI paper (whose authors seem quite familiar with ML)—which the summary of Wang et al. on DeepMind’s website describes as “closely related work,” and which appears to me to involve a very similar setup—describes their result similarly:
We structure the agent as a recurrent neural network, which receives past rewards, actions, and termination flags as inputs in addition to the normally received observations. Furthermore, its internal state is preserved across episodes, so that it has the capacity to perform learning in its own hidden activations. The learned agent thus also acts as the learning algorithm, and can adapt to the task at hand when deployed.
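Concretely, the input construction that quote describes amounts to something like the following (a schematic of my own, with made-up names; not the paper's actual code):

```python
def meta_rl_input(obs, prev_action, prev_reward, done, n_actions):
    """Build one RNN input: the observation, plus the previous action
    (one-hot), the previous reward, and the episode-termination flag."""
    one_hot = [0.0] * n_actions
    one_hot[prev_action] = 1.0
    return list(obs) + one_hot + [float(prev_reward), float(done)]

# e.g. a 1-dim observation, previous action 1 of 2, reward 1.0, not done:
x = meta_rl_input([0.5], 1, 1.0, False, 2)
```

The other key detail in the quote is that the RNN’s hidden state is not reset at episode boundaries, so whatever the activations have accumulated about the current task persists into the next episode.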
As I understand it, the OpenAI authors also think they can gather evidence about the structure of the algorithm simply by looking at its behavior. Given a similar series of experiments (mostly bandit tasks, but also a maze solver), they conclude:
the dynamics of the recurrent network come to implement a learning algorithm entirely separate from the one used to train the network weights… the procedure the recurrent network implements is itself a full-fledged reinforcement learning algorithm, which negotiates the exploration-exploitation tradeoff and improves the agent’s policy based on reward outcomes… this learned RL procedure can differ starkly from the algorithm used to train the network’s weights.
They then run an experiment designed specifically to distinguish whether meta-RL was giving rise to a model-free system, or “a model-based system which learns an internal model of the environment and evaluates the value of actions at the time of decision-making through look-ahead planning,” and suggest the evidence implies the latter. This sounds like a description of search to me—do you think I’m confused?
I get the impression from your comments that you think it’s naive to describe this result as “learning algorithms spontaneously emerging.” You describe the lack of LW/AF pushback against that description as “a community-wide failure,” and mention updating as a result toward thinking AF members “automatically believe anything written in a post without checking it.”
But my impression is that OpenAI describes their similar result in a similar way. Do you think my impression is wrong? Or that e.g. their description is also misleading?
--
I’ve been feeling very confused lately about how people talk about “search,” and have started joking that I’m a search panpsychist. Lots of interesting phenomena look like piles of thermostats when viewed from the wrong angle, and I worry the conventional lens is deceptively narrow.
That said, when I condition on (what I understand to be) the conventional conception, it’s difficult for me to imagine how e.g. the maze-solver described in the OpenAI paper can quickly and reliably locate maze exits, without doing something reasonably describable as searching for them.
And it seems to me that Wang et al. should be taken as evidence that “learning algorithms producing other search-performing learning algorithms” is convergently useful/likely to be a common feature of future systems, even if you don’t think that’s what happened in their paper, as long as you assign decent credence to their underlying model that this is what’s going on in PFC, and that search occurs in PFC.
If the primary difference between the DeepMind and OpenAI meta-RL architecture and the PFC/DA architecture is scale, I think there’s good reason to suspect something much like mesa-optimization will emerge in future meta-RL systems, even if it hasn’t yet. That is, I interpret this result as evidence for the hypothesis that highly competent general-ish learners might tend to exhibit this feature, since (among other reasons) it increased my credence that it is already exhibited by the only existing member of that reference class.
Evan mentions agreeing that this result isn’t new evidence in favor of mesa-optimization. But he also mentions that Risks from Learned Optimization references these two papers, and describes them as “the closest to producing mesa-optimizers of any existing machine learning research.” I feel confused about how to reconcile these two claims. I didn’t realize these papers were mentioned in Risks from Learned Optimization, but if I had, I think I would have been even more inclined to post this/try to ensure people knew about the results, since my prior (perhaps naive, perhaps missing ways in which this is disanalogous) is that the closest existing example to this problem might provide evidence about its nature or likelihood.
I get the impression from your comments that you think it’s naive to describe this result as “learning algorithms spontaneously emerge.”
I think that’s a fine characterization (and I said so in the grandparent comment? Looking back, I said I agreed with the claim that learning is happening via neural net activations, which I guess doesn’t necessarily imply that I think it’s a fine characterization).
You describe the lack of LW/AF pushback against that description as “a community-wide failure,”
I think my original comment didn’t do a great job of phrasing my objection. My actual critique is that the community as a whole seems to be updating strongly on data-that-has-high-probability-if-you-know-basic-RL.
updating as a result toward thinking AF members “automatically believe anything written in a post without checking it.”
That was one of three possible explanations; I don’t have a strong view on which explanation is the primary cause (if any of them are). It’s more like “I observe clearly-to-me irrational behavior, this seems bad, even if I don’t know what’s causing it”. If I had to guess, I’d guess that the explanation is a combination of readers not bothering to check details and those who are checking details not knowing enough to point out that this is expected.
I feel confused about why, given your model of the situation, the researchers were surprised that this phenomenon occurred, and seem to think it was a novel finding that it will inevitably occur given the three conditions described.
Indeed, I am also confused by this, as I noted in the original comment:
I don’t understand why this was surprising to the original researchers
I have a couple of hypotheses, none of which seem particularly likely given that the authors are familiar with AI, so I just won’t speculate. I agree this is evidence against my claim that this would be obvious to RL researchers.
And this OpenAI paper [...] describes their result in similar terms:
Again, I don’t object to the description of this as learning a learning algorithm. I object to updating strongly on this. Note that the paper does not claim their results are surprising—it is written in a style of “we figured out how to make this approach work”. (The DeepMind paper does claim that the results are novel / surprising, but it is targeted at a neuroscience audience, to whom the results may indeed be surprising.)
I’ve been feeling very confused lately about how people talk about “search,” and have started joking that I’m a search panpsychist.
On the search panpsychist view, my position is that if you use deep RL to train an AGI policy, it is definitionally a mesa optimizer. (Like, anything that is “generally intelligent” has the ability to learn quickly, which on the search panpsychist view means that it is a mesa optimizer.) So in this world, “likelihood of mesa optimization via deep RL” is equivalent to “likelihood of AGI via deep RL”, and “likelihood that more general systems trained by deep RL will be mesa optimizers” is ~1 and you ~can’t update on it.
I imagine this was not your intention, but I’m a little worried that this comment will have an undesirable chilling effect. I think it’s good for people to share when members of DeepMind / OpenAI say something that sounds a lot like “we found evidence of mesa-optimization”.
I also think you’re right that we should be doing a lot better on pushing back against such claims. I hope LW/AF gets better at being as skeptical of AI researchers’ assertions that support risk as they are of those that undermine risk. But I also hope that when those researchers claim something surprising and (to us) plausibly risky is going on, we continue to hear about it.
I imagine this was not your intention, but I’m a little worried that this comment will have an undesirable chilling effect.
Note that there are desirable chilling effects too. I think it’s broadly important to push back on inaccurate claims, or ones that have the wrong level of confidence. (Like, my comment elsewhere is intended to have a chilling effect.)
I imagine this was not your intention, but I’m a little worried that this comment will have an undesirable chilling effect
I agree it might have this effect, and that it would be bad if that were to happen (all else equal). But I’d much rather have researchers with good beliefs given the evidence they have rather than researchers with lots of evidence but bad beliefs given that evidence.
(As with everything, this is a tradeoff. I haven’t specified exactly how you should weight the tradeoff, because that’s hard to do.)
Hmm, I don’t know unfortunately. I learned basic MDP theory from an undergrad course, and the rest through osmosis by being an AI PhD student at Berkeley. I haven’t read Sutton and Barto, but I would assume that would be good enough (you’d probably know more than me about tabular RL).
If you don’t have a resource, then do you have a list of pointers to what people should learn? For example the policy gradient theorem and the REINFORCE trick. It will probably not be exhaustive, I’m just trying to make your call to learn more RL theory more actionable to people here.
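(To make the kind of pointer I mean concrete: here is a minimal, stdlib-only REINFORCE sketch on a two-armed bandit, a toy of my own rather than something from any particular resource. The update is the score-function estimator from the policy gradient theorem, lr · r · ∇ log π(a).)

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

rng = random.Random(0)
logits = [0.0, 0.0]   # policy parameters: one logit per arm
lr = 0.1

for step in range(2000):
    p = softmax(logits)
    a = 0 if rng.random() < p[0] else 1
    r = 1.0 if a == 1 else 0.0   # deterministic toy rewards: arm 1 always pays
    # REINFORCE update: logits += lr * r * grad log pi(a), where
    # d(log pi(a))/d(logit_i) = 1[i == a] - p_i for a softmax policy.
    for i in range(2):
        logits[i] += lr * r * ((1.0 if i == a else 0.0) - p[i])

print(softmax(logits))   # nearly all probability mass ends up on arm 1
```

This is the sort of thing I'd hope a pointer list would let people reconstruct from first principles.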
I don’t think the takeaway here should be “read these books / watch these lectures / understand these concepts and you’ll be fine”. My claim is more like, if you want to interact with some community, you should have whatever background knowledge that community expects. Even if I just made a list of concepts, I’d expect that list to be out of date reasonably quickly (a few years), for a field like deep RL.
I think this is pretty important if you want to do any of:
Convince researchers in the field that their work would be risky if scaled up
Learn from evidence presented in papers from the field (this post)
Forecast questions relevant to the field, for questions that don’t have obvious base rates (e.g. AGI timelines)
If you don’t have the background knowledge, you can rely on someone else who has such background knowledge.
Notably, this is not important if you want to “build basic theory” or something like that, which doesn’t require interaction with the AI community. (Though it might be important for guiding your search for basic theory, I’m not sure.)
Also, I forgot to mention this before: normally for deep RL I’d recommend Spinning Up in Deep RL, though in this case that’s too focused on deep RL and not enough on RL basics.
----
EDIT: An analogy: if someone asked a handyman for a list of resources on how to fix common house problems, it’s not clear that the handyman would have remembered to give the advice “turn clockwise to tighten, and counterclockwise to loosen”, because it’s so ingrained. Similarly, I think if I had tried to give a list prior to seeing this post, I would not have thought to give the advice “think about what the optimal policy is, and then expect your RL algorithms to find similar policies”.
The handyman might not give basic advice, but if he didn’t have any advice at all, I would assume he didn’t want to help.
I’m really confused by your answers. You have a long comment criticizing the lack of basic RL knowledge of the AF community, and when I ask you for pointers, you say that you don’t want to give any, and that people should just learn the background knowledge. So should every member of the AF stop what they’re doing right now to spend 5 years doing a PhD in RL before being able to post here?
If the goal of your comment was to push people to learn things you think they should know, pointing towards some stuff (not an exhaustive list) is the bare minimum for that to be effective. If you don’t, I can’t see many people investing the time to learn enough RL so that by osmosis they can understand a point you’re making.
If the goal of your comment was to push people to learn things you think they should know, pointing towards some stuff (not an exhaustive list) is the bare minimum for that to be effective.
Here’s an obvious next step for people: google for resources on RL, ask others for recommendations on RL, try out some of the resources and see which one works best for you, and then choose one resource and dive deep into it, potentially repeating until you can understand new RL papers just by reading them. I think people would be better off executing that algorithm than looking at specific resources that I might name.
I wouldn’t be surprised if other people have better algorithms for self-learning new fields—I’m pretty atypical and shouldn’t be expected to know what works for people who aren’t me. E.g. TurnTrout has done a lot of self-learning from textbooks and probably has better advice.
I would hope most AF readers are capable of coming up with and executing something like this algorithm. If not, there are bigger problems than the lack of RL knowledge.
----
I also don’t buy that pointing out a problem is only effective if you have a concrete solution in mind. MIRI argues that it is a problem that we don’t know how to align powerful AI systems, but doesn’t seem to have any concrete solutions. Do you think this disqualifies MIRI from talking about AI risk and asking people to work on solving it?
E.g. TurnTrout has done a lot of self-learning from textbooks and probably has better advice [for learning RL]
I have been summoned! I’ve read a few RL textbooks… unfortunately, they’re either a) very boring, b) very old, or c) very superficial. I’ve read:
Reinforcement Learning by Sutton & Barto (my book review)
Nice book for learning the basics. Best textbook I’ve read for RL, but that’s not saying much.
Superficial, not comprehensive, somewhat outdated circa 2018; a good chunk was focused on older techniques I never/rarely read about again, like SARSA and exponential feature decay for credit assignment. The closest I remember them getting to DRL was when they discussed the challenges faced by function approximators.
AI: A Modern Approach 3e by Russell & Norvig (my book review)
Engaging and clear, but most of the book wasn’t about RL. Outdated, but 4e is out now and maybe it’s better.
Markov Decision Processes by Puterman
Thorough, theoretical, very old, and very boring. Formal and dry. It was written decades ago, so obviously no mention of Deep RL.
Neuro-Dynamic Programming by Tsitsiklis
When I was a wee second-year grad student, I was independently recommended this book by several senior researchers. Apparently it’s a classic. It’s very dry and was written in 1996. Pass.
OpenAI’s several-page web tutorial Spinning Up in Deep RL is somehow the most useful beginning RL material I’ve seen, outside of actually taking a class. Kinda sad.
So when I ask my brain things like “how do I know about bandits?”, the result isn’t “because I read it in {textbook #23}”, but rather “because I worked on different tree search variants my first summer of grad school” or “because I took a class”. I think most of my RL knowledge has come from:
My own theoretical RL research
the fastest way for me to figure out a chunk of relevant MDP theory is often just to derive it myself
Watercooler chats with other grad students
Sorry to say that I don’t have clear pointers to good material.
I do share your opinion on the Sutton and Barto, which is the only book I read from your list (except a bit of the Russell and Norvig, but not the RL chapter). Notably, I took a lot of time to study the action value methods, only to realise later that a lot of recent work focuses instead on policy-gradient methods (even if actor-critics do use action-values).
From your answer and Rohin’s, I gather that we lack a good resource in Deep RL, at least of the kind useful for AI Safety researchers. It makes me even more curious about the kind of knowledge that would be treated in such a resource.
Here’s an obvious next step for people: google for resources on RL, ask others for recommendations on RL, try out some of the resources and see which one works best for you, and then choose one resource and dive deep into it, potentially repeating until you can understand new RL papers just by reading them.
Agreed. Which is exactly why I asked you for recommendations. I don’t think you’re the only one someone interested in RL should ask for recommendation (I already asked other people, and knew some resource before all this), but as one of the (apparently few) members of the AF with the relevant skills in RL, it seemed that you might offer good advice on the topic.
About self-learning, I’m pretty sure people around here are good on this count. But knowing how to self-learn doesn’t mean knowing what to self-learn. Hence the pointers.
I also don’t buy that pointing out a problem is only effective if you have a concrete solution in mind. MIRI argues that it is a problem that we don’t know how to align powerful AI systems, but doesn’t seem to have any concrete solutions. Do you think this disqualifies MIRI from talking about AI risk and asking people to work on solving it?
No, I don’t think you should only point to a problem with a concrete solution in hand. But solving a research problem (which is MIRI’s case) is not the same as learning a well-established field of computer science (what this discussion is about). In the latter case, you ask people to learn things that already exist, not to invent them. And I do believe that showing some concrete things that might be relevant (as I repeated in each comment, not an exhaustive list) would make the injunction more effective.
That being said, it’s perfectly okay if you don’t want to propose anything. I’m just confused because it seems low effort for you, net positive, and the kind of “ask people for recommendation” that you preach in the previous comment. Maybe we disagree on one of these points?
Which is exactly why I asked you for recommendations.
Yes, I never said you shouldn’t ask me for recommendations. I’m saying that I don’t have any good recommendations to give, and you should probably ask other people for recommendations.
showing some concrete things that might be relevant (as I repeated in each comment, not an exhaustive list) would make the injunction more effective.
In practice I find that anything I say tends to lose its nuance as it spreads, so I’ve moved towards saying fewer things that require nuance. If I said “X might be a good resource to learn from but I don’t really know”, I would only be a little surprised to hear a complaint in the future of the form “I deeply read X for two months because Rohin recommended it, but I still can’t understand this deep RL paper”.
If I actually were confident in some resource, I agree it would be more effective to mention it.
I’m just confused because it seems low effort for you, net positive, and the kind of “ask people for recommendation” that you preach in the previous comment.
I’m not convinced the low effort version is net positive, for the reasons mentioned above. Note that I’ve already very weakly endorsed your mention of Sutton and Barto, and very weakly mentioned Spinning Up in Deep RL. (EDIT: TurnTrout doesn’t endorse Sutton and Barto much, so now neither do I.)
In practice I find that anything I say tends to lose its nuance as it spreads, so I’ve moved towards saying fewer things that require nuance. If I said “X might be a good resource to learn from but I don’t really know”, I would only be a little surprised to hear a complaint in the future of the form “I deeply read X for two months because Rohin recommended it, but I still can’t understand this deep RL paper”.
Hum, I did not think about that. It makes more sense to me now why you don’t want to point people towards specific things. I still believe the result will be net positive if the right caveats are in place (then it’s the other’s fault for misinterpreting your comment), but that’s indeed assuming that the resource/concept is good/important and you’re confident in that.
This is an aside, but I remain really confused by the claim that RL algorithms will tend to find policies close to the optimal one. Is inductive bias not a thing for RL?
It’s a thing, and is one of the caveats I mentioned.
For tabular RL, algorithms can find optimal policies in the limit of infinite exploration, but without infinite exploration how close you get to the optimal policy will depend on the environment (including reward function).
For deep RL, even with infinite exploration you don’t get the guarantee, since the optimization problem is nonconvex, and the optimal policy may not be expressible by your neural net. So it again depends heavily on the environment.
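To make the tabular claim concrete, here is a minimal sketch (all names and parameters are my own illustration, not from the discussion): tabular Q-learning with epsilon-greedy exploration on a tiny deterministic chain MDP recovers the optimal policy, precisely because this toy environment is trivial to explore.

```python
import random

# Tiny deterministic chain MDP: states 0..3, state 3 is terminal.
# Action 0 moves right (reward 1 on reaching the goal), action 1 stays (reward 0).
# Illustrative only: with enough epsilon-greedy exploration, tabular
# Q-learning recovers the optimal "always move right" policy.
N_STATES, GAMMA, ALPHA, EPS = 4, 0.9, 0.5, 0.2

def step(s, a):
    if a == 0:  # move right
        s2 = s + 1
        return s2, (1.0 if s2 == N_STATES - 1 else 0.0)
    return s, 0.0  # stay put, no reward

random.seed(0)
Q = [[0.0, 0.0] for _ in range(N_STATES)]
for _ in range(500):  # episodes
    s = 0
    for _ in range(20):  # cap episode length (action 1 alone never terminates)
        a = random.randrange(2) if random.random() < EPS else max((0, 1), key=lambda x: Q[s][x])
        s2, r = step(s, a)
        # Standard Q-learning target: r + gamma * max_a' Q(s', a'), zero at terminal.
        target = r + (0.0 if s2 == N_STATES - 1 else GAMMA * max(Q[s2]))
        Q[s][a] += ALPHA * (target - Q[s][a])
        s = s2
        if s == N_STATES - 1:
            break

greedy = [max((0, 1), key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
print(greedy)  # [0, 0, 0] — the optimal "always right" policy
```

Swap in a sparse reward, a huge state space, or a neural-net function approximator and this guarantee disappears, which is exactly the caveat being made above.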
I think the proper version of the claim is more like “if a paper reports results with RL, the policy they find is probably good, as otherwise they wouldn’t have published it”. In practice RL algorithms often fail and need to be heavily tuned to do well, and researchers have to pull out lots of tricks to get them to work.
But regardless, I claim the first-order approximation to what an RL algorithm will do is “the optimal policy”. You can then figure out reasons for deviation, e.g. “this reward is super sparse, so the algorithm won’t get learning signal, so it’ll have effectively random behavior”.
If someone expected RL algorithms to fail on this bandit task, and then updated because they succeeded, I’d find that reasonable (though I’d find it pretty surprising that they’d expect a failure on bandits—it’s a relatively simple task where you can get tons of data).
It might well be that 1) people who already know RL shouldn’t be much surprised by this result and 2) people who don’t know much RL are justified in updating on this info (towards mesa-optimizers arising more easily).
This would be the case if RL intuition correctly implies that proto-mesa-optimizers (like the one in the paper) arise naturally, and that intuition wasn’t widely shared outside of RL. Not sure if this is actually the way things are, but it seems plausible to me.
It might well be that 1) people who already know RL shouldn’t be much surprised by this result and 2) people who don’t know much RL are justified in updating on this info (towards mesa-optimizers arising more easily).
I agree. It seems pretty bad if the participants of a forum about AI alignment don’t know RL.
(EDIT: I’m already seeing downvotes of the post, it was originally at 58 AF karma. This wasn’t my intention: I think this is a failure of the community as a whole, not of the author.)
I’m very confused by this edit.
My model of the community’s failure is roughly
This post primarily argues that a phenomenon is evidence for [learned models being likely to encode search algorithms], but in fact it is not.
We would like the community to be such that this is pointed out quickly, the author edits the post accordingly, and the post does not get super high reception
Instead, the post has high karma, is curated, this wasn’t pointed out until you said it, and the post has not been edited.
If part of the failure is that the post is well-received, why wouldn’t you want people to downvote it now that you pointed it out?
I also think the average LW user shouldn’t be expected to understand enough RL to see this, so the system should detect this kind of failure for them. (Which it has done now that you’ve written your comment.) For those people, the proper reaction seems to be to remove their upvote and perhaps downvote.
Separately, I think you can explain part of the failure by laziness rather than a lack of understanding of RL. You could read/skim this post and not quite understand what the setting actually is (even though it’s mentioned at the end of the second chapter). Just like I don’t think the average LW user should be expected to understand enough ML to realize that the main point is misleading, I also don’t think they should be expected to read the post carefully enough before upvoting it, especially not if it’s curated or high karma (because that should be a quality assurance, and at that point it seems fine to upvote purely to signal-boost the point).
I realize your critique was of the AF, not of LW, so I’m not sure how much I’m really disagreeing with you here. But since Evan Hubinger understood the point and upvoted the post anyway, it’s unclear how much you can conclude. (EDIT after Rohin’s answer: actually, I agree this is most likely not a typical case.)
If part of the failure is that the post is well-received, why wouldn’t you want people to downvote it now that you pointed it out?
It feels like downvotes-as-I-see-them-in-practice are some combination of “you should feel bad about having written this” and “make worse content less visible”, and I didn’t want the first effect. Idk if that’s the right call though, and idk if that’s how others (especially the author) interpret it.
I also neglected that people can just remove their upvotes without downvoting, which feels less bad (though from the author’s perspective it’s the same, so I think I’m just being inconsistent here).
I also think the average LW user shouldn’t be expected to understand enough RL to see this
Agreed, which is why I focused on the AF karma rather than the LW karma. (I agree with the rest of that paragraph.)
Separately, I think you can explain part of the failure by laziness rather than a lack of understanding of RL. You could read/skim this post and not quite understand what the setting actually is (even though it’s mentioned at the end of the second chapter). Just like I don’t think the average LW user should be expected to understand enough ML to realize that the main point is misleading, I also don’t think they should be expected to read the post carefully enough before upvoting it, especially not if it’s curated or high karma (because that should be a quality assurance, and at that point it seems fine to upvote purely to signal-boost the point).
Agreed this is likely but still seems pretty bad—this isn’t the first time people would have updated incorrectly had I not made a correction, though this is the most upvoted case. (I perhaps find it more annoying than it really “should” be because of how much shit LW gives academia and peer review.)
I realize your critique was of the AF, not of LW, so I’m not sure how much I’m really disagreeing with you here.
Yeah, I think this would still be a critique of LW, but much less strongly.
But since Evan Hubinger understood the point and upvoted the post anyway, it’s unclear how much you can conclude.
I give it 98% chance that the majority of people who upvoted did not understand the point.
Agreed, which is why I focused on the AF karma rather than the LW karma
I think it’s worth pointing out that I originally saw this just posted to LW, and it must have been manually promoted to AF by a mod. Partly want to point it out because possibly one of the main errors is people updating too much on promotion as a signal of quality.
It’s trivially correct to update downward on the de-facto importance of promotion (by however much), but this seems like a bad thing.
Naively, I would like people to make sure they understand the point at
the curation step
the promotion-to-AF step
maybe at the upvote step if you’re a professional AI safety researcher
And if the conclusion is that the post is meaningful despite possibly being misinterpreted, I would naively want the person in charge to PM the author and ask to put in a clarification before the post is curated/promoted.
I say ‘naively’ because I don’t know anything about how hard it would be to achieve this and I could be genuinely wrong about this being a reasonable thing to want.
This post primarily argues that a phenomenon is evidence for [learned models being likely to encode search algorithms]
I do mention interpreting the described results as tentative evidence for mesa-optimization, and this interpretation was why I wrote the post; my impression is still that this interpretation was basically correct. But most of the post is just quotes or paraphrased claims made by DeepMind researchers, rather than my own claims, since I didn’t feel sure enough to make the claims myself.
(EDIT: I’m already seeing downvotes of the post, it was originally at 58 AF karma. This wasn’t my intention: I think this is a failure of the community as a whole, not of the author.)
Okay, this has gotten enough karma and has been curated and has influenced another post, so I suppose I should engage, especially since I’m not planning to put this in the Alignment Newsletter.
(A lot copied over from this comment of mine)
This is extremely basic RL theory.
The linked paper studies bandit problems, where each episode of RL is a new bandit problem where the agent doesn’t know which arm gives maximal reward. Unsurprisingly, the agent learns to first explore, and then exploit the best arm. This is a simple consequence of the fact that you have to look at observations to figure out what to do. Basic POMDP theory will tell you that when you have partial observability your policy needs to depend on history, i.e. it needs to learn.
However, because bandit problems have been studied in the AI literature, and “learning algorithms” have been proposed to solve bandit problems, this very normal fact of a policy depending on observation history is now trotted out as “learning algorithms spontaneously emerge”. I don’t understand why this was surprising to the original researchers, it seems like if you just thought about what the optimal policy would be given the observable information, you would make exactly this prediction. Perhaps it’s because it’s primarily a neuroscience paper, and they weren’t very familiar with AI.
More broadly, I don’t understand what people are talking about when they speak of the “likelihood” of mesa optimization. If you mean the chance that the weights of a neural network are going to encode some search algorithm, then this paper should be ~zero evidence in favor of it. If you mean the chance that a policy trained by RL will “learn” without gradient descent, I can’t imagine a way that could fail to be true for an intelligent system trained by deep RL—presumably a system that is intelligent is capable of learning quickly, and when we talk about deep RL leading to an intelligent AI system, presumably we are talking about the policy being intelligent (what else?), therefore the policy must “learn” as it is being executed.
Gwern notes here that we’ve seen this elsewhere. This is because it’s exactly what you’d expect, just that in the other cases we call conditioning on observations “adaptation” rather than “learning”.
----
Meta: I’m disappointed that I had to be the one to point this out. (Though to be fair, Gwern clearly understands this point.) There’s clearly been a lot of engagement with this post, and yet this seemingly obvious point hasn’t been said. When I saw this post first come up, my immediate reaction was “oh I’m sure this is a typical LW example of a case where the optimal policy is interpreted as learning, I’m not even going to bother clicking on the link”. Do we really have so few people who understand machine learning, that of the many, many views this post must have had, not one person could figure this out? It’s really no surprise that ML researchers ignore us if this is the level of ML understanding we as a community have.
EDIT: I should give credit to Nevan for pointing out that this paper is not much evidence in favor of the hypothesis that the neural network weights encode some search algorithm (before I wrote this comment).
I note that this doesn’t feel like a problem to me, mostly because of reasons related to Explainers Shoot High. Aim Low!. Even among ML experts, many of them haven’t touched much RL, because they’re focused on another field. Why expect them to know basic RL theory, or to have connected that to all the other things that they know?
I don’t think I have a fully crisp view of this, but here’s my frame on it so far:
One view is that we design algorithms to do things, and those algorithms have properties that we can reason about. Another is that we design loss functions, and then search through random options for things that perform well on those loss functions. In the second view, often which options we search through doesn’t matter very much, because there’s something like the “optimal solution” that all things we actually find will be trying to approximate in one way or another.
Mesa-optimization is something like, “when we search through the options, will we find something that itself searches through a different set of options?”. Some of those searches are probably benign—the bandit algorithm updating its internal value function in response to evidence, for example—and some of those searches are probably malign (or, at least, dangerous). In particular, we might think we have restrictions on the behavior of the base-level optimizer that turn out to not apply to any subprocesses it manages to generate, and so those properties don’t actually hold overall.
But it seems to me like overall we’re somewhat confused about this. For example, the way I normally use the word “search”, it doesn’t apply to the bandit algorithm updating its internal value function. But does Abram’s distinction between mesa-search and mesa-control actually mean much? There’s lots of problems that you can solve exactly with calculus, and solve approximately with well-tuned simple linear estimators, and thus saying “oh, it can’t do calculus, it can only do linear estimates” won’t rule out it having a really good solution; presumably a similar thing could be true with “search” vs. “control,” where in fact you might be able to build a pretty good search-approximator out of elements that only do control.
So, what would it mean to talk about the “likelihood” of mesa optimization? Well, I remember a few years back when there was a lot of buzz about hierarchical RL. That is, you would have something like a policy for which ‘tactic’ (or ‘sub-policy’ or whatever you want to call it) to deploy, and then each ‘tactic’ is itself a policy for what action to take. In 2015, it would have been sensible to talk about the ‘likelihood’ of RL models in 2020 being organized that way. (Even now, we can talk about the likelihood that models in 2025 will be organized that way!) But, empirically, this seems to have mostly not helped (at least as we’ve tried it so far).
As we imagine deploying more complicated models, it feels like there are two broad classes of things that can happen during runtime:
‘Task location’, where they know what to do in a wide range of environments, and all they’re learning is which environment they’re in. The multi-armed bandit is definitely in this case; GPT-3 seems like it’s mostly doing this.
‘Task learning’, where they are running some sort of online learning process that gives them ‘new capabilities’ as they encounter new bits of the world.
The two blur into each other; you can imagine training a model to deal with a range of situations, and yet it also performs well on situations not seen in training (that are interpolations between situations it has seen, or where the old abstractions apply correctly, and thus aren’t “entirely new” situations). Just like some people argue that anything we know how to do isn’t “artificial intelligence”, you might get into a situation where anything we know how to do is task ‘location’ instead of task ‘learning.’
But to the extent that our safety guarantees rely on the lack of capability in an AI system, any ability for the AI system to do learning instead of location means that it may gain capabilities we didn’t expect it to have. That said, merely restricting it to ‘location’ may not help us very much, because if we misunderstand the abstractions that govern the system’s generalizability, we may underestimate what capabilities it will or won’t have.
I think people often underestimate the degree to which, if they want to see their opinions in a public forum, they will have to be the one to post them. This is both because some points are less widely understood than you might think, and because even if the someone understands the point, that doesn’t mean it connects to their interests in a way that would make them say anything about it.
I’m perfectly happy with good explanations that don’t assume background knowledge. The flaw I am pointing to has nothing to do with explanations. It is that despite this evidence being a clear consequence of basic RL theory, for some reason readers are treating it as important evidence. Clearly I should update negatively on things-AF-considers-important. At a more gears level, presumably I should update towards some combination of:
AF readers don’t know RL.
AF readers upvote anything that’s cheering for their team.
AF readers automatically believe anything written in a post without checking it.
Any of these would be a pretty damning critique of the forum. And the update should be fairly strong, given that this was (prior to my comment) the highest-upvoted post ever by AF karma.
If you saw a post that ran an experiment where they put their hand in boiling water, and the conclusion was “boiling water is dangerous”, and you saw it get to be the most upvoted post ever on LessWrong, with future posts citing it as evidence for boiling water being dangerous, would your reaction be “huh, I guess I need to state my opinion that this is obvious”?
There’s a difference between “I’m surprised no one has made this connection to this other thing” and “I’m surprised that readers are updating on facts that I expected them to already know”.
I don’t usually expect my opinions to show up on a public forum. For example, I am continually sad but not surprised about the fact that AF focuses on mesa optimizers as separate from capability generalization without objective generalization.
I guess I should explain why I upvoted this post despite agreeing with you that it’s not new evidence in favor of mesa-optimization. I actually had a conversation about this post with Adam Shimi prior to you commenting on it where I explained to him that I thought that not only was none of it new but also that it wasn’t evidence about the internal structure of models and therefore wasn’t really evidence about mesa-optimization. Nevertheless, I chose to upvote the post and not comment my thoughts on it. Some reasons why I did that:
I generally upvote most attempts on LW/AF to engage with the academic literature—I think that LW/AF would generally benefit from engaging with academia more and I like to do what I can to encourage that when I see it.
I didn’t feel like any comment I would have made would have anything more to say than things I’ve said in the past. In fact, in “Risks from Learned Optimization” itself, we talk about both a) why we chose to be agnostic about whether current systems exhibit mesa-optimization due to the difficulty of determining whether a system is actually implementing search or not (link) and b) examples of current work that we thought did seem to come closest to being evidence of mesa-optimization such as RL^2 (and I think RL^2 is a better example than the work linked here) (link).
(Flagging that I curated the post, but was mostly relying on Ben and Habryka’s judgment, in part since I didn’t see much disagreement. Since this discussion I’ve become more agnostic about how important this post is.)
One thing this comment makes me want is more nuanced reacts that people have affordance to communicate how they feel about a post, in a way that’s easier to aggregate.
Though I also notice that with this particular post it’s a bit unclear which react would be appropriate, since it sounds like it’s not “disagree” so much as “this post seems confused” or something.
FWIW, I appreciated that your curation notice explicitly includes the desire for more commentary on the results, and that curating it seems to have been a contributor to there being more commentary.
FWIW, I say: don’t let that stop you! (Don’t be afraid to repeat yourself, especially if there’s evidence that the point has not been widely appreciated.)
Unfortunately, I also only have so much time, and I don’t generally think that repeating myself regularly in AF/LW comments is a super great use of it.
Very fair.
The solution is clear: someone needs to create an Evan bot that will comment on every post of the AF related to mesa-optimization, by providing the right pointers to the paper.
Fair enough, those are sensible reasons. I don’t like the fact that the incentive gradient points away from making intellectual progress, but it’s not an obvious choice.
Given karma inflation (as users gain more karma, their votes are worth more, but this doesn’t propagate backwards to earlier votes they cast, and more people become AF voters than lose AF voter status), I think the karma differences between this post and these other 4 50+ karma posts [1 2 3 4] are basically noise. So I think the actual question is “is this post really in that tier?”, to which “probably not” seems like a fair answer.
[I am thinking more about other points you’ve made, but it seemed worth writing a short reply on that point.]
Agreed. I still think I should update fairly strongly.
I agree this sort of thing is something you could mean by the “likelihood of mesa optimization”. As I said in the grandparent:
In practice, when people say they “updated in favor of mesa optimization”, they refer to evidence that says approximately nothing about what is “happening at runtime”; therefore I infer that they cannot be talking about mesa optimization in the sense you mean.
I appreciate you writing this, Rohin. I don’t work in ML, or do safety research, and it’s certainly possible I misunderstand how this meta-RL architecture works, or that I misunderstand what’s normal.
That said, I feel confused by a number of your arguments, so I’m working on a reply. Before I post it, I’d be grateful if you could help me make sure I understand your objections, so as to avoid accidentally publishing a long post in response to a position nobody holds.
I currently understand you to be making four main claims:
The system is just doing the totally normal thing “conditioning on observations,” rather than something it makes sense to describe as “giving rise to a separate learning algorithm.”
It is probably not the case that in this system, “learning is implemented in neural activation changes rather than neural weight changes.”
The system does not encode a search algorithm, so it provides “~zero evidence” about e.g. the hypothesis that mesa-optimization is convergently useful, or likely to be a common feature of future systems.
The above facts should be obvious to people familiar with ML.
Does this summary feel like it reasonably characterizes your objections?
Thanks. I know I came off pretty confrontational, sorry about that. I didn’t mean to target you specifically; I really do see this as bad at the community level but fine at the individual level.
I don’t think you’ve exactly captured what I meant, some comments below.
I think it is reasonable to describe it both as “conditioning on observations” and as “giving rise to a separate learning algorithm”.
On my interpretation of “learning” in this context, I would agree with that claim (i.e. I agree that learning is implemented in activation changes rather then weight changes via gradient descent). Idk what other people mean by “learning” though.
This sounds roughly right if you use the words as I mean them, but I suspect you aren’t using the words as I mean them.
There’s this thing where the mesa-optimization paper talks about a neural net that performs “search” via activation changes. When I read the paper, I took this to be an illustrative example, that was meant to stand in for “learning” more broadly, but that made more concrete and easier to reason about. (I didn’t think this consciously.) However, whenever I talk to people about this paper, they have different understandings of what is meant by “search”, and varying opinions on how much mesa optimization should be tied to “search”. But I think the typical opinion is that whether or not mesa optimization is happening depends on what algorithm the neural net weights encode, and you can’t deduce whether mesa optimization is happening just by looking at the behavior in the training environment, as it may just have “memorized” what good behavior is rather than “performing search”.
If you use this meaning of “search algorithm”, then you can’t tell whether a good policy is a “search algorithm” or not just by looking at behavior. Since this paper only talked about behavior of a good policy, it can’t be evidence in favor of “mesa-optimization-via-search-algorithm”.
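To make that distinction concrete, here is a toy illustration (my own hypothetical code, not anything from the paper): two policies with identical behavior on a maze, one of which performs breadth-first search at decision time, while the other only looks up memorized answers.

```python
from collections import deque

# A tiny gridworld: '.' is open, '#' is a wall, 'G' is the goal.
GRID = ["..#",
        ".#G",
        "..."]
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

def search_policy(pos):
    """Runs breadth-first search to the goal at decision time,
    returning the first move along a shortest path."""
    queue = deque([(pos, None)])
    seen = {pos}
    while queue:
        (r, c), first = queue.popleft()
        if GRID[r][c] == "G":
            return first
        for name, (dr, dc) in MOVES.items():
            nr, nc = r + dr, c + dc
            if (0 <= nr < len(GRID) and 0 <= nc < len(GRID[0])
                    and GRID[nr][nc] != "#" and (nr, nc) not in seen):
                seen.add((nr, nc))
                # Children inherit the first move of the path to them.
                queue.append(((nr, nc), first or name))
    return None

# A lookup table built by tabulating search_policy on every open cell.
# It is behaviorally identical on this maze, but encodes no search at all.
table_policy = {(r, c): search_policy((r, c))
                for r in range(len(GRID)) for c in range(len(GRID[0]))
                if GRID[r][c] != "#"}

print(search_policy((0, 0)), table_policy[(0, 0)])  # identical answers
```

Observing the training environment alone cannot distinguish the two: any behavioral test you run on `search_policy` gives the same answer on `table_policy`.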
Oh definitely not those, most people in ML have never heard of “mesa optimization”.
----
I think my response to Vaniver better illustrates my concerns, but let me take a stab at making a simple list of claims.
1. The optimal policy in the bandit environment considered in the paper requires keeping track of the rewards you have gotten in the past, and basing your future decisions on this information.
2. You shouldn’t be surprised when applying an RL algorithm to a problem leads to a near-optimal policy for that problem. (This has many caveats, but they aren’t particularly relevant.)
3. Therefore, you shouldn’t be surprised by the results in this paper.
4. Therefore, you shouldn’t be updating based on this paper.
5. Claims 1 and 2 require only basic knowledge about RL.
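Claim 1 can be sketched in a few lines (my own toy code, not the paper’s setup): in a bandit where the arm payoffs are redrawn every episode, a policy that conditions on its reward history beats any policy that ignores observations.

```python
import random

def run_episode(policy, n_arms=2, horizon=100, seed=None):
    """One bandit episode: arm payoff probabilities are drawn fresh,
    so the agent cannot know the best arm in advance."""
    rng = random.Random(seed)
    probs = [rng.random() for _ in range(n_arms)]
    history = []  # list of (arm, reward) pairs
    total = 0.0
    for _ in range(horizon):
        arm = policy(history, n_arms)
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        history.append((arm, reward))
        total += reward
    return total

def memoryless(history, n_arms):
    # Ignores observations entirely: cannot adapt to this episode's best arm.
    return 0

def explore_then_exploit(history, n_arms, explore_steps=20):
    # Conditions on history: try each arm, then commit to the empirical best.
    if len(history) < explore_steps:
        return len(history) % n_arms
    means = []
    for a in range(n_arms):
        rewards = [r for (arm, r) in history if arm == a]
        means.append(sum(rewards) / max(len(rewards), 1))
    return max(range(n_arms), key=lambda a: means[a])

episodes = 200
fixed = sum(run_episode(memoryless, seed=i) for i in range(episodes)) / episodes
adaptive = sum(run_episode(explore_then_exploit, seed=i) for i in range(episodes)) / episodes
print(fixed, adaptive)  # the history-dependent policy earns more on average
```

An optimal (or near-optimal) policy for this environment must look something like the second function, which is exactly the explore-then-exploit behavior the paper reports.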
I feel confused about why, on this model, the researchers were surprised that this occurred, and seem to think it was a novel finding that it will inevitably occur given the three conditions described. Above, you mentioned the hypothesis that maybe they just weren’t very familiar with AI. But looking at the author list, and their publications (e.g.1, 2, 3, 4, 5, 6, 7, 8), this seems implausible to me. Most of the co-authors are neuroscientists by training, but a few have CS degrees, and all but one have co-authored previous ML papers. It’s hard for me to imagine their surprise was due to them lacking basic knowledge about RL?
Also, this OpenAI paper (whose authors seem quite familiar with ML)—which the summary of Wang et al. on DeepMind’s website describes as “closely related work,” and which appears to me to involve a very similar setup—describes their result similarly:
As I understand it, the OpenAI authors also think they can gather evidence about the structure of the algorithm simply by looking at its behavior. Given a similar series of experiments (mostly bandit tasks, but also a maze solver), they conclude:
They then run an experiment designed specifically to distinguish whether meta-RL was giving rise to a model-free system, or “a model-based system which learns an internal model of the environment and evaluates the value of actions at the time of decision-making through look-ahead planning,” and suggest the evidence implies the latter. This sounds like a description of search to me—do you think I’m confused?
I get the impression from your comments that you think it’s naive to describe this result as “learning algorithms spontaneously emerging.” You describe the lack of LW/AF pushback against that description as “a community-wide failure,” and mention updating as a result toward thinking AF members “automatically believe anything written in a post without checking it.”
But my impression is that OpenAI describes their similar result in a similar way. Do you think my impression is wrong? Or that e.g. their description is also misleading?
--
I’ve been feeling very confused lately about how people talk about “search,” and have started joking that I’m a search panpsychist. Lots of interesting phenomena look like piles of thermostats when viewed from the wrong angle, and I worry the conventional lens is deceptively narrow.
That said, when I condition on (what I understand to be) the conventional conception, it’s difficult for me to imagine how e.g. the maze-solver described in the OpenAI paper can quickly and reliably locate maze exits, without doing something reasonably describable as searching for them.
And it seems to me that Wang et al. should be taken as evidence that “learning algorithms producing other search-performing learning algorithms” is convergently useful/likely to be a common feature of future systems, even if you don’t think that’s what happened in their paper, as long as you assign decent credence to their underlying model that this is what’s going on in PFC, and that search occurs in PFC.
If the primary difference between the DeepMind and OpenAI meta-RL architectures and the PFC/DA architecture is scale, I think there’s good reason to suspect something much like mesa-optimization will emerge in future meta-RL systems, even if it hasn’t yet. That is, I interpret this result as evidence for the hypothesis that highly competent general-ish learners might tend to exhibit this feature, since (among other reasons) it increased my credence that it is already exhibited by the only existing member of that reference class.
Evan mentions agreeing that this result isn’t new evidence in favor of mesa-optimization. But he also mentions that Risks from Learned Optimization references these two papers, and describes them as “the closest to producing mesa-optimizers of any existing machine learning research.” I feel confused about how to reconcile these two claims. I didn’t realize these papers were mentioned in Risks from Learned Optimization, but if I had, I think I would have been even more inclined to post this/try to ensure people knew about the results, since my (perhaps naive, perhaps not understanding ways this is disanalogous) prior is that the closest existing example to this problem might provide evidence about its nature or likelihood.
I think that’s a fine characterization (and I said so in the grandparent comment? Looking back, I said I agreed with the claim that learning is happening via neural net activations, which I guess doesn’t necessarily imply that I think it’s a fine characterization).
I think my original comment didn’t do a great job of phrasing my objection. My actual critique is that the community as a whole seems to be updating strongly on data-that-has-high-probability-if-you-know-basic-RL.
That was one of three possible explanations; I don’t have a strong view on which explanation is the primary cause (if any of them are). It’s more like “I observe clearly-to-me irrational behavior, this seems bad, even if I don’t know what’s causing it”. If I had to guess, I’d guess that the explanation is a combination of readers not bothering to check details and those who are checking details not knowing enough to point out that this is expected.
Indeed, I am also confused by this, as I noted in the original comment:
I have a couple of hypotheses, none of which seem particularly likely given that the authors are familiar with AI, so I just won’t speculate. I agree this is evidence against my claim that this would be obvious to RL researchers.
Again, I don’t object to the description of this as learning a learning algorithm. I object to updating strongly on this. Note that the paper does not claim their results are surprising—it is written in a style of “we figured out how to make this approach work”. (The DeepMind paper does claim that the results are novel / surprising, but it is targeted at a neuroscience audience, to whom the results may indeed be surprising.)
On the search panpsychist view, my position is that if you use deep RL to train an AGI policy, it is definitionally a mesa optimizer. (Like, anything that is “generally intelligent” has the ability to learn quickly, which on the search panpsychist view means that it is a mesa optimizer.) So in this world, “likelihood of mesa optimization via deep RL” is equivalent to “likelihood of AGI via deep RL”, and “likelihood that more general systems trained by deep RL will be mesa optimizers” is ~1 and you ~can’t update on it.
I imagine this was not your intention, but I’m a little worried that this comment will have an undesirable chilling effect. I think it’s good for people to share when members of DeepMind / OpenAI say something that sounds a lot like “we found evidence of mesaoptimization”.
I also think you’re right that we should be doing a lot better on pushing back against such claims. I hope LW/AF gets better at being as skeptical of AI researchers assertions that support risk as they are of those that undermine risk. But I also hope that when those researchers claim something surprising and (to us) plausibly risky is going on, we continue to hear about it.
Note that there are desirable chilling effects too. I think it’s broadly important to push back on inaccurate claims, or ones that have the wrong level of confidence. (Like, my comment elsewhere is intended to have a chilling effect.)
I agree it might have this effect, and that it would be bad if that were to happen (all else equal). But I’d much rather have researchers with good beliefs given the evidence they have rather than researchers with lots of evidence but bad beliefs given that evidence.
(As with everything, this is a tradeoff. I haven’t specified exactly how you should weight the tradeoff, because that’s hard to do.)
What would be a good resource to level up on RL theory? Is the Sutton and Barto good enough, or do you have something else in mind?
Hmm, I don’t know unfortunately. I learned basic MDP theory from an undergrad course, and the rest through osmosis by being an AI PhD student at Berkeley. I haven’t read Sutton and Barto, but I would assume that would be good enough (you’d probably know more than me about tabular RL).
If you don’t have a resource, then do you have a list of pointers to what people should learn? For example the policy gradient theorem and the REINFORCE trick. It will probably not be exhaustive, I’m just trying to make your call to learn more RL theory more actionable to people here.
I don’t think the takeaway here should be “read these books / watch these lectures / understand these concepts and you’ll be fine”. My claim is more like, if you want to interact with some community, you should have whatever background knowledge that community expects. Even if I just made a list of concepts, I’d expect that list to be out of date reasonably quickly (a few years), for a field like deep RL.
I think this is pretty important if you want to do any of:
Convince researchers in the field that their work would be risky if scaled up
Learn from evidence presented in papers from the field (this post)
Forecast questions relevant to the field, for questions that don’t have obvious base rates (e.g. AGI timelines)
If you don’t have the background knowledge, you can rely on someone else who has such background knowledge.
Notably, this is not important if you want to “build basic theory” or something like that, which doesn’t require interaction with the AI community. (Though it might be important for guiding your search for basic theory, I’m not sure.)
Also, I forgot to mention this before: normally for deep RL I’d recommend Spinning Up in Deep RL, though in this case that’s too focused on deep RL and not enough on RL basics.
----
EDIT: An analogy: if someone asked a handyman for a list of resources on how to fix common house problems, it’s not clear that the handyman would have remembered to give the advice “turn clockwise to tighten, and counterclockwise to loosen”, because it’s so ingrained. Similarly, I think if I had tried to give a list prior to seeing this post, I would not have thought to give the advice “think about what the optimal policy is, and then expect your RL algorithms to find similar policies”.
It’s the other way around, right?
Lol yes fixed
The handyman might not give basic advice, but if he didn’t have any advice, I would assume that he doesn’t want to help.
I’m really confused by your answers. You have a long comment criticizing the lack of basic RL knowledge of the AF community, and when I ask you for pointers, you say that you don’t want to give any, and that people should just learn the background knowledge. So should every member of the AF stop what they’re doing right now to spend 5 years doing a PhD in RL before being able to post here?
If the goal of your comment was to push people to learn things you think they should know, pointing towards some stuff (not an exhaustive list) is the bare minimum for that to be effective. If you don’t, I can’t see many people investing the time to learn enough RL so that by osmosis they can understand a point you’re making.
Here’s an obvious next step for people: google for resources on RL, ask others for recommendations on RL, try out some of the resources and see which one works best for you, and then choose one resource and dive deep into it, potentially repeat until you understand new RL papers by reading. I think people would be better off executing that algorithm than looking at specific resources that I might name.
I wouldn’t be surprised if other people have better algorithms for self-learning new fields—I’m pretty atypical and shouldn’t be expected to know what works for people who aren’t me. E.g. TurnTrout has done a lot of self-learning from textbooks and probably has better advice.
I would hope most AF readers are capable of coming up with and executing something like this algorithm. If not, there are bigger problems than the lack of RL knowledge.
----
I also don’t buy that pointing out a problem is only effective if you have a concrete solution in mind. MIRI argues that it is a problem that we don’t know how to align powerful AI systems, but doesn’t seem to have any concrete solutions. Do you think this disqualifies MIRI from talking about AI risk and asking people to work on solving it?
I have been summoned! I’ve read a few RL textbooks… unfortunately, they’re either a) very boring, b) very old, or c) very superficial. I’ve read:
Reinforcement Learning by Sutton & Barto (my book review)
Nice book for learning the basics. Best textbook I’ve read for RL, but that’s not saying much.
Superficial, not comprehensive, somewhat outdated circa 2018; a good chunk was focused on older techniques I never/rarely read about again, like SARSA and exponential feature decay for credit assignment. The closest I remember them getting to DRL was when they discussed the challenges faced by function approximators.
AI: A Modern Approach 3e by Russell & Norvig (my book review)
Engaging and clear, but most of the book wasn’t about RL. Outdated, but 4e is out now and maybe it’s better.
Markov Decision Processes by Puterman
Thorough, theoretical, very old, and very boring. Formal and dry. It was written decades ago, so obviously no mention of Deep RL.
Neuro-Dynamic Programming by Tsitsiklis
When I was a wee second-year grad student, I was independently recommended this book by several senior researchers. Apparently it’s a classic. It’s very dry and was written in 1996. Pass.
OpenAI’s several-page web tutorial Spinning Up in Deep RL is somehow the most useful beginning RL material I’ve seen, outside of actually taking a class. Kinda sad.
So when I ask my brain things like “how do I know about bandits?”, the result isn’t “because I read it in {textbook #23}”, but rather “because I worked on different tree search variants my first summer of grad school” or “because I took a class”. I think most of my RL knowledge has come from:
My own theoretical RL research
the fastest way for me to figure out a chunk of relevant MDP theory is often just to derive it myself
Watercooler chats with other grad students
Sorry to say that I don’t have clear pointers to good material.
Thanks for the in-depth answer!
I do share your opinion on Sutton and Barto, which is the only book I read from your list (except a bit of Russell and Norvig, but not the RL chapter). Notably, I took a lot of time to study action-value methods, only to realise later that a lot of recent work focuses instead on policy-gradient methods (even if actor-critics do use action values).
From your answer and Rohin’s, I gather that we lack a good resource in Deep RL, at least of the kind useful for AI Safety researchers. It makes me even more curious about the kind of knowledge such a resource would cover.
Agreed. Which is exactly why I asked you for recommendations. I don’t think you’re the only one someone interested in RL should ask for recommendations (I already asked other people, and knew some resources before all this), but as one of the (apparently few) members of the AF with the relevant skills in RL, it seemed that you might offer good advice on the topic.
About self-learning, I’m pretty sure people around here are good on this count. But knowing how to self-learn doesn’t mean knowing what to self-learn. Hence the pointers.
No, I don’t think you should only point to a problem with a concrete solution in hand. But solving a research problem (which is MIRI’s case) is not the same as learning a well-established field of computer science (which is what this discussion is about). In the latter case, you’re asking people to learn things that already exist, not to invent them. And I do believe that showing some concrete things that might be relevant (as I repeated in each comment, not an exhaustive list) would make the injunction more effective.
That being said, it’s perfectly okay if you don’t want to propose anything. I’m just confused because it seems low effort for you, net positive, and the kind of “ask people for recommendation” that you preach in the previous comment. Maybe we disagree on one of these points?
Yes, I never said you shouldn’t ask me for recommendations. I’m saying that I don’t have any good recommendations to give, and you should probably ask other people for recommendations.
In practice I find that anything I say tends to lose its nuance as it spreads, so I’ve moved towards saying fewer things that require nuance. If I said “X might be a good resource to learn from but I don’t really know”, I would only be a little surprised to hear a complaint in the future of the form “I deeply read X for two months because Rohin recommended it, but I still can’t understand this deep RL paper”.
If I actually were confident in some resource, I agree it would be more effective to mention it.
I’m not convinced the low effort version is net positive, for the reasons mentioned above. Note that I’ve already very weakly endorsed your mention of Sutton and Barto, and very weakly mentioned Spinning Up in Deep RL. (EDIT: TurnTrout doesn’t endorse Sutton and Barto much, so now neither do I.)
Hum, I did not think about that. It makes more sense to me now why you don’t want to point people towards specific things. I still believe the result will be net positive if the right caveats are in place (then it’s the other’s fault for misinterpreting your comment), but that’s indeed assuming that the resource/concept is good/important and you’re confident in that.
This is an aside, but I remain really confused by the claim that RL algorithms will tend to find policies close to the optimal one. Is inductive bias not a thing for RL?
It’s a thing, and is one of the caveats I mentioned.
For tabular RL, algorithms can find optimal policies in the limit of infinite exploration, but without infinite exploration how close you get to the optimal policy will depend on the environment (including reward function).
For deep RL, even with infinite exploration you don’t get the guarantee, since the optimization problem is nonconvex, and the optimal policy may not be expressible by your neural net. So it again depends heavily on the environment.
I think the proper version of the claim is more like “if a paper reports results with RL, the policy they find is probably good, as otherwise they wouldn’t have published it”. In practice RL algorithms often fail and need to be heavily tuned to do well, and researchers have to pull out lots of tricks to get them to work.
But regardless, I claim the first-order approximation to what an RL algorithm will do is “the optimal policy”. You can then figure out reasons for deviation, e.g. “this reward is super sparse, so the algorithm won’t get learning signal, so it’ll have effectively random behavior”.
If someone expected RL algorithms to fail on this bandit task, and then updated because they succeeded, I’d find that reasonable (though I’d find it pretty surprising that they’d expect a failure on bandits—it’s a relatively simple task where you can get tons of data).
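The exploration caveat can be made concrete with a toy sketch (my own illustration, not from any of the papers discussed): tabular value estimation with epsilon-greedy exploration on a fixed Bernoulli bandit. With ample exploration and data, the greedy policy reliably converges to the optimal arm, which is why success on a bandit task shouldn’t be surprising.

```python
import random

def epsilon_greedy_bandit(probs, steps, epsilon, lr=0.1, seed=0):
    """Tabular value estimation with epsilon-greedy exploration on a
    fixed Bernoulli bandit. Returns the greedy arm after training."""
    rng = random.Random(seed)
    q = [0.0] * len(probs)  # running estimate of each arm's mean reward
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(probs))  # explore
        else:
            arm = max(range(len(probs)), key=lambda a: q[a])  # exploit
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        q[arm] += lr * (reward - q[arm])  # incremental mean update
    return max(range(len(probs)), key=lambda a: q[a])

probs = [0.2, 0.8, 0.5]
# Ample exploration and data: the greedy arm is reliably the best one (index 1).
print(epsilon_greedy_bandit(probs, steps=5000, epsilon=0.1))
```

With epsilon set to 0 or with very few steps, the same algorithm can easily latch onto a suboptimal arm, which is the kind of environment-dependent deviation from “the optimal policy” described above.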
It might well be that 1) people who already know RL shouldn’t be much surprised by this result and 2) people who don’t know much RL are justified in updating on this info (towards mesa-optimizers arising more easily).
This would be the case if RL intuition correctly implies that proto-mesa-optimizers (like the one in the paper) arise naturally, and that intuition wasn’t widely shared outside of RL. Not sure if this is actually the way things are, but it seems plausible to me.
I agree. It seems pretty bad if the participants of a forum about AI alignment don’t know RL.
I’m very confused by this edit.
My model of the community’s failure is roughly
This post primarily argues that a phenomenon is evidence for [learned models being likely to encode search algorithms], but in fact it is not.
We would like the community to be such that this is pointed out quickly, the author edits the post accordingly, and the post does not get super high reception
Instead, the post has high karma, is curated, this wasn’t pointed out until you said it, and the post has not been edited.
If part of the failure is that the post is well-received, why wouldn’t you want people to downvote it now that you pointed it out?
I also think the average LW user shouldn’t be expected to understand enough RL to see this, so the system should detect this kind of failure for them. (Which it has done now that you’ve written your comment.) For those people, the proper reaction seems to be to remove their upvote and perhaps downvote.
Separately, I think you can explain part of the failure by laziness rather than a lack of understanding of RL. You could read/skim this post and not quite understand what the setting actually is (even though it’s mentioned at the end of the second chapter). Just like I don’t think the average LW user should be expected to understand enough ML to realize that the main point is misleading, I also don’t think they should be expected to read the post carefully enough before upvoting it, especially not if it’s curated or high karma (because that should be a quality assurance, and at that point it seems fine to upvote purely to signal-boost the point).
I realize your critique was of the AF, not of LW, so I’m not sure how much I’m really disagreeing with you here. But since Evan Hubinger understood the point and upvoted the post anyway, it’s unclear how much you can conclude. (EDIT after rohin’s answer: actually, I agree this is most likely not a typical case.)
It feels like downvotes-as-I-see-them-in-practice are some combination of “you should feel bad about having written this” and “make worse content less visible”, and I didn’t want the first effect. Idk if that’s the right call though, and idk if that’s how others (especially the author) interpret it.
I also neglected that people can just remove their upvotes without downvoting, which feels less bad (though from the author’s perspective it’s the same, so I think I’m just being inconsistent here).
Agreed, which is why I focused on the AF karma rather than the LW karma. (I agree with the rest of that paragraph.)
Agreed this is likely but still seems pretty bad—this isn’t the first time people would have updated incorrectly had I not made a correction, though this is the most upvoted case. (I perhaps find it more annoying than it really “should” be because of how much shit LW gives academia and peer review.)
Yeah, I think this would still be a critique of LW, but much less strongly.
I give it 98% chance that the majority of people who upvoted did not understand the point.
I think it’s worth pointing out that I originally saw this just posted to LW, and it must have been manually promoted to AF by a mod. I partly want to point this out because possibly one of the main errors is people updating too much on promotion as a signal of quality.
It’s trivially correct to update downward on the de-facto importance of promotion (by however much), but this seems like a bad thing.
Naively, I would like people to make sure they understand the point at
the curation step
the promotion-to-AF step
maybe at the upvote step if you’re a professional AI safety researcher
And if the conclusion is that the post is meaningful despite possibly being misinterpreted, I would naively want the person in charge to PM the author and ask to put in a clarification before the post is curated/promoted.
I say ‘naively’ because I don’t know anything about how hard it would be to achieve this and I could be genuinely wrong about this being a reasonable thing to want.
I do mention interpreting the described results as tentative evidence for mesa-optimization, and this interpretation was why I wrote the post; my impression is still that this interpretation was basically correct. But most of the post is just quotes or paraphrased claims made by DeepMind researchers, rather than my own claims, since I didn’t feel sure enough to make the claims myself.