I note that this doesn’t feel like a problem to me, mostly for reasons related to “Explainers Shoot High. Aim Low!”. Even among ML experts, many haven’t touched much RL, because they’re focused on another field. Why expect them to know basic RL theory, or to have connected it to all the other things that they know?
More broadly, I don’t understand what people are talking about when they speak of the “likelihood” of mesa optimization.
I don’t think I have a fully crisp view of this, but here’s my frame on it so far:
One view is that we design algorithms to do things, and those algorithms have properties that we can reason about. Another is that we design loss functions, and then search through random options for things that perform well on those loss functions. In the second view, often which options we search through doesn’t matter very much, because there’s something like the “optimal solution” that all things we actually find will be trying to approximate in one way or another.
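To make the contrast concrete, here’s a minimal sketch (the toy problem and all names are invented for illustration, not taken from the post): the first function is designed, and we can state its properties directly; the second just searches random options against a loss, and anything it finds with low loss is approximating the same optimal solution.

```python
import random

# View 1: design the algorithm directly; its properties (exactness,
# linear time) are things we can reason about.
def mean_by_design(xs):
    return sum(xs) / len(xs)

# View 2: design a loss function, then search random options for
# whatever performs well on it.
def squared_loss(c, xs):
    return sum((x - c) ** 2 for x in xs)

def mean_by_search(xs, n_candidates=10000):
    # Which options we search through barely matters: any low-loss
    # candidate is approximating the same "optimal solution" (the mean).
    candidates = [random.uniform(min(xs), max(xs)) for _ in range(n_candidates)]
    return min(candidates, key=lambda c: squared_loss(c, xs))
```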
Mesa-optimization is something like, “when we search through the options, will we find something that itself searches through a different set of options?”. Some of those searches are probably benign—the bandit algorithm updating its internal value function in response to evidence, for example—and some of those searches are probably malign (or, at least, dangerous). In particular, we might think we have restrictions on the behavior of the base-level optimizer that turn out to not apply to any subprocesses it manages to generate, and so those properties don’t actually hold overall.
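For concreteness, here’s a minimal sketch of the benign case, assuming a simple epsilon-greedy bandit (the specific form is my choice for illustration): the only internal process is the value estimates being nudged by evidence.

```python
import random

class EpsilonGreedyBandit:
    """Toy bandit whose only internal process is value-estimate updating."""

    def __init__(self, n_arms, epsilon=0.1):
        self.epsilon = epsilon
        self.values = [0.0] * n_arms  # the internal value function
        self.counts = [0] * n_arms

    def act(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.values))  # explore
        return max(range(len(self.values)), key=self.values.__getitem__)

    def update(self, arm, reward):
        # Incremental move toward the observed reward: this is the
        # "updating its internal value function in response to evidence".
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```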
But it seems to me like, overall, we’re somewhat confused about this. For example, the way I normally use the word “search”, it doesn’t apply to the bandit algorithm updating its internal value function. But does Abram’s distinction between mesa-search and mesa-control actually mean much? There are lots of problems that you can solve exactly with calculus and approximately with well-tuned simple linear estimators, so saying “oh, it can’t do calculus, it can only do linear estimates” won’t rule out it having a really good solution. Presumably something similar could be true of “search” vs. “control”: you might be able to build a pretty good search-approximator out of elements that only do control.
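To gesture at what that might look like, here’s a rough sketch (toy problem and constants invented for illustration; whether the local comparisons below already count as “search” is exactly the line-drawing problem): a rule that only makes local, feedback-driven nudges, never enumerating an option set, yet lands where an explicit search would.

```python
def explicit_search(f, candidates):
    return max(candidates, key=f)  # enumerate options, pick the best

def control_only(f, x=0.0, step=0.5, iters=200):
    for _ in range(iters):
        # Local, feedback-driven nudges rather than evaluation of an option set.
        if f(x + step) > f(x):
            x += step
        elif f(x - step) > f(x):
            x -= step
        else:
            step /= 2  # settle in around the current operating point
    return x

f = lambda x: -(x - 3.0) ** 2  # peak at x = 3
print(explicit_search(f, [i / 100 for i in range(1000)]))  # 3.0
print(control_only(f))                                     # ~3.0
```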
So, what would it mean to talk about the “likelihood” of mesa optimization? Well, I remember a few years back when there was a lot of buzz about hierarchical RL. That is, you would have something like a policy for which ‘tactic’ (or ‘sub-policy’ or whatever you want to call it) to deploy, and then each ‘tactic’ is itself a policy for what action to take. In 2015, it would have been sensible to talk about the ‘likelihood’ of RL models in 2020 being organized that way. (Even now, we can talk about the likelihood that models in 2025 will be organized that way!) But, empirically, this seems to have mostly not helped (at least as we’ve tried it so far).
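For concreteness, here’s a minimal sketch of that shape (the tactics and the threshold rule are invented; in actual hierarchical RL both levels would be learned rather than hand-coded):

```python
from typing import Callable, Sequence

Tactic = Callable[[Sequence[float]], int]  # observation -> primitive action

def tactic_flee(obs: Sequence[float]) -> int:
    return 0  # primitive action 0: retreat

def tactic_forage(obs: Sequence[float]) -> int:
    return 1  # primitive action 1: gather

def meta_policy(obs: Sequence[float]) -> Tactic:
    # Top-level policy: choose which sub-policy ('tactic') to deploy.
    return tactic_flee if obs[0] > 0.5 else tactic_forage

def act(obs: Sequence[float]) -> int:
    return meta_policy(obs)(obs)  # pick a tactic, then let it pick the action

print(act([0.9]))  # high threat -> flee   -> action 0
print(act([0.1]))  # low threat  -> forage -> action 1
```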
As we imagine deploying more complicated models, it feels like there are two broad classes of things that can happen during runtime:
‘Task location’, where they know what to do in a wide range of environments, and all they’re learning is which environment they’re in. The multi-armed bandit is definitely in this category; GPT-3 seems like it’s mostly doing this.
‘Task learning’, where they are running some sort of online learning process that gives them ‘new capabilities’ as they encounter new bits of the world. (A toy sketch contrasting the two follows this list.)
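Here’s the promised sketch of the contrast (environments, cues, and the update rule are all invented for illustration):

```python
# Task location: the trained policy already contains a response for every
# environment it knows; runtime inference only identifies which one it's in.
TRAINED_RESPONSES = {"desert": "conserve water", "tundra": "seek shelter"}

def locate_and_act(observation: str) -> str:
    env = "desert" if "sand" in observation else "tundra"  # inference only
    return TRAINED_RESPONSES[env]  # nothing appears that wasn't stored

# Task learning: an online update produces estimates that were never
# stored in the policy, so capabilities can grow at runtime.
def learn_and_act(rewards):
    estimate = 0.0
    for i, r in enumerate(rewards, start=1):
        estimate += (r - estimate) / i  # online update from new evidence
    return estimate  # a quantity the system was never shipped with

print(locate_and_act("sand dunes ahead"))   # conserve water
print(learn_and_act([1.0, 0.0, 1.0, 1.0]))  # 0.75
```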
The two blur into each other; you can imagine training a model to deal with a range of situations, and yet it also performs well on situations not seen in training (that are interpolations between situations it has seen, or where the old abstractions apply correctly, and thus aren’t “entirely new” situations). Just like some people argue that anything we know how to do isn’t “artificial intelligence”, you might get into a situation where anything we know how to do is task ‘location’ instead of task ‘learning.’
But to the extent that our safety guarantees rely on the lack of capability in an AI system, any ability for the AI system to do learning instead of location means that it may gain capabilities we didn’t expect it to have. That said, merely restricting it to ‘location’ may not help us very much, because if we misunderstand the abstractions that govern the system’s generalizability, we may underestimate what capabilities it will or won’t have.
There’s clearly been a lot of engagement with this post, and yet this seemingly obvious point hasn’t been said.
I think people often underestimate the degree to which, if they want to see their opinions in a public forum, they will have to be the one to post them. This is both because some points are less widely understood than you might think, and because even if someone understands the point, that doesn’t mean it connects to their interests in a way that would make them say anything about it.
I note that this doesn’t feel like a problem to me, mostly for reasons related to “Explainers Shoot High. Aim Low!”. Even among ML experts, many haven’t touched much RL, because they’re focused on another field. Why expect them to know basic RL theory, or to have connected it to all the other things that they know?
I’m perfectly happy with good explanations that don’t assume background knowledge. The flaw I am pointing to has nothing to do with explanations. It is that despite this evidence being a clear consequence of basic RL theory, for some reason readers are treating it as important evidence. Clearly I should update negatively on things-AF-considers-important. At a more gears level, presumably I should update towards some combination of:
AF readers don’t know RL.
AF readers upvote anything that’s cheering for their team.
AF readers automatically believe anything written in a post without checking it.
Any of these would be a pretty damning critique of the forum. And the update should be fairly strong, given that this was (prior to my comment) the highest-upvoted post ever by AF karma.
I think people often underestimate the degree to which, if they want to see their opinions in a public forum, they will have to be the one to post them.
If you saw a post that ran an experiment where they put their hand in boiling water, and the conclusion was “boiling water is dangerous”, and you saw it get to be the most upvoted post ever on LessWrong, with future posts citing it as evidence for boiling water being dangerous, would your reaction be “huh, I guess I need to state my opinion that this is obvious”?
There’s a difference between “I’m surprised no one has made this connection to this other thing” and “I’m surprised that readers are updating on facts that I expected them to already know”.
I don’t usually expect my opinions to show up on a public forum. For example, I am continually sad but not surprised about the fact that AF focuses on mesa optimizers as separate from capability generalization without objective generalization.
I guess I should explain why I upvoted this post despite agreeing with you that it’s not new evidence in favor of mesa-optimization. I actually had a conversation about this post with Adam Shimi before you commented on it, in which I explained that I thought not only was none of it new, but also that it wasn’t evidence about the internal structure of models, and therefore wasn’t really evidence about mesa-optimization. Nevertheless, I chose to upvote the post and not comment my thoughts on it. Some reasons why I did that:
I generally upvote most attempts on LW/AF to engage with the academic literature—I think that LW/AF would generally benefit from engaging with academia more and I like to do what I can to encourage that when I see it.
I didn’t feel like any comment I would have made would have anything more to say than things I’ve said in the past. In fact, in “Risks from Learned Optimization” itself, we talk about both a) why we chose to be agnostic about whether current systems exhibit mesa-optimization due to the difficulty of determining whether a system is actually implementing search or not (link) and b) examples of current work that we thought did seem to come closest to being evidence of mesa-optimization such as RL^2 (and I think RL^2 is a better example than the work linked here) (link).
(Flagging that I curated the post, but was mostly relying on Ben and Habryka’s judgment, in part since I didn’t see much disagreement. Since this discussion I’ve become more agnostic about how important this post is.)
One thing this comment makes me want is more nuanced reacts that give people an affordance to communicate how they feel about a post, in a way that’s easier to aggregate.
Though I also notice that with this particular post it’s a bit unclear which react would be appropriate, since it sounds like it’s not “disagree” so much as “this post seems confused” or something.
FWIW, I appreciated that your curation notice explicitly includes the desire for more commentary on the results, and that curating it seems to have been a contributor to there being more commentary.
I didn’t feel like any comment I would have made would have anything more to say than things I’ve said in the past.
FWIW, I say: don’t let that stop you! (Don’t be afraid to repeat yourself, especially if there’s evidence that the point has not been widely appreciated.)
Unfortunately, I also only have so much time, and I don’t generally think that repeating myself regularly in AF/LW comments is a super great use of it.
Very fair.
The solution is clear: someone needs to create an Evan bot that will comment on every AF post related to mesa-optimization, providing the right pointers to the paper.
Fair enough, those are sensible reasons. I don’t like the fact that the incentive gradient points away from making intellectual progress, but it’s not an obvious choice.
And the update should be fairly strong, given that this was (prior to my comment) the highest-upvoted post ever by AF karma.
Given karma inflation (as users gain more karma, their votes are worth more, but this doesn’t propagate backwards to earlier votes they cast, and more people become AF voters than lose AF voter status), I think the karma differences between this post and these other four 50+ karma posts [1 2 3 4] are basically noise. So I think the actual question is “is this post really in that tier?”, to which “probably not” seems like a fair answer.
[I am thinking more about other points you’ve made, but it seemed worth writing a short reply on that point.]
Agreed. I still think I should update fairly strongly.
it feels like there are two broad classes of things that can happen during runtime:
I agree this sort of thing is something you could mean by the “likelihood of mesa optimization”. As I said in the grandparent:
this paper should be ~zero evidence in favor of [mesa optimization in the task learning sense].
In practice, when people say they “updated in favor of mesa optimization”, they refer to evidence that says approximately nothing about what is “happening at runtime”; therefore I infer that they cannot be talking about mesa optimization in the sense you mean.