I note that this doesn’t feel like a problem to me, mostly for reasons related to Explainers Shoot High. Aim Low! Even among ML experts, many haven’t touched much RL because they’re focused on another field. Why expect them to know basic RL theory, or to have connected it to all the other things they know?
I’m perfectly happy with good explanations that don’t assume background knowledge. The flaw I am pointing to has nothing to do with explanations. It is that despite this result being a straightforward consequence of basic RL theory (see the toy sketch below), readers are treating it as important new evidence. Clearly I should update negatively on things-AF-considers-important. At a more gears level, presumably I should update towards some combination of:
AF readers don’t know RL.
AF readers upvote anything that’s cheering for their team.
AF readers automatically believe anything written in a post without checking it.
Any of these would be a pretty damning critique of the forum. And the update should be fairly strong, given that this was (prior to my comment) the highest-upvoted post ever by AF karma.
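To make the “clear consequence of basic RL theory” point concrete, here is a minimal sketch in Python, using a hypothetical two-armed bandit rather than anything from the post: whenever the task is resampled each episode, any policy that conditions on within-episode history strictly beats every memoryless policy, so plain return maximization already selects for within-episode adaptation.

```python
import random

# Toy meta-RL setup (illustrative only): each episode, one of two arms is
# randomly chosen to pay 1 and the other pays 0. A memoryless policy cannot
# average more than 0.5 reward per step, while a policy that conditions on
# the episode's history ("explore, then exploit") does much better -- so a
# return-maximizing learner is pushed toward within-episode adaptation.

EPISODES, STEPS = 10_000, 10

def average_reward(policy):
    total = 0.0
    for _ in range(EPISODES):
        good_arm = random.randrange(2)        # task resampled every episode
        history = []                          # (action, reward) pairs so far
        for _ in range(STEPS):
            action = policy(history)
            reward = 1.0 if action == good_arm else 0.0
            history.append((action, reward))
            total += reward
    return total / (EPISODES * STEPS)

def memoryless(history):
    # Ignores history; no fixed or random arm choice beats 0.5 on average.
    return random.randrange(2)

def explore_then_exploit(history):
    # Probe arm 0 once, then commit to whichever arm turned out to pay.
    if not history:
        return 0
    first_action, first_reward = history[0]
    return first_action if first_reward == 1.0 else 1 - first_action

print(f"memoryless policy:    {average_reward(memoryless):.3f}")            # ~0.50
print(f"history-based policy: {average_reward(explore_then_exploit):.3f}")  # ~0.95
```

A recurrent policy trained with any standard RL algorithm on a distribution like this is therefore incentivized to implement something like the second policy in its activations, which is the same basic phenomenon the post reports at larger scale.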
I think people often underestimate the degree to which, if they want to see their opinions on a public forum, they will have to be the ones to post them.
If you saw a post that ran an experiment where the authors put their hand in boiling water, with the conclusion “boiling water is dangerous”, and you saw it become the most upvoted post ever on LessWrong, with future posts citing it as evidence that boiling water is dangerous, would your reaction be “huh, I guess I need to state my opinion that this is obvious”?
There’s a difference between “I’m surprised no one has made this connection to this other thing” and “I’m surprised that readers are updating on facts that I expected them to already know”.
I don’t usually expect my opinions to show up on a public forum. For example, I am continually sad, but not surprised, that AF focuses on mesa optimizers as a phenomenon separate from capability generalization without objective generalization (i.e., systems that keep their capabilities off-distribution while pursuing the wrong objective).
I guess I should explain why I upvoted this post despite agreeing with you that it’s not new evidence in favor of mesa-optimization. I actually discussed this post with Adam Shimi before you commented on it, and explained that I thought not only was none of it new, but also that it wasn’t evidence about the internal structure of models, and therefore wasn’t really evidence about mesa-optimization. Nevertheless, I chose to upvote the post and not to post my thoughts in a comment. Some reasons why I did that:
I generally upvote most attempts on LW/AF to engage with the academic literature—I think that LW/AF would generally benefit from engaging with academia more and I like to do what I can to encourage that when I see it.
I didn’t feel like any comment I would have made would have anything more to say than things I’ve said in the past. In fact, in “Risks from Learned Optimization” itself, we talk about both a) why we chose to be agnostic about whether current systems exhibit mesa-optimization, given the difficulty of determining whether a system is actually implementing search (link), and b) examples of current work that we thought came closest to being evidence of mesa-optimization, such as RL^2 (which I think is a better example than the work linked here) (link).
(Flagging that I curated the post, but was mostly relying on Ben and Habryka’s judgment, in part since I didn’t see much disagreement. Since this discussion I’ve become more agnostic about how important this post is.)
One thing this comment makes me want is more nuanced reacts, so that people have an affordance to communicate how they feel about a post in a way that’s easier to aggregate.

Though I also notice that with this particular post it’s a bit unclear what react would be appropriate, since it sounds like it’s not “disagree” so much as “this post seems confused” or something.
FWIW, I appreciated that your curation notice explicitly included the desire for more commentary on the results, and that curating it seems to have contributed to there being more commentary.
I didn’t feel like any comment I would have made would have anything more to say than things I’ve said in the past.
FWIW, I say: don’t let that stop you! (Don’t be afraid to repeat yourself, especially if there’s evidence that the point has not been widely appreciated.)
Unfortunately, I also only have so much time, and I don’t generally think that repeating myself regularly in AF/LW comments is a super great use of it.
Very fair.

The solution is clear: someone needs to create an Evan bot that will comment on every AF post related to mesa-optimization, providing the right pointers to the paper.

Fair enough; those are sensible reasons. I don’t like that the incentive gradient points away from making intellectual progress, but it’s not an obvious choice.
And the update should be fairly strong, given that this was (prior to my comment) the highest-upvoted post ever by AF karma.
Given karma inflation (as users gain more karma their votes are worth more, but this doesn’t propagate backwards to earlier votes they cast, and more people become AF voters than lose AF voter status), I think the karma differences between this post and these other four 50+ karma posts [1 2 3 4] are basically noise. So I think the actual question is “is this post really in that tier?”, to which “probably not” seems like a fair answer.
[I am thinking more about other points you’ve made, but it seemed worth writing a short reply on that point.]
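To illustrate the inflation mechanism, here is a toy simulation with made-up growth rates and vote weights (not LW’s actual voting formula): the voter pool grows each year and vote weight grows with tenure, but old votes are never re-weighted, so identical posts score higher simply by being posted later.

```python
import random

# Toy model of karma inflation (all parameters are assumptions, not LW's
# actual system): the voter pool grows over time, a voter's vote weight
# grows with their tenure, and earlier votes are never retroactively
# re-weighted. The same post, with the same 40% upvote rate, then earns
# more karma the later it is posted.

random.seed(0)

def karma_of_identical_post(year):
    n_voters = 50 + 20 * year                  # assumed pool growth per year
    karma = 0
    for _ in range(n_voters):
        tenure = random.uniform(0, year + 1)   # years this voter has been active
        vote_weight = 1 + int(tenure)          # assumed: weight grows with tenure
        if random.random() < 0.4:              # identical appeal every year
            karma += vote_weight
    return karma

for year in range(5):
    print(f"year {year}: identical post earns ~{karma_of_identical_post(year)} karma")
```

Under dynamics like these, raw karma is only comparable between posts from roughly the same era, which is why small differences among the top posts look like noise.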
Agreed. I still think I should update fairly strongly.