> (EDIT: I’m already seeing downvotes of the post; it was originally at 58 AF karma. This wasn’t my intention: I think this is a failure of the community as a whole, not of the author.)
I’m very confused by this edit.
My model of the community’s failure is roughly:
- This post primarily argues that a phenomenon is evidence for [learned models being likely to encode search algorithms], but in fact it is not.
- We would like the community to be such that this is pointed out quickly, the author edits the post accordingly, and the post does not get a super-high reception.
- Instead, the post has high karma, is curated, this wasn’t pointed out until you said it, and the post has not been edited.
If part of the failure is that the post is well-received, why wouldn’t you want people to downvote it now that you pointed it out?
I also think the average LW user shouldn’t be expected to understand enough RL to see this, so the system should detect this kind of failure for them. (Which it has done now that you’ve written your comment.) For those people, the proper reaction seems to be to remove their upvote and perhaps downvote.
Separately, I think you can explain part of the failure by laziness rather than by a lack of understanding of RL. You could read or skim this post and not quite understand what the setting actually is (even though it’s mentioned at the end of the second section). Just as I don’t think the average LW user should be expected to understand enough ML to realize that the main point is misleading, I also don’t think they should be expected to read the post carefully enough before upvoting it, especially not if it’s curated or has high karma (because that should be a quality assurance, and at that point it seems fine to upvote purely to signal-boost the point).
I realize your critique was of the AF, not of LW, so I’m not sure how much I’m really disagreeing with you here. But since Evan Hubinger understood the point and upvoted the post anyway, it’s unclear how much you can conclude. (EDIT after Rohin’s answer: actually, I agree this is most likely not a typical case.)
> If part of the failure is that the post is well-received, why wouldn’t you want people to downvote it now that you pointed it out?
It feels like downvotes-as-I-see-them-in-practice are some combination of “you should feel bad about having written this” and “make worse content less visible”, and I didn’t want the first effect. Idk if that’s the right call though, and idk if that’s how others (especially the author) interpret it.
I also neglected that people can just remove their upvotes without downvoting, which feels less bad (though from the author’s perspective it’s the same, so I think I’m just being inconsistent here).
> I also think the average LW user shouldn’t be expected to understand enough RL to see this
Agreed, which is why I focused on the AF karma rather than the LW karma. (I agree with the rest of that paragraph.)
> Separately, I think you can explain part of the failure by laziness rather than by a lack of understanding of RL. You could read or skim this post and not quite understand what the setting actually is (even though it’s mentioned at the end of the second section). Just as I don’t think the average LW user should be expected to understand enough ML to realize that the main point is misleading, I also don’t think they should be expected to read the post carefully enough before upvoting it, especially not if it’s curated or has high karma (because that should be a quality assurance, and at that point it seems fine to upvote purely to signal-boost the point).
Agreed this is likely but still seems pretty bad—this isn’t the first time people would have updated incorrectly had I not made a correction, though this is the most upvoted case. (I perhaps find it more annoying than it really “should” be because of how much shit LW gives academia and peer review.)
> I realize your critique was of the AF, not of LW, so I’m not sure how much I’m really disagreeing with you here.
Yeah, I think this would still be a critique of LW, but much less strongly.
> But since Evan Hubinger understood the point and upvoted the post anyway, it’s unclear how much you can conclude.
I give it 98% chance that the majority of people who upvoted did not understand the point.
> Agreed, which is why I focused on the AF karma rather than the LW karma
I think it’s worth pointing out that I originally saw this just posted to LW; it must have been manually promoted to the AF by a mod. I partly want to point this out because one of the main errors may be people updating too much on promotion as a signal of quality.
It’s trivially correct to update downward on the de-facto importance of promotion (by however much), but this seems like a bad thing.
Naively, I would like people to make sure they understand the point at:
- the curation step
- the promotion-to-AF step
- maybe the upvote step, if you’re a professional AI safety researcher
And if the conclusion is that the post is meaningful despite possibly being misinterpreted, I would naively want the person in charge to PM the author and ask them to put in a clarification before the post is curated/promoted.
I say ‘naively’ because I don’t know anything about how hard it would be to achieve this and I could be genuinely wrong about this being a reasonable thing to want.
> This post primarily argues that a phenomenon is evidence for [learned models being likely to encode search algorithms]
I do mention interpreting the described results as tentative evidence for mesa-optimization, and this interpretation was why I wrote the post; my impression is still that this interpretation was basically correct. But most of the post is just quotes or paraphrased claims made by DeepMind researchers, rather than my own claims, since I didn’t feel sure enough to make the claims myself.