Rohin Shah comments on Review AI Alignment posts to help figure out how to make a proper AI Alignment review

Rohin Shah 10 Jan 2023 10:52 UTC
LW: 28 AF: 17
21
AF
one of the things that has come up most frequently (including in conversations with Rohin, Buck and various Open Phil people) is that many people wish there was more of a review process in the AI Alignment field.
Hmm, I think I’ve complained a bunch about lots of AI alignment work being conceptually confused, or simply stating points rather than arguing for them, or being otherwise epistemically sketchy. But I also don’t particularly feel optimistic about a review process either; for that to fix these problems the reviewers would have to be more epistemically competent than the post authors, and that currently doesn’t seem likely to happen.
Also, when I actually imagine what the reviews would look like, I mostly think of people talking about the same old cruxes and disagreements that change whether or not the work is worth doing at all, rather than actually talking about the details, which is what I would usually find useful about reviews.
(Tbc, it’s possible I did express optimism about a review process in conversation with you; my opinions could have changed a bunch. I would be a bit surprised though.)
- Orpheus16 10 Jan 2023 19:44 UTC
  6 points
  2
  Parent
  How would you feel about a review process that had two sections?
  
  Section One: How important do you find this work & to what extent do you think the research is worth doing? (Ex: Does it strike at what you see as core alignment problems?)
  
  Section Two: What do you think of the details of the research? (Ex: Do you see any methodological flaws, do you have any ideas for further work, etc).
  
  My impression is that academic peer-reviewers generally do both of these. Compared to academic peer-review, LW/AF discussions tend to have a lot of Section One and not much Section Two.
  
  My (low-resilience) guess is that the field would benefit from more “Section Two Engagement” from people with different worldviews.
  
  (On the other hand, perhaps people with different worldviews and research priorities won’t have a comparative advantage in offering specific, detailed critiques. Ex: An agent foundations researcher might not be very good at providing detailed critiques of an interpretability paper.)
  
  (But to counter that point, maybe there are certain specific/detailed critiques that are easier for “outsiders” to catch.)
  - Rohin Shah 11 Jan 2023 8:18 UTC
    4 points
    2
    Parent
    Disclaimer: I’m not particularly confident in any of my views here or in my original comment. I mostly commented on the post because the post implied that I supported the idea of a review process whereas my actual opinion is mixed (not uniformly negative). If I hadn’t been named explicitly I wouldn’t have said anything; I don’t want to stop people from trying this out if they think it would be good; it’s quite plausible that someone thinking about this full time would have a vision for it that would be good that I haven’t even considered yet (given how little I’ve thought about it).
    I think how excited I’d be would depend a lot more on the details (e.g. who are the reviewers, how much time do they spend, how are they incentivized, what happens after the reviews are completed). But if we just imagine the LessWrong Review extended to the Alignment Forum, I’m not that excited, because I predict (not confidently) that the reviews just wouldn’t be good at engaging with the details. (Mostly because LW / AF comments don’t seem very good at engaging with details on existing LW / AF posts, and because typical LW / AF commenters don’t seem familiar enough with ML to judge details in ML papers.)
    My impression is that academic peer-reviewers generally do both of these.
    Academic peer review does do both in principle, but I’d say that typically most of the emphasis is on Section Two. Generally the Section One style review is just “yup, this is in fact trying to make progress on a problem academia has previously deemed important, and is not just regurgitating things that people previously said” (i.e. it is significant and novel).
    (It is common for bad reviews to just say “this is not significant / not novel” and then ignore Section One entirely, but this is pretty commonly thought of as explicitly “this was a bad review”, unless they actually justified “not significant / not novel” well enough that most others would agree with them.)
- Buck 26 Jan 2023 5:30 UTC
  LW: 5 AF: 3
  0
  AF Parent
  But I also don’t particularly feel optimistic about a review process either; for that to fix these problems the reviewers would have to be more epistemically competent than the post authors, and that currently doesn’t seem likely to happen.
  For what it’s worth, this is also where I’m at on an Alignment Forum review.
  - Raemon 26 Jan 2023 20:34 UTC
    LW: 4 AF: 2
    2
    AF Parent
    I’ve been trying to articulate some thoughts since Rohin’s original comment, and maybe going to just rant-something-out now.
    On one hand: I don’t have a confident belief that writing in-depth reviews is worth Buck or Rohin’s time (or their immediate colleague’s time for that matter). It’s a lot of work, there’s a lot of other stuff worth doing. And I know at least Buck and Rohin have already spent quite a lot of time arguing about the conceptual deep disagreements for many of the top-voted posts.
    On the other hand, the combination of “there’s stuff epistemically wrong or confused or sketchy about LW”, but “I don’t trust a review process to actually work because I don’t believe the it’ll get better epistemics than what have already been demonstrated” seems a combination of “self-defeatingly wrong” and “also just empirically (probably) wrong”.
    Presumably Rohin and Buck and similar colleagues think they have at least (locally) better epistemics than the writers they’re frustrated by.
    I’m guessing your take is like “I, Buck/Rohin, could write a review that was epistemically adequate, but I’m busy and don’t expect it to accomplish anything that useful.” Assuming that’s a correct characterization, I don’t necessarily disagree (at least not confidently). But something about the phrasing feels off.
    Some reasons it feels off:
    Even if there are clusters of research that seem too hopeless to be worth engaging with, I’d be very surprised if there weren’t at least some clusters of research that Rohin/Buck/etc are more optimistic about. If what happens is “people write reviews of the stuff that feels real/important enough to be worth engaging with”, that still seems valuable to me.
    It seems like people are sort of treating this like a stag-hunt, and it’s not worth participating if a bunch of other effort isn’t going in. I do think there are network effects that make it more valuable as more people participate. But I also think “people incrementally do more review work each year as it builds momentum” is pretty realistic, and I think individual thoughtful reviews are useful in isolation for building clarity on individual posts.
    The LessWrong/Alignment Review process is pretty unopinionated at the moment. If you think a particular type of review is more valuable than other types, there’s nothing stopping you from doing that type of review.
    If the highest review-voted work is controversial, I think it’s useful for the field orienting to know that it’s controversial. It feels pretty reasonable to me to publish an Alignment Forum Journal-ish-thing that includes the top-voted content, with short reviews from other researchers saying “FYI I disagree conceptually here about this post being a good intellectual output”
    (or, stepping out of the LW-review frame: if the alignment field is full of controversy and people who think each other are confused, I think this is a fairly reasonable fact to come out of any kind of review process)
    I’m skeptical that the actual top-voted posts trigger this reaction. At the time of this post, the top voted posts were:
    ARC’s first technical report: Eliciting Latent Knowledge
    What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)
    Another (outer) alignment failure story
    Finite Factored Sets
    Ngo and Yudkowsky on alignment difficulty
    
    Of those, my guess is that only “Ngo and Yudkowsky on alignment difficulty” and maybe “Another (outer) alignment failure” story have the qualities I understood Rohin to be commenting on. (which isn’t to say I expect him to think the other posts are all valuable, I’m not sure, just that they don’t seem to have the particular failure modes he was frustrated by)
    I do think a proper alignment review should likely have more content that wasn’t published on alignment forum. This was technically available this year (we allowed people to submit non-LW content during the nomination phase), but we didn’t promote it very heavily and it didn’t frame it as a “please submit all Alignment progress you think was particularly noteworthy” to various researchers.
    I don’t know that the current review process is great, but, again, it’s fairly unopinionated and leaves plenty of room to be-the-change-you-want-to-see in the alignment scene meta-reflection.
    (aside: I apologize for picking on Rohin and Buck when they bothered to stick their neck out and comment, presumably there are other people who feel similarly who didn’t even bother commenting. I appreciate you sharing your take, and if this feels like dragging you into something you don’t wanna deal with, no worries. But, I think having concrete people/examples is helpful. I also think a lot of what I’m saying applies to people I’d characterize as “in the MIRI camp”, who also haven’t done much reviewing, although I’d frame my response a bit differently)
    - Rohin Shah 5 Feb 2023 10:08 UTC
      LW: 2 AF: 2
      0
      AF Parent
      Sorry, didn’t see this until now (didn’t get a notification, since it was a reply to Buck’s comment).
      I’m guessing your take is like “I, Buck/Rohin, could write a review that was epistemically adequate, but I’m busy and don’t expect it to accomplish anything that useful.”
      In some sense yes, but also, looking at posts I’ve commented on in the last ~6 months, I have written several technical reviews (and nontechnical reviews). And these are only the cases where I wrote a comment that engaged in detail with the main point of the post; many of my other comments review specific claims and arguments within posts.
      (I would be interested in quantifications of how valuable those reviews are to people other than the post authors. I’d think it is pretty low.)
      I’d be very surprised if there weren’t at least some clusters of research that Rohin/Buck/etc are more optimistic about.
      Yes, but they’re usually papers, not LessWrong posts, and I do give feedback to their authors—it just doesn’t happen publicly.
      (And it would be maybe 10x more work to make it public, because (a) I have to now write the review to be understandable by people with wildly different backgrounds and (b) I would hold myself to a higher standard (imo correctly).)
      (Indeed if you look at the reviews I linked above one common thread is that they are responding to specific people whose views I have some knowledge of, and the reviews are written with those people in mind as the audience.)
      I also think “people incrementally do more review work each year as it builds momentum” is pretty realistic
      I broadly agree with this and mostly feel like it is the sort of thing that is happening amongst the folks who are working on prosaic alignment.
      If the highest review-voted work is controversial, I think it’s useful for the field orienting to know that it’s controversial. [...] if the alignment field is full of controversy and people who think each other are confused, I think this is a fairly reasonable fact to come out of any kind of review process
      We already know this though? You have to isolate particular subclusters (Nate/Eliezer, shard theory folks, IDA/debate folks, etc) before it’s even plausible to find pieces of work that might be uncontroversial. We don’t need to go through a review effort to learn that.
      (This is different from beliefs / opinions that are uncontroversial; there are lots of those.)
      (And when I say that they are controversial, I mean that people will disagree significantly on whether it makes progress on alignment, or what the value of the work is; often the work will make technical claims that are uncontroversial. I do think it could be good to highlight which of the technical claims are controversial.)
      I’m skeptical that the actual top-voted posts trigger this reaction.
      What is “this reaction”?
      If you mean “work being conceptually confused, or simply stating points rather than arguing for them, or being otherwise epistemically sketchy”, then I agree there are posts that don’t trigger this reaction (but that doesn’t seem too relevant to whether it is good to write reviews).
      If you mean “reviews of these posts by a randomly selected alignment person would not be very useful”, then I do still have that reaction to every single one of those posts.
- Raemon 17 Jan 2023 1:32 UTC
  LW: 4 AF: 3
  0
  AF Parent
  rather than actually talking about the details, which is what I would usually find useful about reviews.
  I’m interested in details about what you find useful about the prospect of reviews that talk about the details. I share a sense that it’d be helpful, but I’m not sure I could justify that belief very strongly (when it comes to the opportunity cost of the people qualified to do the job)
  In general, I’m legit fairly uncertain whether “effort-reviews”(whether detail-focused or big-picture focused) are worthwhile. It seems plausible to me that detail-focused-reviews are more useful soon after a work is published than 2 years later, and that big-picture-reviews are more useful in the “two year retrospective” sense (and maybe we should figure out some way to get detail-oriented reviews done more frequently, faster).
  It does seem to me that, by the time a book is being considered for “long-term-valuable’, I would like someone, at some point, to have done a detail-oriented review examining all of the fiddly pieces. In some cases, that review has been done before the post was even published, in a private google doc.
  - Rohin Shah 18 Jan 2023 16:26 UTC
    LW: 6 AF: 3
    1
    AF Parent
    A couple of reasons:
    It’s far easier for me to figure out how much to update on evidence when someone else has looked at the details and highlighted ways in which the evidence is stronger or weaker than a reader might naively take away from the paper. (At least, assuming the reviewer did a good job.)
    This doesn’t apply to big-picture reviews because such reviews are typically a rehash of old arguments I already know.
    This is similar to the general idea in AI safety via debate—when you have access to a review you are more like a judge; without a review you are more like the debate opponent.
    Having someone else explain the paper from their perspective can surface other ways of thinking about the paper that can help with understanding it.
    This sometimes does happen with big-picture reviews, though I think it’s less common.
    Tbc, I’m not necessarily saying it is worth the opportunity cost of the reviewer’s time; I haven’t thought much about it.
- Joe Collman 13 Jan 2023 12:17 UTC
  LW: 3 AF: 2
  0
  AF Parent
  for that to fix these problems the reviewers would have to be more epistemically competent than the post authors
  I think this is an overstatement. They’d need to notice issues the post authors missed. That doesn’t require greater epistemic competence: they need only tend to make different mistakes, not fewer mistakes.
  Certainly there’s a point below which the signal-to-noise ratio is too low. I agree that high reviewer quality is important.
  On the “same old cruxes and disagreements” I imagine you’re right—but to me that suggests we need a more effective mechanism to clarify/resolve them (I think you’re correct in implying that review is not that mechanism—I don’t think academic review achieves this either). It’s otherwise unsurprising that they bubble up everywhere.
  I don’t have any clear sense of the degree of time and effort that has gone into clarifying/resolving such cruxes, and I’m sure it tends to be a frustrating process. However, my guess is that the answer is “nowhere close to enough”. Unless researchers have very high confidence that they’re on the right side of such disagreements, it seems appropriate to me to spend ~6 months focusing on purely this (of course this would require coordination, and presumably seems wildly impractical).
  My sense is that nothing on this scale happens (right?), and that the reasons have more to do with (entirely understandable) impracticality, coordination difficulties and frustration, than with principled epistemics and EV calculations.
  But perhaps I’m way off? My apologies if this is one of the same old cruxes and disagreements :).
  - Rohin Shah 13 Jan 2023 22:53 UTC
    LW: 3 AF: 3
    0
    AF Parent
    That doesn’t require greater epistemic competence: they need only tend to make different mistakes, not fewer mistakes.
    Yes, that’s true, I agree my original comment is overstated for this reason. (But it doesn’t change my actual prediction about what would happen; I still don’t expect reviewers to catch issues.)
    My sense is that nothing on this scale happens (right?)
    I’d guess that I’ve spent around 6 months debating these sorts of cruxes and disagreements (though not with a single person of course). I think the main bottleneck is finding avenues that would actually make progress.
    - Joe Collman 14 Jan 2023 1:58 UTC
      LW: 2 AF: 2
      0
      AF Parent
      Ah, well that’s mildly discouraging (encouraging that you’ve made this scale of effort; discouraging in what it says about the difficulty of progress).
      I’d still be interested to know what you’d see as a promising approach here—if such crux resolution were the only problem, and you were able to coordinate things as you wished, what would be a (relatively) promising strategy?
      But perhaps you’re already pursuing it? I.e. if something like [everyone works on what they see as key problems, increases their own understanding and shares insights] seems most likely to open up paths to progress.
      Assuming review wouldn’t do much to help on this, have you thought about distributed mechanisms that might? E.g. mapping out core cruxes and linking all available discussions where they seem a fundamental issue (potentially after holding/writing-up a bunch more MIRI Dialogues style interactions [which needn’t all involve MIRI]).
      Does this kind of thing seem likely to be of little value—e.g. because it ends up clearly highlighting where different intuitions show up, but shedding little light on their roots or potential justification?
      I suppose I’d like to know what shape of evidence seems most likely to lead to progress—and whether much/any of it might be unearthed through clarification/distillation/mapping of existing ideas. (where the mapping doesn’t require connections that only people with the deepest models will find)
      - Rohin Shah 16 Jan 2023 11:58 UTC
        LW: 5 AF: 4
        0
        AF Parent
        Personally if I were trying to do this I’d probably aim to do a combination of:
        Identify what kinds of reasoning people are employing, investigate under what conditions they tend to lead to the truth. E.g. one way that I think I differ from many others is that I am skeptical of analogies as direct evidence about the truth; I see the point of analogies as (a) tools for communicating ideas more effectively and (b) locating hypotheses that you then verify by understanding the underlying mechanism and checking that the mechanism ports (after which you don’t need the analogy any more).
        State arguments more precisely and rigorously, to narrow in on more specific claims that people disagree about (note there are a lot of skulls along this road)
  - Raemon 13 Jan 2023 17:51 UTC
    LW: 3 AF: 2
    0
    AF Parent
    FWIW I think a fairly substantial amount of effort has gone into resolving longstanding disagreements. I think that effort has resulted in a lot of good works and updates from many people reading about the disagreement discussion, but not really changed the mind of the people doing the arguing. (See: the MIRI Dialogues)
    And it’s totally plausible to me the answer is “10-100x the amount of work that is gone in so far.”
    I maybe agree that people haven’t literally sat and double-cruxed for six months. I don’t know that it’s fair to describe this as “impracticality, coordination difficulties and frustration” instead of “principled epistemics and EV calculations.” Like, if you’ve done a thing a bunch and it doesn’t seem to be working and you feel like you have traction on another thing, it’s not crazy to do the other thing.
    (That said, I do still have the gut level feeling of ‘man it’s absolutely bonkers that in the so-called rationality community a lot of prominent thinkers still disagree about such fundamental stuff.’)
    - Joe Collman 13 Jan 2023 19:05 UTC
      LW: 1 AF: 1
      0
      AF Parent
      Oh sure, I certainly don’t mean to imply that there’s been little effort in absolute terms—I’m very encouraged by the MIRI dialogues, and assume there are a bunch of behind-the-scenes conversations going on.
      I also assume that everyone is doing what seems best in good faith, and has potentially high-value demands on their time.
      However, given the stakes, I think it’s a time for extraordinary efforts—and so I worry that [this isn’t the kind of thing that is usually done] is doing too much work.
      I think the “principled epistemics and EV calculations” could perfectly well be the explanation, if it were the case that most researchers put around a 1% chance on [Eliezer/Nate/John… are largely correct on the cruxy stuff].
      That’s not the sense I get—more that many put the odds somewhere around 5% to 25%, but don’t believe the arguments are sufficiently crisp to allow productive engagement.
      If I’m correct on that (and I may well not be), it does not seem a principled justification for the status-quo. Granted the right course isn’t obvious—we’d need whoever’s on the other side of the double-cruxing to really know their stuff. Perhaps Paul’s/Rohin’s… time is too valuable for a 6 month cost to pay off. (the more realistic version likely involves not-quite-so-valuable people from each ‘side’ doing it)
      As for “done a thing a bunch and it doesn’t seem to be working”, what’s the prior on [two experts in a field from very different schools of thought talk for about a week and try to reach agreement]? I’m no expert, but I strongly expect that not to work in most cases.
      To have a realistic expectation of its working, you’d need to be doing the kinds of thing that are highly non-standard. Experts having some discussions over a week is standard. Making it your one focus for 6 months is not. (frankly, I’d be over the moon for the one month version [but again, for all I know this may have been tried])
      - Noosphere89 13 Jan 2023 21:32 UTC
        1 point
        0
        Parent
        Even more importantly, Aumann’s Agreement Theorem demands that both sides eventually agree on something, and the fact that the AI Alignment field hasn’t agreed yet is concerning.
        
        Here’s the link:
        
        https://www.lesswrong.com/tag/aumann-s-agreement-theorem
- habryka 11 Jan 2023 15:11 UTC
  LW: 2 AF: 1
  0
  AF Parent
  This was quite a while ago, probably over 2 years, though I do feel like I remember it quite distinctly. I guess my model of you has updated somewhat here over the years, and now is more interested in heads-down work.
  - Rohin Shah 11 Jan 2023 15:45 UTC
    LW: 4 AF: 3
    2
    AF Parent
    Yeah, that sounds entirely plausible if it was over 2 years ago, just because I’m terrible at remembering my opinions from that long ago.