Stephen McAleese comments on Open Problems and Fundamental Limitations of RLHF

Stephen McAleese 31 Jul 2023 21:29 UTC
7 points
4
Thanks for writing the paper! I think it will be really impactful and that it fills a big gap in the literature.
I’ve always wondered what problems RLHF had, and mostly I’ve seen only short informal answers about how it incentivizes deception or how humans can’t provide a scalable signal for superhuman tasks. That is odd, because it’s one of the most commonly used AI alignment methods.
Before your paper, I think this post was the most in-depth analysis of problems with RLHF I’d seen, so your paper is now probably the best resource on the topic. Apart from that post, the List of Lethalities post has a few related sections, and this post by John Wentworth has a section on RLHF.
I’m sure your paper will spark future research on improving RLHF because it lists several specific discrete problems that could be tackled!
- scasper 31 Jul 2023 21:41 UTC
  6 points
  6
  Thanks, and +1 to adding the resources. Also, Charbel-Raphael, who authored the in-depth post, is one of the authors of this paper! That post in particular was something we paid attention to during the design of the paper.