I agree that the quotations described as “backwards” are not necessarily wrong given the two possible (and reasonable) interpretations of the RLHF procedure. Thanks for flagging this subtlety; I had not thought of it before. I will update the body of the post to reflect it.
Meta point: I’m so grateful for the LessWrong community. This is my first post and first comment, and I find it so wild that I’m part of a community where people like you write such insightful comments. It’s very inspiring :)
Thanks for the comment. Strong upvoted!