We argue that a sustained commitment to transparency (e.g. to auditors) would make the RLHF research environment more robust from a safety standpoint.
Do you think this is more true of RLHF than other safety techniques or frameworks? At first blush, I would have thought “no”, and the reasoning you provide in this post doesn’t seem to distinguish RLHF from other things.
No, I don’t think the core advantages of transparency are really unique to RLHF, but in the paper, we list certain things that are specific to RLHF which we think should be disclosed. Thanks.
Do you think this is more true of RLHF than other safety techniques or frameworks? At first blush, I would have thought “no”, and the reasoning you provide in this post doesn’t seem to distinguish RLHF from other things.
No, I don’t think the core advantages of transparency are really unique to RLHF, but in the paper, we list certain things that are specific to RLHF which we think should be disclosed. Thanks.