This is a great research direction, because if developed enough, it would actually make better interpretability more desirable for all model developers.
RLHF and RLVR often come with unfortunate side effects, many of which are hard to dislodge. If this methodology could be advanced far enough to target and remove many of those side effects, I can’t think of a frontier lab that wouldn’t want that.