This is a real contribution to soft-aligned capabilities, so I’m single-upvoting it. However, it does not appear to me to be any less vulnerable to issues with incentive causality than the rest of the RLHF family: the severity of adversarial examples against the reward model still appears to be unbounded. As far as I can tell from a skim, the method does a good job of optimizing for the reward models, but it still seems quite plausible that large amounts of counterfactual model behavior in untested regions of input-space could be optimized toward misbehavior by the RLHF class of approaches this method advances. I’d encourage y’all to look into work on formal bounds, e.g. this interesting paper; into causal patterns in different RL algorithms and how those affect alignment, e.g. the work by the causal incentives group on the DeepMind safety team; and into the threat models folks have proposed that your research may indirectly affect, such as what a path toward the limit of biotechnology looks like, and the general property that effective-plan-space contains many plans which injure humanity. It seems like your group is focused on the local landscape of issues that arise for language models.