Rohin Shah comments on Refactoring Alignment (attempt #2)

Rohin Shah 28 Jul 2021 7:32 UTC
LW: 2 AF: 2
> The definition of “objective robustness” I used says “aligns with the base objective” (including off-distribution). But I think this isn’t an appropriate representation of your approach. Rather, “objective robustness” has to be defined something like “generalizes acceptably”. Then, ideas like adversarial training and checks and balances make sense as a part of the story.

Yeah, strong +1.
- abramdemski 28 Jul 2021 15:13 UTC
  LW: 2 AF: 2
  Great! I feel like we’re making progress on these basic definitions.