I agree. Proving with math or logic that a piece of real-world advice is harmless means either
Mathematically proving what would happen in the real world if that advice were carried out (which requires a mathematically perfect world model),
or
At least proving that the mind generating the advice has aligned goals, so the advice is unlikely to be harmful (but one of the hardest parts of solving alignment is finding a provable proxy for alignedness).
PS: I don’t want to pile on criticism, because I feel ambitious new solutions need to be encouraged! Chains of weaker agents controlling stronger agents are worth studying, and I actually love the idea of running various bureaucratic safety processes and distilling their output through imitation. Filtering dangerous advice through protocols seems very rational.