I like the main idea of the post. It's important to note, though, that the setup assumes we have a bunch of alignment ideas that each have an independent 10% chance of working. In reality I expect a lot of correlation: there is a decent chance that alignment is easy and many of our ideas work, and a decent chance that it's hard and basically nothing works.
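To make the contrast concrete, here's a minimal Monte Carlo sketch. The two-world mixture and its numbers (50/50 odds of an easy vs. hard world, with per-idea success of 19% vs. 1%) are illustrative assumptions on my part, chosen so the marginal per-idea chance stays at 10% in both models:

```python
import numpy as np

rng = np.random.default_rng(0)
n_ideas, n_trials = 10, 100_000

# Independent model: each idea works with probability 0.1 on its own.
indep = rng.random((n_trials, n_ideas)) < 0.1

# Correlated (mixture) model: with prob 0.5 alignment is "easy" and each
# idea works with prob 0.19; otherwise it's "hard" and each works with
# prob 0.01. The marginal per-idea success rate is still 10%.
easy = rng.random(n_trials) < 0.5
p = np.where(easy, 0.19, 0.01)
corr = rng.random((n_trials, n_ideas)) < p[:, None]

for name, sims in [("independent", indep), ("correlated", corr)]:
    print(f"{name}: P(at least one idea works) = {sims.any(axis=1).mean():.3f}")
```

Under independence, P(at least one of 10 works) = 1 − 0.9¹⁰ ≈ 0.65; under the mixture it drops to roughly 0.49, because in the hard world all the ideas fail together. Same per-idea odds, much less benefit from having many ideas.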
Agreed, this only matters in the regime where some but not all of your ideas will work. But even in alignment-is-easy worlds, I doubt literally everything will work, so testing would still be helpful.