Davidmanheim comments on Coverage-driven alignment—What ‘Teaching Claude Why’ can borrow from AV verification

Davidmanheim 8 Jun 2026 15:09 UTC
4 points
0
This seems right for all the approaches except Anthropic’s. That is, if we expect that alignment is achieved via virtue ethics, coverage-driven iteration seems largely superfluous; we don’t expect that CEOs with more experience are more ethical. The place it becomes critical is for those not betting on character as an alignment method, or for Anthropic guarding against alignment deterioration during RL, since that work may partially be happening after the character and alignment training.
- Yoav Hollander 8 Jun 2026 16:57 UTC
  10 points
  0
  Thanks, David. I’d put it slightly differently: CDV (coverage-driven verification) isn’t trying to make the model more ethical through iteration (you’re right that experience doesn’t make a CEO more ethical). It is trying to find out where the character training actually held and where it silently didn’t. Even if you’re fully betting on character, you still need to know whether that character generalizes to (say) the low-oversight / conflicting-incentive / multi-agent region, and the only efficient way I know to find that out is via systematic, CDV-style sampling.
  I’d actually argue virtue ethics is the case where this matters most, not least. A character bet leaves the spec maximally implicit: “be of good character” says nothing explicit about price collusion or a mid-campaign rule change. Those are what I call spec bugs in the post—regions nobody thought to enumerate. So the more you lean on character rather than an enumerated behavioral spec, the more unmapped territory you have, and the more you need coverage discovery to surface it.
  Finally, you are absolutely right that “guarding against alignment deterioration during RL” is another important consideration. In fact, while writing the post I challenged myself to come up with techniques that have a chance of countering @evhub’s long-horizon RL fears (e.g. the AI CEO).