Thanks, David. I’d put it slightly differently: CDV (coverage-driven verification) isn’t trying to make the model more ethical through iteration (you’re right that experience doesn’t make a CEO more ethical). It is trying to find out where the character training actually held and where it silently didn’t. Even if you’re fully betting on character, you still need to know whether that character generalizes to (say) the low-oversight / conflicting-incentive / multi-agent region, and the only efficient way I know to find that out is systematic, CDV-style sampling.
I’d actually argue virtue ethics is the case where this matters most, not least. A character bet leaves the spec maximally implicit: “be of good character” says nothing explicit about price collusion or a mid-campaign rule change. Those are what the post calls spec bugs: regions nobody thought to enumerate. So the more you lean on character rather than an enumerated behavioral spec, the more unmapped territory you have, and the more you need coverage discovery to surface it.
Finally, you are absolutely right that “guarding against alignment deterioration during RL” is another important consideration. In fact, while writing the post I challenged myself to come up with techniques that have a chance of countering @evhub’s long-horizon RL fears (e.g. the AI CEO).
Your adversary, “agentic RL with misspecified rewards”, is exactly what I’ve been working on, coming from a different field (coverage-driven verification in AV and chip safety). One distinction that might be useful: misspecified rewards split into ones you could have anticipated (findable by denser testing) and ones where the spec was simply silent on a contingency (price collusion, a mid-campaign rule change, a move nobody enumerated). The second kind is more dangerous because, by construction, a stress test built from misspecifications you can author cannot contain them. The post I just wrote about this calls the second kind “spec bugs”, asks “can you enumerate the dimensions of an open agent”, and suggests enhancing “Teaching Claude Why” with a coverage-driven adversarial RL pipeline. Here it is; I’d be curious what you make of it.