MichaelDickens comments on Claude’s Constitution

MichaelDickens 26 Apr 2026 1:07 UTC
4 points
2
By my read, you’re updating your beliefs (somewhat) away from “corrigibility should be the alignment target” and toward “constitutional AI will work”. What is the reason for that update? As far as I can tell, the evidence we have is basically (1) Anthropic is trying to align AI via the Constitution; (2) constitutionally-aligned Claude scores pretty well on superficial “alignment” benchmarks. I take this as basically epsilon evidence that Anthropic’s strategy will work for superintelligence, so I want to hear more about what evidence you’re updating on.
- PeterMcCluskey 27 Apr 2026 17:17 UTC
  4 points
  0
  Very little Bayesian evidence. I saw new signs that my reasoning was incomplete. I had been generalizing from many examples of approaches that did a poor job of prioritizing corrigibility, but I never had an airtight argument for it being impossible to mix corrigibility with other goals.
  - MichaelDickens 27 Apr 2026 18:21 UTC
    2 points
    0
    Tell me if this is an accurate description of your reasoning:
    
    I thought it was not feasible to mix corrigibility with value alignment—we should aim for CAST instead.
    I saw how Claude’s Constitution tries to mix corrigibility with values.
    I don’t necessarily think the constitution is doing a good job at that, but it made me realize that I was too hasty to rule out the feasibility of mixing corrigibility with values.
    - PeterMcCluskey 28 Apr 2026 2:27 UTC
      4 points
      0
      Yes.