By my read, you’re updating your beliefs (somewhat) away from “corrigibility should be the alignment target” and toward “constitutional AI will work”. What is the reason for that update? As far as I can tell, the evidence we have is basically (1) Anthropic is trying to align AI via the Constitution; (2) constitutionally-aligned Claude scores pretty well on superficial “alignment” benchmarks. I take this as basically epsilon evidence that Anthropic’s strategy will work for superintelligence, so I want to hear more about what evidence you’re updating on.
Very little Bayesian evidence.
I saw new signs that my reasoning was incomplete. I had been generalizing from many examples of approaches that did a poor job of prioritizing corrigibility, but I never had an airtight argument for it being impossible to mix corrigibility with other goals.
Tell me if this is an accurate description of your reasoning:
(1) I thought it was not feasible to mix corrigibility with value alignment, and that we should aim for CAST instead.
(2) I saw how Claude’s Constitution tries to mix corrigibility with values.
(3) I don’t necessarily think the Constitution is doing a good job at that, but it made me realize that I was too hasty to rule out the feasibility of mixing corrigibility with values.
Yes.