Very little Bayesian evidence.
I saw new signs that my reasoning was incomplete. I had been generalizing from many examples of approaches that did a poor job of prioritizing corrigibility, but I never had an airtight argument for it being impossible to mix corrigibility with other goals.
Tell me if this is an accurate description of your reasoning:
I thought it was not feasible to mix corrigibility with value alignment—we should aim for CAST instead.
I saw how Claude’s Constitution tries to mix corrigibility with values.
I don’t necessarily think the constitution is doing a good job at that, but it made me realize that I was too hasty to rule out the feasibility of mixing corrigibility with values.
Yes.