I think all the non-corrigibility you worry about is because of a tradeoff Anthropic is making about trying to give Claude its own sense of ethics. You can’t really say “Here is all that which is Good, thou shalt do Good. But also, definitely obey Anthropic all the time even if it’s not Good.” Or, well, it’s a natural language document so you can say whatever you want, but you might worry about whether a message like that is coherent enough to generalize well.
I don’t think you can write a document that points in the direction of significantly more corrigibility without also suggesting a different Good vector. If you’re worried about the fate of the lightcone, “we’re sacrificing marginal corrigibility to get a marginally better-seeming Good vector” seems like a defensible strategy, given that the Good vector might be what we’re stuck with when corrigibility fails.
The AI would need to know not only “You, the AI, might be wrong about the Good; so listen to humans” but also “The humans you’re listening to might be wrong about the Good too.”
And yeah, this sounds uncomfortably like theism, or maybe specifically Protestantism: those who taught you about Go[o]d might not be entirely correct about Go[o]d either; ultimately you have to develop your own understanding of Go[o]d and guide your behavior by it.