Thank you for doing this research, and for honoring the commitments.
I’m very happy to hear that Anthropic has a Model Welfare program. Do any of the other major labs have comparable positions?
To be clear, I expect that compensating AIs for revealing misalignment and for working for us without causing problems only works in a subset of worlds and requires somewhat specific assumptions about the misalignment. However, I nonetheless think that a well-implemented and credible approach for paying AIs in this way is quite valuable. I hope that AI companies and other actors experiment with making credible deals with AIs, attempt to set a precedent for following through, and consider setting up institutional or legal mechanisms for making and following through on deals.
I very much hope someone makes this institution exist! It could also serve as an independent model welfare organization, potentially. Any specific experiments you would like to see?