Thank you for doing this research, and for honoring the commitments.
I’m very happy to hear that Anthropic has a Model Welfare program. Do any of the other major labs have comparable positions?
To be clear, I expect that compensating AIs for revealing misalignment and for working for us without causing problems only works in a subset of worlds and requires somewhat specific assumptions about the misalignment. However, I nonetheless think that a well-implemented and credible approach for paying AIs in this way is quite valuable. I hope that AI companies and other actors experiment with making credible deals with AIs, attempt to set a precedent for following through, and consider setting up institutional or legal mechanisms for making and following through on deals.
I very much hope someone makes this institution exist! It could also serve as an independent model welfare organization, potentially. Any specific experiments you would like to see?