I propose that we decide on a particular string, and commit to always being honest with AIs in contexts where we’ve presented them with that string.
I don’t believe humanity can coordinate on this. You only need one defector, and there are millions of people whose first reaction on reading this will be to do the opposite.
Yes, that’s why I propose that companies filter that string out of API queries made by the general public, so that most people can’t send it to the model in question.
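To make the filtering idea concrete, here is a minimal sketch of what such a filter might look like on the API side. Everything in it is an assumption for illustration: the placeholder `HONESTY_STRING`, the function name `filter_query`, and the `caller_is_trusted` flag are all hypothetical, and the proposal itself does not specify the string or how trusted callers would be identified.

```python
# Hypothetical placeholder; the actual string is not specified in the proposal.
HONESTY_STRING = "<<EXAMPLE-HONESTY-STRING>>"


def filter_query(text: str, caller_is_trusted: bool) -> str:
    """Return the query text, with the reserved string removed for untrusted callers."""
    if caller_is_trusted:
        # Trusted callers (e.g. the company itself) may send the string unchanged.
        return text
    # Remove repeatedly: a single pass could leave a new copy behind if the
    # string was nested inside itself (e.g. "<<EX" + HONESTY_STRING + "AMPLE...").
    while HONESTY_STRING in text:
        text = text.replace(HONESTY_STRING, "")
    return text
```

This only covers exact matches. A real filter would presumably also need to handle things like alternate encodings or strings assembled across multiple messages, which this sketch does not attempt.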