I think there’s a possibly underappreciated application for watermarking: it would let AI service providers detect when a language model has been incorporated into a self-feedback system that crosses API-key boundaries. That is, if someone were to build a system like this thought experiment, or, more concretely, a version of ChaosGPT that, thanks to some combination of a stronger underlying model and better prompt engineering, doesn’t fizzle.
Currently the defenses that AI service providers seem to be using are RLHF to make systems refuse to answer bad prompts, and oversight AIs that watch for bad outputs (maybe only Bing has done that one). The problem with this is that determining whether a prompt is bad or not can be heavily dependent on context; for example, searching for security vulnerabilities in software might either be a subtask of “make this software more secure” or a subtask of “build a botnet”. An oversight system that’s limited to looking at conversations in isolation wouldn’t be able to distinguish these. On the other hand, an oversight system that could look at all the interactions that share an API key would be able to notice that a system like ChaosGPT was running, and inspect the entire thing.
In a world where that sort of oversight system existed, some users would split their usage across multiple API keys, to avoid letting the oversight system see what they were doing. OTOH, if outputs were watermarked, then you could detect whether one API key was being used to generate inputs fed to another API key, and link them together.
(This would fail if the watermark were too fragile, though; e.g., if asking a weaker AI to summarize an output produces text that says the same thing but isn’t watermarked, that defeats the purpose.)
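To make the linking idea concrete, here’s a minimal Python sketch of what the provider-side check might look like. It assumes a Kirchenbauer-et-al.-style statistical “green list” watermark, crude whitespace tokenization, and made-up values for the secret key, green fraction, and detection threshold; none of this reflects any real provider’s scheme, it’s just an illustration of the shape of the check.

```python
# Hypothetical sketch: flag API keys whose *incoming prompts* look like
# watermarked model output, i.e. text that some (possibly other) API key
# generated earlier. Assumes a green-list watermark where generation is
# biased toward a pseudorandom subset of tokens seeded by the previous token.

import hashlib
import math

SECRET_KEY = b"provider-side watermark key"  # hypothetical secret
GREEN_FRACTION = 0.5                          # fraction of vocab marked "green" per step


def is_green(prev_token: str, token: str) -> bool:
    """Pseudorandomly assign `token` to the green list, seeded by the previous token."""
    digest = hashlib.sha256(SECRET_KEY + prev_token.encode() + token.encode()).digest()
    return digest[0] / 255.0 < GREEN_FRACTION


def watermark_z_score(tokens: list[str]) -> float:
    """z-score of the observed green-token count against the unwatermarked expectation."""
    n = len(tokens) - 1
    if n <= 0:
        return 0.0
    greens = sum(is_green(prev, tok) for prev, tok in zip(tokens, tokens[1:]))
    expected = GREEN_FRACTION * n
    std = math.sqrt(n * GREEN_FRACTION * (1 - GREEN_FRACTION))
    return (greens - expected) / std


def linked_api_keys(prompts_by_key: dict[str, list[str]], threshold: float = 4.0) -> set[str]:
    """Return API keys that submitted at least one prompt that looks watermarked."""
    flagged = set()
    for api_key, prompts in prompts_by_key.items():
        for prompt in prompts:
            tokens = prompt.split()  # crude tokenization, for the sketch only
            if watermark_z_score(tokens) > threshold:
                flagged.add(api_key)
                break
    return flagged
```

In this toy version, a high z-score on an incoming prompt suggests it was generated under the provider’s watermark, so the submitting key can be linked to whichever key produced that output; the fragility caveat above corresponds to paraphrased or summarized text no longer clearing the threshold.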