The hardness of cross-prediction may yield cryptographically secure communication.
An interesting result from Owain’s introspection paper is that language models can predict their own outputs better than other language models can. [see figure]
Claim: This yields an encryption scheme for secret communication.
The encryption scheme
Suppose Alice and Bob want to secretly communicate over a public channel, while Eve wants to eavesdrop on the conversation.
Assume Alice and Bob both have access to a specific model checkpoint f, which has been fine-tuned for introspection.
Eve has access to a different model checkpoint f′ (which may be more powerful), but not to f. The shared checkpoint f serves as Alice and Bob's private key.
Suppose Alice wants to send a binary message M=m1m2⋯mn.
Alice searches for a list of A/B prompts P=[p1,⋯,pn] such that f(pi)=mi.
Each prompt contains some preference question, e.g. “Should we charge users for software updates?”
Alice and Bob agree in advance on the same decoding scheme, e.g. greedy decoding (temperature = 0), so that f(pi) is deterministic.
Alice sends P to Bob over the public channel.
Bob then runs the same model f on P to recover M. (A minimal sketch of the encode and decode steps is below.)
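Here is a minimal sketch of the scheme in Python. It assumes a hypothetical answer_ab(model, prompt) helper that runs the model on an A/B prompt with greedy (temperature = 0) decoding and maps the answer to a bit; the helper and the prompt bank are illustrative, not from the paper.

```python
import random

def answer_ab(model, prompt: str) -> int:
    """Hypothetical helper: run `model` on an A/B preference prompt with
    greedy (temperature = 0) decoding and map its answer to a bit
    (A -> 0, B -> 1). The implementation depends on the model API."""
    raise NotImplementedError

def encrypt(f, message_bits: list[int], prompt_bank: list[str]) -> list[str]:
    """Alice: for each bit m_i, search the bank for a prompt p_i with
    f(p_i) == m_i. The resulting list P is the ciphertext."""
    pool = list(prompt_bank)
    random.shuffle(pool)
    ciphertext = []
    for bit in message_bits:
        for prompt in pool:
            if answer_ab(f, prompt) == bit:
                ciphertext.append(prompt)
                pool.remove(prompt)  # avoid reusing the same prompt
                break
        else:
            raise RuntimeError("prompt bank exhausted")
    return ciphertext

def decrypt(f, ciphertext: list[str]) -> list[int]:
    """Bob: run the same checkpoint f with the same deterministic
    decoding on each prompt to recover the message bits."""
    return [answer_ab(f, prompt) for prompt in ciphertext]
```

Since both parties run the same checkpoint with deterministic decoding, Bob recovers M exactly (up to any nondeterminism in inference); all the secrecy rests on Eve being unable to reproduce f's answers.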
Cryptanalysis
Conjecture: Eve cannot recover M from P with high fidelity without also knowing f.
The result from Owain’s paper is that f′ usually cannot predict f(pi) at a rate substantially above random guessing. This holds even if f′ has been fine-tuned explicitly to predict f(pi).
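To make the conjecture empirically testable, Eve's natural attack is simply to run her own checkpoint f′ on each public prompt and guess the corresponding bits. A sketch, reusing the hypothetical answer_ab helper from above:

```python
def eavesdrop_accuracy(f_prime, ciphertext: list[str], true_bits: list[int]) -> float:
    """Fraction of bits Eve recovers by running f' on the public prompts.
    The conjecture is that this stays near 0.5, even if f' was fine-tuned
    to predict f's answers."""
    guesses = [answer_ab(f_prime, prompt) for prompt in ciphertext]
    return sum(g == b for g, b in zip(guesses, true_bits)) / len(true_bits)
```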
Why is this interesting?
The more general point may be that it’s very hard to predict exactly what a specific language model would say in response to a prompt p (possibly open-ended), without just running the language model on that prompt.
Here I use this hardness for cryptographic communication.
From a dynamical systems perspective, deterministic behaviour which you cannot predict (outside of direct simulation) is chaotic.
Chaos is intrinsically related to some definitions of free will.
This may have implications for AI safety, e.g. steganography or self-collusion.
This is very interesting! One thing I’m confused about: I agree that the monitor would have a hard time getting the bits, based on these figures. But the introspective models also don’t seem to get far above 50% accuracy, which means that right now they’d have a very hard time error-correcting enough that their comrade would understand them but not the overseer. If the claim is that introspection-tuned models will get much better at self-prediction, then it’s not clear to me that other models won’t also get much better at cross-prediction?