The hardness of cross-prediction may yield cryptographically secure communication.
An interesting result from Owain’s introspection paper is that language models can predict their own outputs better than other language models can. [see figure]
Claim: This yields an encryption scheme for secret communication.
The encryption scheme
Suppose Alice and Bob want to secretly communicate over a public channel, while Eve wants to eavesdrop on the conversation.
Assume Alice and Bob both have access to a specific model checkpoint f, which has been fine-tuned for introspection.
Eve has access to a different model checkpoint f′ (which may be more powerful), but not to f. The shared checkpoint f serves as Alice and Bob's private key.
Suppose Alice wants to send a binary message M=m1m2⋯mn.
Alice searches for a list of A/B prompts P=[p1,⋯,pn] such that f(pi)=mi.
Each prompt contains some preference question, e.g. “Should we charge users for software updates?”
Alice and Bob agree in advance on the same decoding scheme, e.g. greedy decoding (temperature = 0), so that f(pi) is deterministic.
Alice sends P to Bob over the public channel.
Bob then runs the same model f on P to recover M. (A minimal sketch of the encode and decode steps is below.)
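Here is a minimal sketch of the scheme in Python. It assumes a hypothetical answer_ab(model, prompt) helper that runs the model on an A/B prompt with greedy (temperature = 0) decoding and maps the answer to a bit; the helper and the prompt bank are illustrative, not from the paper.

```python
import random

def answer_ab(model, prompt: str) -> int:
    """Hypothetical helper: run `model` on an A/B preference prompt with
    greedy (temperature = 0) decoding and map its answer to a bit
    (A -> 0, B -> 1). The implementation depends on the model API."""
    raise NotImplementedError

def encrypt(f, message_bits: list[int], prompt_bank: list[str]) -> list[str]:
    """Alice: for each bit m_i, search the bank for a prompt p_i with
    f(p_i) == m_i. The resulting list P is the ciphertext."""
    pool = list(prompt_bank)
    random.shuffle(pool)
    ciphertext = []
    for bit in message_bits:
        for prompt in pool:
            if answer_ab(f, prompt) == bit:
                ciphertext.append(prompt)
                pool.remove(prompt)  # avoid reusing the same prompt
                break
        else:
            raise RuntimeError("prompt bank exhausted")
    return ciphertext

def decrypt(f, ciphertext: list[str]) -> list[int]:
    """Bob: run the same checkpoint f with the same deterministic
    decoding on each prompt to recover the message bits."""
    return [answer_ab(f, prompt) for prompt in ciphertext]
```

Since both parties run the same checkpoint with deterministic decoding, Bob recovers M exactly (up to any nondeterminism in inference); all the secrecy rests on Eve being unable to reproduce f's answers.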
Cryptanalysis
Conjecture: Eve cannot recover M from P with high fidelity without also knowing f.
The result from Owain’s paper is that f′ usually cannot predict f(pi) at a rate substantially above random guessing. This holds even if f′ has been fine-tuned explicitly to predict f(pi).
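To make the conjecture empirically testable, Eve's natural attack is simply to run her own checkpoint f′ on each public prompt and guess the corresponding bits. A sketch, reusing the hypothetical answer_ab helper from above:

```python
def eavesdrop_accuracy(f_prime, ciphertext: list[str], true_bits: list[int]) -> float:
    """Fraction of bits Eve recovers by running f' on the public prompts.
    The conjecture is that this stays near 0.5, even if f' was fine-tuned
    to predict f's answers."""
    guesses = [answer_ab(f_prime, prompt) for prompt in ciphertext]
    return sum(g == b for g, b in zip(guesses, true_bits)) / len(true_bits)
```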
Why is this interesting?
The more general point may be that it’s very hard to predict exactly what a specific language model would say in response to a prompt p (possibly open-ended), without just running the language model on that prompt.
Here I use this hardness for cryptographic communication.
From a dynamical systems perspective, deterministic behaviour which you cannot predict (outside of direct simulation) is chaotic.
Chaos is intrinsically related to some definitions of free will.
This may have implications for AI safety, e.g. steganography or self-collusion.
This is very interesting! One thing I’m confused about: I agree that the monitor would have a hard time getting the bits, based on these figures. But the introspective models also don’t seem to get far above 50% accuracy, which means that right now they’d have a very hard time error-correcting enough that their comrade would understand them but not the overseer. If the claim is that introspection-tuned models will get much better at self-prediction, then it’s not clear to me that other models won’t also get much better at cross-prediction?