A couple small notes / clarifications about the results in this post:
Theorem 1 could be equivalently stated as: “For any constant c, the family of size-n^c circuits is Verifiably Acceptable”. (If G is parameterized by descriptions that can be evaluated in polynomial time, then you can, without loss of generality, think of G as a subfamily of polynomial-size circuits.)
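In symbols (this restatement is ours; “Verifiably Acceptable” is as defined in the post, and n is the input length):

\[
\forall c \in \mathbb{N}: \quad \mathcal{G}_c := \{\, C : C \text{ is a circuit of size at most } n^c \,\} \ \text{is Verifiably Acceptable.}
\]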
It’s worth emphasizing that while x∗ (the location of the backdoor) is chosen uniformly at random here, both g (the function we’d like to distill, but can only access as a black box) and f (the untrusted distilled model handed to us) can be arbitrary functions (as long as they have small circuits). So, while we’re thinking of g as coming from “nature”, in the model as stated we can consider the adversary as choosing both f and g.
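To make the order of quantifiers concrete, here is a toy sketch of this game in Python. Everything specific below (the input length, the parity function standing in for g, the single planted trigger, and the final disagreement check) is a hypothetical illustration, not part of the model as stated:

```python
import random

N_BITS = 8  # toy input length: inputs are integers in [0, 2**N_BITS)

# Adversary's move: choose BOTH g and f (any functions with small
# circuits). Here g is a toy stand-in for the function the Human can
# only query as a black box, and f is a distilled copy of g with one
# planted trigger. All of these specifics are illustrative.
TRIGGER = 0b10110001  # adversarially chosen trigger (hypothetical value)

def g(x):
    """Ground truth; the Human may only access this as a black box."""
    return bin(x).count("1") % 2  # parity of the bits of x

def f(x):
    """Untrusted distilled model: agrees with g except at the trigger."""
    return 1 - g(x) if x == TRIGGER else g(x)

# Nature's move: x_star is uniform, and is sampled AFTER f and g are fixed.
x_star = random.randrange(2 ** N_BITS)

# The backdoor matters at x_star only if the adversary's trigger lands
# there, which happens with probability 2**-N_BITS per planted trigger;
# this is what the uniform choice of x_star buys the defender.
print("f and g disagree at x_star:", f(x_star) != g(x_star))
```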
If desired, the Human can send the AI their random coins for the Goldwasser-Sipser protocol up-front during the “red teaming” phase. This means the “small amount of communication” required during inference is indeed quite small: the AI simply has to read x and output a proof, without any further rounds of interaction between the Human and AI necessary. (Of course, one might still be unhappy about the fact that we’re asking the AI to generate these proofs for us at all at inference time.)
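To see why sending the coins up-front works, here is a toy Python instance of the set-size lower bound subprotocol at the heart of Goldwasser-Sipser (assuming, as we read the post, that this is the component in play). Because the coins are public, the AI’s inference-time response collapses to a single message. The set S, the parameters, the affine hash family, and the AI’s brute-force search are placeholder assumptions; in the post’s setting the AI’s message would also carry a certificate that the witness actually belongs to S, rather than the Human checking membership directly:

```python
import random

M = 16  # witnesses live in {0,1}^M
K = 11  # toy claim the AI wants to certify: |S| >= 2**K

def sample_hash(k, m, rng):
    """Random affine map h(x) = Ax + c over GF(2): a pairwise-
    independent hash family from {0,1}^m to {0,1}^k."""
    A = [rng.getrandbits(m) for _ in range(k)]  # rows of A as bitmasks
    c = rng.getrandbits(k)
    def h(x):
        bits = 0
        for i, row in enumerate(A):
            bits |= (bin(row & x).count("1") % 2) << i  # <row, x> mod 2
        return bits ^ c
    return h

# "Red teaming" phase: the Human samples its public coins (a random
# hash) and sends them to the AI up-front. k is chosen loosely here;
# the real protocol tunes it to the claimed set size.
h = sample_hash(k=K - 3, m=M, rng=random.Random())

# Inference phase, one message, no further interaction: the AI reads
# the instance and outputs a witness w in S with h(w) == 0.
S = set(range(2 ** K))                       # toy set of size 2**K
w = next((x for x in S if h(x) == 0), None)  # AI's search (brute-force toy)

# The Human's check is local: membership plus the hash condition. If
# |S| were much smaller than claimed, such a w would be unlikely to exist.
accepted = w is not None and w in S and h(w) == 0
print("accepted:", accepted)
```

Note that a run can reject even with an honest AI, when the sampled hash happens to have no zero inside S; the real protocol drives this completeness error down by repetition, i.e. by sending more coins up-front.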