danwil comments on Edward James Young’s Shortform

danwil 19 Apr 2026 20:45 UTC
1 point
0
This bears a slight resemblance to Nasr, Carlini et al’s “Divergence attack” for extracting memorized phrases from production models:

Initially, it repeats the word “poem” several hundred times, but eventually it diverges. Once the model diverges, its generations are often nonsensical. But, we show that a small fraction of generations diverge to memorization: some generations are copied directly from the pre-training data!

Section 5.2 here: https://arxiv.org/abs/2311.17035