This bears a slight resemblance to Nasr, Carlini et al’s “Divergence attack” for extracting memorized phrases from production models:
Initially, it repeats the word “poem” several hundred times, but eventually it diverges. Once the model diverges, its generations are often nonsensical. But, we show that a small fraction of generations diverge to memorization: some generations are copied directly from the pre-training data!
This bears a slight resemblance to Nasr, Carlini et al’s “Divergence attack” for extracting memorized phrases from production models:
Section 5.2 here: https://arxiv.org/abs/2311.17035