Interpy experiments on Qwen3-TTS.

TLDR: Did some experiments on Qwen3-TTS with some neat results. I'm not sure exactly what would be most interesting to target from a safety perspective, so I'm interested in whether people have any takes or ideas.
What have I done so far (and how did I build this intuition)?
I took real speech (LibriSpeech with 10 speakers and 5 clips each), ran it through the Qwen3-TTS 12 Hz tokenizer, and got the 16 discrete code streams. Then I treated each layer as a representation and asked simple questions.
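Concretely, the data layout I'm working with looks like this (a minimal sketch; the shapes are my assumption about the tokenizer output, and random integers stand in for real codes, so the actual Qwen3-TTS tokenizer call isn't shown):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for tokenizer output: one (16, T) integer code array per clip.
# 10 speakers x 5 clips; ~3 s of audio at 12 Hz -> T = 36 frames.
N_LAYERS, T, VOCAB = 16, 36, 1024
clips = {
    (spk, i): rng.integers(0, VOCAB, size=(N_LAYERS, T))
    for spk in range(10) for i in range(5)
}

def layer_codes(clips, l):
    """Collect layer l's code stream from every clip: (n_clips, T)."""
    return np.stack([codes[l] for codes in clips.values()])

print(layer_codes(clips, 0).shape)  # (50, 36)
```

"Layer l" throughout this post just means row l of these arrays, pooled across clips.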
Probing. I trained a basic classifier to predict speaker ID from each layer's codes. From Layer 0 it gets about 10% accuracy (chance for 10 speakers, p = 0.55). From Layer 1 it gets about 30% (p < 0.001), and Layer 2 is similar. Later layers don't add much more.
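The probe can be very simple. Here's a sketch of the setup with a leave-one-out nearest-centroid classifier on bag-of-codes histograms (synthetic data stands in for the real codes, and I've constructed layer 1 to carry speaker identity and layer 0 not to, purely to show the mechanics, not to reproduce my numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
N_SPK, CLIPS, T, VOCAB = 10, 5, 200, 64

def make_clip(spk):
    """Layer 0 ignores speaker; layer 1 is speaker-dependent (by construction)."""
    l0 = rng.integers(0, VOCAB, T)
    w = np.ones(VOCAB); w[spk * 6:(spk + 1) * 6] += 40
    l1 = rng.choice(VOCAB, size=T, p=w / w.sum())
    return l0, l1

def hist(codes):
    """Bag-of-codes histogram: fraction of frames using each code."""
    return np.bincount(codes, minlength=VOCAB) / len(codes)

X0, X1, y = [], [], []
for spk in range(N_SPK):
    for _ in range(CLIPS):
        l0, l1 = make_clip(spk)
        X0.append(hist(l0)); X1.append(hist(l1)); y.append(spk)
X0, X1, y = np.array(X0), np.array(X1), np.array(y)

def loo_nearest_centroid_acc(X, y):
    """Leave-one-out nearest-centroid accuracy over clip histograms."""
    correct = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        cents = np.stack([X[mask & (y == s)].mean(0) for s in range(N_SPK)])
        correct += np.argmin(((cents - X[i]) ** 2).sum(1)) == y[i]
    return correct / len(y)

acc0 = loo_nearest_centroid_acc(X0, y)
acc1 = loo_nearest_centroid_acc(X1, y)
print(f"layer 0 probe: {acc0:.2f}, layer 1 probe: {acc1:.2f}")
```

The point is that the gap between layers, not the classifier, is doing the work: any probe this weak that still separates speakers from layer 1 but not layer 0 is evidence about where the information lives.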
Ablations. I did some "causal" (I say this loosely) checks instead of just probing:
Zeroing out individual layers and re-running the classifier: removing Layer 1 hurts speaker prediction the most; removing Layer 0 slightly helps.
Swapping layers between speakers: swapping L1–L2 moves speaker embeddings most of the way (about 80%) toward the target speaker, while swapping L0 mainly changes the words.
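The zero-ablation check can be sketched like this (again with synthetic data where, by construction, only layer 1 carries speaker identity; the probe is the same leave-one-out nearest-centroid idea, applied to concatenated per-layer histograms with one layer's slice zeroed out):

```python
import numpy as np

rng = np.random.default_rng(1)
N_SPK, CLIPS, T, VOCAB, N_LAYERS = 10, 5, 200, 32, 3

def make_clip(spk):
    codes = rng.integers(0, VOCAB, size=(N_LAYERS, T))
    # Only layer 1 is speaker-dependent in this toy setup.
    w = np.ones(VOCAB); w[spk * 3:(spk + 1) * 3] += 30
    codes[1] = rng.choice(VOCAB, size=T, p=w / w.sum())
    return codes

def features(codes):
    """Concatenate per-layer code histograms into one feature vector."""
    return np.concatenate(
        [np.bincount(codes[l], minlength=VOCAB) / T for l in range(N_LAYERS)])

X, y = [], []
for spk in range(N_SPK):
    for _ in range(CLIPS):
        X.append(features(make_clip(spk))); y.append(spk)
X, y = np.array(X), np.array(y)

def loo_acc(X, y):
    correct = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        cents = np.stack([X[mask & (y == s)].mean(0) for s in range(N_SPK)])
        correct += np.argmin(((cents - X[i]) ** 2).sum(1)) == y[i]
    return correct / len(y)

accs = {}
for l in range(N_LAYERS):
    X_abl = X.copy()
    X_abl[:, l * VOCAB:(l + 1) * VOCAB] = 0.0  # zero-ablate layer l
    accs[l] = loo_acc(X_abl, y)
    print(f"ablate layer {l}: acc = {accs[l]:.2f}")
```

In this toy version, ablating layer 1 collapses accuracy to chance while ablating layer 0 leaves it intact, which is the shape of the result I saw on the real codes.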
Timescales. I also looked at timescales and capacity: Layer 0 codes persist longer in time and use much less of the codebook, while higher layers change faster and use most of the vocabulary. This is consistent with my intuition of phonemes vs. acoustic texture.
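Both statistics are cheap to compute per code stream. A sketch of the two measures I mean (mean run length for persistence, unique-code fraction for codebook usage; the "slow" and "fast" streams here are synthetic stand-ins for Layer 0 vs. a higher layer):

```python
import numpy as np

def mean_run_length(codes):
    """Average length of runs of identical consecutive codes."""
    change = np.flatnonzero(np.diff(codes)) + 1
    runs = np.diff(np.concatenate([[0], change, [len(codes)]]))
    return runs.mean()

def codebook_usage(codes, vocab):
    """Fraction of the codebook that actually appears in the stream."""
    return len(np.unique(codes)) / vocab

rng = np.random.default_rng(0)
slow = np.repeat(rng.integers(0, 16, 50), 4)  # persistent, few codes used
fast = rng.integers(0, 1024, 200)             # fast-changing, broad usage

print(mean_run_length(slow), mean_run_length(fast))
print(codebook_usage(slow, 1024), codebook_usage(fast, 1024))
```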
Logit Lens. Separately, I poked at the text-to-audio transformer with the logit lens. Style prompts barely affect early layers, and their effect peaks in the middle (around layers 13-20), which suggests prosody is added relatively late.
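The mechanics of that comparison look roughly like this (a toy sketch: `W_U`, `h_neutral`, and `h_style` are random stand-ins for the model's unembedding and per-layer hidden states under the two prompts; the real version reads these out of the transformer and compares the layerwise next-token distributions, e.g. via KL divergence):

```python
import numpy as np

rng = np.random.default_rng(0)
N_LAYERS, D, VOCAB = 8, 64, 100

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def kl(p, q):
    return float((p * np.log(p / q)).sum())

# Stand-ins: final unembedding and hidden states under two prompts.
W_U = rng.normal(size=(D, VOCAB))
h_neutral = rng.normal(size=(N_LAYERS, D))
h_style = h_neutral + 0.1 * rng.normal(size=(N_LAYERS, D))

# Logit lens: decode every intermediate layer with the final unembedding,
# then ask at which depth the style prompt starts shifting the prediction.
shift = [kl(softmax(h_s @ W_U), softmax(h_n @ W_U))
         for h_s, h_n in zip(h_style, h_neutral)]
print(np.round(shift, 3))
```

On the real model, the interesting signal is where along the depth axis this per-layer shift curve peaks.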
Having played with it for a while, I have some intuition that:
Layer 0: What you say (phonemes, words)
Layers 1-2: Who you are (speaker identity)
Layers 3-15: How you say it (prosody, style, acoustic texture)
That all seems pretty clean, which makes me curious whether there are alignment/safety experiments worth doing on top of this structure. Any thoughts?