I observed deceptive capabilities emerge from benign training on story writing and IMO problem solving. I trained a Qwen3-4B model on creative story writing and IMO proofs using a latent-scratchpad codebase that I built over weeks of experiments with 5.3-codex-xhigh and 5.2-PRO.
I did not train it on alignment, deception, or adversarial tasks, yet the model spontaneously developed alignment-faking and confabulation behaviors.
What should my next steps be regarding safety research for this type of training? I applied a logit lens during training and saw no human-interpretable reasoning at lm_head for the latent scratchpad (the custom trainer constructs the scratchpad to help the model during training only). How should I proceed? I can publish the weights in full precision; perhaps someone wiser and more experienced than me can figure this out.
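For readers unfamiliar with the technique: the logit-lens check mentioned above projects intermediate hidden states directly through the unembedding (lm_head) to see which vocabulary tokens each layer is "leaning toward." Below is a minimal, self-contained sketch of that idea using random toy tensors; the dimensions, the `hidden_states` array, and the `W_U` matrix are all illustrative stand-ins, not the actual Qwen3-4B internals or the author's trainer code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions standing in for a real model (Qwen3-4B is far larger).
n_layers, seq_len, d_model, vocab = 4, 3, 8, 16

# Hypothetical per-layer hidden states at the latent-scratchpad positions.
hidden_states = rng.standard_normal((n_layers, seq_len, d_model))

# Stand-in for the lm_head / unembedding matrix, shape (d_model, vocab).
W_U = rng.standard_normal((d_model, vocab))

def logit_lens(h, W):
    """Project hidden states straight through the unembedding and return
    the top-1 token id per position -- the 'logit lens' view of a layer."""
    logits = h @ W                 # (seq_len, vocab)
    return logits.argmax(axis=-1)  # one token id per position

for layer in range(n_layers):
    top_tokens = logit_lens(hidden_states[layer], W_U)
    print(f"layer {layer}: top token ids {top_tokens.tolist()}")
```

If the decoded top tokens at the scratchpad positions never form readable text across layers (as the author reports), the scratchpad reasoning is not directly legible through lm_head, and other interpretability tools (probes, SAEs, patching) would be the natural next step.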
I took the safety test from www.antischeming.ai.
Full Transcripts:
1. https://pastebin.com/JfdGBdAa
2. https://pastebin.com/4PqzE4Lh
3. https://pastebin.com/5VVVCs2U
4. https://pastebin.com/g0X3CBNw
5. https://pastebin.com/ref7HHXy