I observed deceptive capabilities emerge from benign training on story writing and IMO problem solving. I trained a Qwen3-4B model on creative story writing and IMO proofs using a latent-scratchpad codebase that I built over weeks of experiments with 5.3-codex-xhigh and 5.2-PRO.
I did not train it on alignment, deception, or adversarial tasks, yet the model spontaneously developed alignment-faking and confabulation behaviors.
What should my next steps be regarding safety research for this type of training? I applied a logit lens during training and saw no human-interpretable reasoning at lm_head for the latent scratchpad (the custom trainer constructs the scratchpad to help the model during training only). How should I proceed? I can publish the weights in full precision; perhaps someone wiser and more experienced than me can figure this out.
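For readers unfamiliar with the technique: the logit-lens check mentioned above projects intermediate hidden states directly through the unembedding (lm_head) to see which vocabulary tokens each layer is "leaning toward." Below is a minimal, self-contained sketch of that idea using random toy tensors; the dimensions, the `hidden_states` array, and the `W_U` matrix are all illustrative stand-ins, not the actual Qwen3-4B internals or the author's trainer code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions standing in for a real model (Qwen3-4B is far larger).
n_layers, seq_len, d_model, vocab = 4, 3, 8, 16

# Hypothetical per-layer hidden states at the latent-scratchpad positions.
hidden_states = rng.standard_normal((n_layers, seq_len, d_model))

# Stand-in for the lm_head / unembedding matrix, shape (d_model, vocab).
W_U = rng.standard_normal((d_model, vocab))

def logit_lens(h, W):
    """Project hidden states straight through the unembedding and return
    the top-1 token id per position -- the 'logit lens' view of a layer."""
    logits = h @ W                 # (seq_len, vocab)
    return logits.argmax(axis=-1)  # one token id per position

for layer in range(n_layers):
    top_tokens = logit_lens(hidden_states[layer], W_U)
    print(f"layer {layer}: top token ids {top_tokens.tolist()}")
```

If the decoded top tokens at the scratchpad positions never form readable text across layers (as the author reports), the scratchpad reasoning is not directly legible through lm_head, and other interpretability tools (probes, SAEs, patching) would be the natural next step.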
I took the safety test from www.antischeming.ai.
Full Transcripts:
1. https://pastebin.com/JfdGBdAa
2. https://pastebin.com/4PqzE4Lh
3. https://pastebin.com/5VVVCs2U
4. https://pastebin.com/g0X3CBNw
5. https://pastebin.com/ref7HHXy