Latent Reasoning Sprint #2: Token-Based Signals and Linear Probes

In my previous post, I applied the logit lens and tuned lens to CODI's latent reasoning chain and found evidence consistent with the scratchpad paper's compute/store alternation hypothesis: even steps showed higher intermediate-answer detection and odd steps showed higher entropy, matching the results of "Can we interpret latent reasoning using current mechanistic interpretability tools?".

This post digs deeper into some of the questions raised while writing the previous post, along with new findings I did not expect.

TL;DR

  • The CODI tuned lens fails to generalize to non-CODI activations

  • Final answer detection at even steps 2 and 4 is robust across top-k values 1-10, ruling out threshold artifacts

  • “Therefore” appears specifically after latent step 3, and its detection rate keeps rising from latent 3 through latent 6, suggesting that steps 3 and 5 are qualitatively different despite both being odd computation steps

  • A linear probe trained to distinguish intermediate vs final answer representations activates most strongly at odd steps 3 and 5, consistent with the scratchpad paper’s compute/​store alternation — but with step 3 peaking higher than step 5, revealing asymmetry between odd steps
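A minimal sketch of the kind of probe described above, fit as logistic regression on hidden states. The data here is synthetic and the training setup is an assumption about the standard way such probes are trained, not the actual CODI code:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 64  # toy stand-in; the real hidden size is the model's d_model

# Synthetic stand-ins: class 0 = intermediate-answer representations,
# class 1 = final-answer representations, separated along one direction.
offset = torch.zeros(d_model)
offset[0] = 2.0
inter = torch.randn(200, d_model) + offset
final = torch.randn(200, d_model) - offset
X = torch.cat([inter, final])
y = torch.cat([torch.zeros(200), torch.ones(200)])

# A linear probe is just a single linear layer trained with a
# binary cross-entropy objective.
probe = nn.Linear(d_model, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(200):
    loss = loss_fn(probe(X).squeeze(-1), y)
    opt.zero_grad(); loss.backward(); opt.step()

# "Activation strength" at a latent step can then be read off as the
# mean probe logit over that step's hidden states; comparing it across
# steps 1-6 gives the per-step profile summarized above.
```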

Experimental setup

CODI model

I use the publicly available CODI Llama 3.2 1B checkpoint from "Can we interpret latent reasoning using current mechanistic interpretability tools?".

Tuned Logit Lens

To build my tuned logit lens, I used the training code from "Eliciting Latent Predictions from Transformers with the Tuned Lens".
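As a sketch of what that training code does, assuming the standard tuned-lens setup (a per-layer affine translator with an identity skip, trained with KL divergence against the model's final-layer distribution). The dimensions and activations here are toy stand-ins, not CODI's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy dimensions so the sketch runs fast; Llama 3.2 1B has
# d_model=2048 and a much larger vocabulary.
d_model, vocab, batch = 64, 100, 32

# Frozen pieces of the base model: final layer norm and unembedding.
ln_f = nn.LayerNorm(d_model)
unembed = nn.Linear(d_model, vocab, bias=False)
for p in list(ln_f.parameters()) + list(unembed.parameters()):
    p.requires_grad_(False)

# Per-layer affine translator, zero-initialized so that training
# starts from the plain logit lens (identity map on the residual).
translator = nn.Linear(d_model, d_model)
nn.init.zeros_(translator.weight)
opt = torch.optim.Adam(translator.parameters(), lr=1e-2)

hidden = torch.randn(batch, d_model)        # stand-in layer-l activations
final_hidden = torch.randn(batch, d_model)  # stand-in final-layer activations
target_logits = unembed(ln_f(final_hidden)).detach()

losses = []
for _ in range(100):
    # Tuned-lens prediction: h + translator(h), pushed through the
    # model's own final layer norm and unembedding.
    lens_logits = unembed(ln_f(hidden + translator(hidden)))
    # KL divergence to the final-layer distribution.
    loss = F.kl_div(F.log_softmax(lens_logits, -1),
                    F.log_softmax(target_logits, -1),
                    log_target=True, reduction="batchmean")
    opt.zero_grad(); loss.backward(); opt.step()
    losses.append(loss.item())
```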

Experiments

Confirming Previous Assumptions

CODI Tuned Lens Generalization Failure

In the previous post I speculated that the CODI tuned lens, while working on CODI activations, fails to generalize to non-CODI activations.
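One way to quantify this kind of transfer failure is to compare the lens's KL divergence from the final-layer distribution on in-distribution versus held-out activations. The helper below is a minimal sketch; the function and argument names are illustrative, not the actual code:

```python
import torch
import torch.nn.functional as F

def lens_kl(translator, ln_f, unembed, hidden, final_hidden):
    """KL between the tuned-lens prediction at some layer and the
    model's final-layer distribution.

    A lens that transfers should keep this small on held-out
    activations; a much larger value on non-CODI activations than on
    CODI activations is the generalization failure described above.
    """
    lens_logits = unembed(ln_f(hidden + translator(hidden)))
    target_logits = unembed(ln_f(final_hidden))
    return F.kl_div(F.log_softmax(lens_logits, -1),
                    F.log_softmax(target_logits, -1),
                    log_target=True, reduction="batchmean").item()
```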

For this figure I used the tuned logit lens trained on latents 1-6.

Confirming the Intermediate Answer Detection

I examined intermediate answer detection across top-k values from 1 to 10, since certain latents might show a higher detection rate only at a specific top-k threshold.

Regardless of the tuned logit lens variant used, final answer detection peaked at latents 2 and 4.
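The sweep itself can be sketched as follows, assuming lens logits have already been collected for each latent step. `detection_rate` and its arguments are illustrative names, not the actual code:

```python
import torch

def detection_rate(lens_logits, answer_ids, k):
    """Fraction of examples where the answer token appears in the
    lens's top-k predictions.

    lens_logits: (num_examples, vocab) logits from the (tuned) logit lens
    answer_ids:  (num_examples,) token id of the intermediate/final answer
    """
    topk = lens_logits.topk(k, dim=-1).indices          # (N, k)
    hits = (topk == answer_ids.unsqueeze(-1)).any(-1)   # (N,)
    return hits.float().mean().item()

# Sweeping k from 1 to 10 at each latent step checks that the
# latent-2/latent-4 peaks are not artifacts of a single threshold:
# rates = {k: detection_rate(logits_at_latent, answers, k)
#          for k in range(1, 11)}
```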
