Interesting. Thanks for writing up the post. I’m reasonably persuaded that this might work, though I am concerned that long chains of RL are just sufficiently fucked functions, with enough butterfly effects, that this wouldn’t be well approximated by this process. I would be interested to see someone try, though.
I am concerned that long chains of RL are just sufficiently fucked functions with enough butterfly effects that this wouldn’t be well approximated by this process.
This is a concern. Two possible replies:
If it’s truly a chaotic system, then there’s no good way to estimate the expectation with any method, so this wouldn’t be a failure specific to this process.
In reality, it could be that the effects of individual neurons are not very chaotic, but that this particular estimate of the gradient is very chaotic. Previous work actually shows that policy gradients are much less chaotic than the ‘reparameterization trick’ (i.e., when the transition function is continuous and you differentiate through it). It could be that finite differences (resampling many rollouts with the neuron activated vs. ablated) actually estimates effects better, with less variance. We’ll see.
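To make the finite-differences idea concrete, here’s a minimal sketch of the estimator I have in mind: mean return with the neuron intact minus mean return with it zeroed out, averaged over many resampled rollouts. The toy policy, toy environment, and zero-ablation are my own illustrative assumptions, not anything specified in the post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer policy: obs -> hidden (ReLU) -> action logits.
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 2))

def act(obs, ablate=None):
    h = np.maximum(obs @ W1, 0.0)      # hidden activations
    if ablate is not None:
        h[ablate] = 0.0                # zero out one hidden neuron
    logits = h @ W2
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return rng.choice(2, p=p)          # sample an action

def rollout(ablate=None, horizon=20):
    """One episode of a toy environment: reward 1 whenever the action
    matches the sign of the first observation coordinate."""
    total = 0.0
    for _ in range(horizon):
        obs = rng.normal(size=4)
        a = act(obs, ablate)
        total += float(a == (obs[0] > 0))
    return total

def finite_difference_effect(neuron, n=2000):
    """Estimate the neuron's effect on expected return: mean return with
    the neuron intact minus mean return with it ablated, each averaged
    over freshly resampled rollouts."""
    base = np.mean([rollout(None) for _ in range(n)])
    ablated = np.mean([rollout(neuron) for _ in range(n)])
    return base - ablated

print(finite_difference_effect(neuron=3))
```

The variance of this estimator shrinks with the number of rollouts, so whether it beats a gradient-based estimate comes down to how chaotic the per-rollout returns actually are.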