Yogesh Prabhu

Karma: 22

obsessed with interp

Yogesh Prabhu 24 Jun 2026 4:51 UTC
2 points
in reply to: Kurt H. Pieper’s comment on: Can You Hide From a Natural Language Autoencoder?
1. Yes, experiment 1 optimizes over an activation vector.
2. I verified this by patching the attack vector back into the original model and checking whether the response changes. A judge model ran a binary eval on whether the response stays true to its original answer, and the result is in the table: "Source model: answer label unchanged 99.6%".

Can You Hide From a Natural Language Autoencoder?

Yogesh Prabhu 24 Jun 2026 2:41 UTC

11 points

2 comments · 7 min read · LW link

(yogesh.bearblog.dev)

The Dual-Use Gap

Yogesh Prabhu 14 Jun 2026 17:43 UTC

5 points

2 comments · 4 min read · LW link

(yogesh.bearblog.dev)

The Pragmatic Interpretability Trap

Yogesh Prabhu 11 May 2026 4:06 UTC

7 points

0 comments · 3 min read · LW link

(yogesh.bearblog.dev)