Curated. There’s a lot of interesting stuff in here, and it definitely bolsters my feeling that, given enough years (could be quite a few), we’d eventually figure out artificial minds and build powerful ones safely.
The language-learning research feels in the direction of natural abstraction. The arithmetic experiment suggests a lack of introspection on Claude’s part (more so than deception), and the features pressuring it to maintain coherence, outweighing its safety training, are perhaps amusing and disturbing in equal measure. The lack of faithful reasoning is disturbing too, and should make us fear what will happen in cases where we can’t investigate. And of course, the evidence of planning ahead is big.
I feel like I hear it less often these days, but “just a next-token predictor” used to get invoked often as a limitation that meant LLMs could only be so dangerous. Evidence of planning ahead suggests a much more typical, and potentially agentic, mind. I also liked the surfaced quote from Chris Olah that simply to give the name of someone who studies the stars, you have to think ahead to choose “an” instead of “a”, as a trivial case. The rabbit-rhyme case is even stronger.
Subtract the existential concerns, and I’d relate to all this with fascination and excitement.