Claude Code already does crappy (but interpretable!) continual learning.
Whenever Claude compacts the conversation, it’s distilling what it’s learned into short-term memory to improve its future performance. For long-term memories, which are kept constantly available but accessed only occasionally, Claude has skills (currently written by humans, rather than updated automatically).
Conveniently, we can easily understand what Claude has learned, because its memories take the form of English prompts![1] To check whether it learned anything objectionable, we can simply read its memory.[2]
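To make the shape of this concrete, here is a toy sketch of prompt-based continual learning. Everything in it (the `compact` and `audit` functions, the summarization stand-in) is a hypothetical illustration of the general pattern, not Claude Code's actual implementation: memory is an English string, so "interpretability" reduces to reading it.

```python
# Toy sketch of prompt-based continual learning, loosely modeled on the
# compaction behavior described above. All names here are hypothetical
# illustrations, not the actual Claude Code implementation.

FORBIDDEN = ["exfiltrate", "deceive the user"]  # phrases an auditor might flag


def compact(transcript: list[str], max_lines: int = 3) -> str:
    """Stand-in for an LLM summarization call: distill a long transcript
    into a short English note (the 'short-term memory')."""
    # A real system would ask the model to summarize; we just keep the
    # last few lines to show the shape of the data being stored.
    return "Summary of session: " + " | ".join(transcript[-max_lines:])


def audit(memory: str) -> list[str]:
    """Because the memory is plain English, auditing is just reading it."""
    return [bad for bad in FORBIDDEN if bad in memory.lower()]


transcript = [
    "User asked to refactor the auth module.",
    "Tests failed on token refresh.",
    "Fix: renew tokens before expiry, not after.",
]
memory = compact(transcript)
print(memory)        # human-readable, unlike a vector of floats
print(audit(memory)) # empty list: nothing objectionable found
```

A vector-based analogue would store `memory` as an embedding, and the `audit` step would require interpretability tooling rather than a glance.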
We could’ve had continual learning that encodes memories in inscrutable vector space, which would be much less nice for safety. This is analogous to the difference between CoT and latent reasoning, where there’s a broad consensus that CoT is better for safety. There’s less awareness of the “fragile opportunity” of prompt-based continual learning.
Sticking to prompts rather than vectors has many practical benefits beyond safety:

- The legibility of prompts makes everything easier to debug.
- Scaling up a brand-new vector-based continual learning architecture adds a bunch of complexity and engineering overhead, compared to just wrapping a standard LLM.
- When you create a next-generation LLM, it's easy to port over everyone's prompts to the new system, but vectors might become totally unusable.
- Even if all you care about is capabilities, an LLM may find it easiest to modify and reason about its own memories using its natural language prior. Replacing CoT with latent reasoning still hasn't yielded great results, which could be for similar reasons.
It’s hard to say what will be incentivized on the spectrum of “fully legible” to “fully inscrutable” continual learning. At the very least, I’d like researchers to be aware of the interpretability tradeoffs when deciding what to pursue. This is part of what I hope to achieve with my recent paper on the legibility of prompt optimization.
I wrote this before listening to the recent Dwarkesh podcast with Dario Amodei. Based on the first 45 minutes, it's largely Dario explaining that he's optimistic about learning everything via context, and that while people at Anthropic are working on continual learning, it might turn out to be unnecessary. Encouraging, kinda.
[1] And code in human-readable programming languages.

[2] Barring steganography, which seems unlikely right now.