An embedding decoder model, trained with a different objective on a different dataset, can decode another model’s embeddings surprisingly accurately.
Seems pretty relevant to the Natural Categories hypothesis.
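One intuition for why this can work (a toy construction of my own, not the cited experiment): if two models’ embeddings are both roughly linear images of the same underlying semantic structure, then a simple linear map fit on a handful of paired examples can translate one model’s space into the other’s, after which decoding transfers. The sketch below fakes two “models” as random linear maps of a shared latent, fits a least-squares map between their spaces, and checks nearest-neighbor recovery on held-out items.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "models" embed the same 200 items via different random linear maps
# of a shared 16-dim latent ("meaning") space.
n_items, latent_dim, dim_a, dim_b = 200, 16, 32, 48
latents = rng.normal(size=(n_items, latent_dim))          # shared structure
model_a = latents @ rng.normal(size=(latent_dim, dim_a))  # model A embeddings
model_b = latents @ rng.normal(size=(latent_dim, dim_b))  # model B embeddings

train, test = np.arange(100), np.arange(100, 200)
# Least-squares linear map from A's space to B's space, fit on half the items.
W, *_ = np.linalg.lstsq(model_a[train], model_b[train], rcond=None)

# Nearest-neighbor retrieval on the held-out half: does each translated
# vector land closest to the correct item among B's test embeddings?
decoded = model_a[test] @ W
dists = np.linalg.norm(decoded[:, None] - model_b[test][None, :], axis=-1)
accuracy = (dists.argmin(axis=1) == np.arange(100)).mean()
print(accuracy)  # → 1.0 in this idealized linear setting
```

In this idealized setting recovery is exact; the interesting empirical claim is that something similar holds approximately for real models trained independently, which is what makes it suggestive for natural categories.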
P.S.
My current favorite story for “how we solve alignment” is:

1. Solve the natural categories hypothesis.
2. Add corrigibility.
3. Combine these to build an AI that “does what I meant, not what I said” (DWIM).
4. Distribute the code/a foundation model for such an AI as widely as possible, so it becomes the default whenever anyone builds an AI.
5. Build some kind of “coalition of the willing” to ensure that human-compatible AI always has a large margin of computational advantage.
Could you explain a bit more how this is relevant to building a DWIM AI?