An embedding decoder model, trained with a different objective on a different dataset, can decode another model’s embeddings surprisingly accurately.
Seems pretty relevant to the Natural Categories hypothesis.
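One intuition for why this can work (a toy construction of my own, not the cited experiment): if two models’ embeddings are both roughly linear images of the same underlying semantic structure, then a simple linear map fit on a handful of paired examples can translate one model’s space into the other’s, after which decoding transfers. The sketch below fakes two “models” as random linear maps of a shared latent, fits a least-squares map between their spaces, and checks nearest-neighbor recovery on held-out items.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "models" embed the same 200 items via different random linear maps
# of a shared 16-dim latent ("meaning") space.
n_items, latent_dim, dim_a, dim_b = 200, 16, 32, 48
latents = rng.normal(size=(n_items, latent_dim))          # shared structure
model_a = latents @ rng.normal(size=(latent_dim, dim_a))  # model A embeddings
model_b = latents @ rng.normal(size=(latent_dim, dim_b))  # model B embeddings

train, test = np.arange(100), np.arange(100, 200)
# Least-squares linear map from A's space to B's space, fit on half the items.
W, *_ = np.linalg.lstsq(model_a[train], model_b[train], rcond=None)

# Nearest-neighbor retrieval on the held-out half: does each translated
# vector land closest to the correct item among B's test embeddings?
decoded = model_a[test] @ W
dists = np.linalg.norm(decoded[:, None] - model_b[test][None, :], axis=-1)
accuracy = (dists.argmin(axis=1) == np.arange(100)).mean()
print(accuracy)  # → 1.0 in this idealized linear setting
```

In this idealized setting recovery is exact; the interesting empirical claim is that something similar holds approximately for real models trained independently, which is what makes it suggestive for natural categories.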
P.S.
My current favorite story for “how we solve alignment” is:

1. Solve the natural categories hypothesis.
2. Add corrigibility.
3. Combine these to build an AI that “does what I meant, not what I said” (DWIM).
4. Distribute the code/a foundation model for such an AI as widely as possible, so it becomes the default whenever anyone builds an AI.
5. Build some kind of “coalition of the willing” to ensure that human-compatible AI always has a large margin of computational advantage.
Could you explain a bit more how this is relevant to building a DWIM AI?