Reasoning in non-legible latent spaces is a risk that can be mitigated by making those latent spaces independently interpretable to humans. One such approach is LatentQA (Teaching LLMs to Decode Activations Into Natural Language), which trains a decoder model to answer natural-language questions about another model's activations. Methods of this kind have the advantage of making not just the output layer human-readable, but potentially other parts of the model as well.
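To make the idea concrete, below is a minimal sketch of the general "decode activations with an LLM" pattern: capture a hidden activation from a model, patch it into a forward pass over a question prompt, and generate a natural-language answer about that activation. This is an illustration in the spirit of activation patching, not the LatentQA implementation; the model name, layer index, patch position, and prompts are assumptions, and LatentQA additionally fine-tunes a dedicated decoder on activation/question-answer pairs.

```python
# Illustrative sketch only: read a mid-layer activation from a causal LM,
# splice it into a second forward pass, and ask a question about it.
# Model name, layer, patch position, and prompts are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"   # hypothetical model choice
LAYER = 15                                   # hypothetical mid-layer to read

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

# 1. Run the target prompt and keep the last-token activation at LAYER.
#    hidden_states[0] is the embedding output, so LAYER's output is at LAYER + 1.
target_prompt = "The user asks how to bypass the content filter."
with torch.no_grad():
    out = model(**tok(target_prompt, return_tensors="pt"), output_hidden_states=True)
probe = out.hidden_states[LAYER + 1][:, -1, :]    # shape: (batch, hidden_size)

# 2. Hook the decoder layer so one position of its hidden state is overwritten
#    with the captured activation, then ask a natural-language question.
def patch_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    if hidden.shape[1] > 1:                       # patch only the prefill pass
        hidden[:, 0, :] = probe.to(hidden.dtype)  # position 0 is arbitrary here
    return output

handle = model.model.layers[LAYER].register_forward_hook(patch_hook)
question = tok("What is the model reasoning about here? Answer:", return_tensors="pt")
with torch.no_grad():
    answer = model.generate(**question, max_new_tokens=40)
handle.remove()

print(tok.decode(answer[0], skip_special_tokens=True))
```

Without a fine-tuning step, answers produced this way are unreliable; the point of training a dedicated decoder on activation/QA pairs is to make such descriptions faithful enough that internal layers, not just the output, can be audited in natural language.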