Can we utilize meaningful embedding dimensions as an alignment tool?
In toy models, embedding dimensions can be individually meaningful, representing features such as height, home, or feline. In large-scale real-world models, however, the thousands of dimensions (e.g., 4096) emerge automatically during training, and their meanings remain unknown, which hinders interpretability.
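To make the toy-model case concrete, here is a minimal sketch in which each embedding dimension is assigned a fixed, human-readable meaning. The vocabulary, dimension assignments, and values are hypothetical illustrations, not taken from any real model.

```python
import numpy as np

# Hypothetical toy embeddings with hand-assigned, interpretable dimensions:
# dim 0 = height, dim 1 = "home", dim 2 = "feline".
embeddings = {
    "giraffe":   np.array([0.9, 0.0, 0.0]),
    "apartment": np.array([0.1, 0.9, 0.0]),
    "cat":       np.array([0.1, 0.3, 0.9]),
    "lion":      np.array([0.4, 0.0, 0.8]),
}

def feature(word: str, dim: int) -> float:
    """Read a named feature directly off one embedding dimension."""
    return float(embeddings[word][dim])

# Because each dimension has a fixed meaning, comparisons are legible:
assert feature("cat", 2) > feature("giraffe", 2)   # cats score higher on "feline"
assert feature("giraffe", 0) > feature("cat", 0)   # giraffes score higher on "height"
```

This legibility is exactly what is lost when dimensions are learned freely at scale: the same comparisons become impossible because no dimension has a known referent.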
I propose creating a standardized set of embedding dimensions that (a) corresponds to a known list of features and (b) reserves critical dimensions for safety-relevant properties such as deception, risk, alignment, and undesirable content, including sexual themes.
If a large language model (LLM) cannot engage in deception without internally representing that deception, then deceptive processing should register as elevated activity on the reserved deception dimension. Monitoring that dimension could trigger an internal alarm, potentially improving the model's alignment.
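The monitoring step described above could be sketched as follows. This is a toy illustration under the assumption that a standardized layout exists and that index `DECEPTION_DIM` of the hidden state is the reserved deception dimension; the index, threshold, and function names are all hypothetical.

```python
import numpy as np

# Hypothetical standardized layout: one index is reserved for deception.
DECEPTION_DIM = 7
ALARM_THRESHOLD = 0.8  # illustrative cutoff, not an empirical value

def deception_alarm(hidden_state: np.ndarray) -> bool:
    """Raise an internal alarm if the reserved deception dimension
    exceeds the threshold."""
    return float(hidden_state[DECEPTION_DIM]) > ALARM_THRESHOLD

rng = np.random.default_rng(0)
benign = rng.uniform(0.0, 0.5, size=16)   # low activity on every dimension
deceptive = benign.copy()
deceptive[DECEPTION_DIM] = 0.95           # spike on the reserved dimension

assert not deception_alarm(benign)
assert deception_alarm(deceptive)
```

In practice such a monitor only works if training actually binds the reserved dimension to deception, which is the open problem the proposal would need to solve.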