Yeah perhaps I was a bit glib here. Let’s break this down in some more detail.
The vector in the residual stream does not describe goodness intrinsically. Instead it represents the degree of goodness as something that can be manipulated in a linear fashion, i.e. it is a projection of our concept of goodness onto a linearly represented scale.
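A minimal numpy sketch of that point (all names, shapes and numbers here are invented for illustration): the scalar "degree of goodness" is just the projection coefficient of an activation onto a unit direction, and because the representation is linear, adding alpha units of the direction shifts that scalar by exactly alpha.

```python
import numpy as np

# Hypothetical setup: some residual-stream activations and a unit
# 'goodness' direction (both random stand-ins, not from a real model).
rng = np.random.default_rng(0)
d_model = 64
acts = rng.normal(size=(10, d_model))      # 10 token activations
goodness = rng.normal(size=d_model)
goodness /= np.linalg.norm(goodness)       # unit direction

# The 'degree of goodness' is the projection coefficient...
scores = acts @ goodness

# ...and it can be manipulated linearly: steering by adding alpha units
# of the direction shifts every score by exactly alpha.
alpha = 3.0
steered = acts + alpha * goodness
assert np.allclose(steered @ goodness, scores + alpha)
```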
The model’s actual conception of goodness presumably lies in the combination of the columns of all the weight matrices that the ‘goodness’ direction activates. I.e. if we rotate the model’s weights such that the ‘goodness’ direction becomes a basis vector, then the goodness direction activates one column of the weight matrix in e.g. the first MLP layer. These weights encode some correlation structure between the ‘goodness’ direction/subspace and the other directions/subspaces the model has learnt.
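This rotation argument can be checked concretely in a toy setting (shapes and names are my own, not from any real model): build an orthonormal basis whose first vector is the 'goodness' direction v, rotate a toy MLP input weight matrix into that basis, and verify that an activation along v now touches exactly one column of the rotated matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_mlp = 32, 128
W_in = rng.normal(size=(d_mlp, d_model))   # toy MLP input weights
v = rng.normal(size=d_model)
v /= np.linalg.norm(v)                     # unit 'goodness' direction

# Orthonormal basis with v as the first column, via QR.
M = np.column_stack([v, rng.normal(size=(d_model, d_model - 1))])
Q, _ = np.linalg.qr(M)
if Q[:, 0] @ v < 0:                        # QR is unique only up to sign
    Q[:, 0] *= -1

# New coordinates x' = Q.T @ x, rotated weights W_rot = W_in @ Q, so the
# MLP pre-activations are unchanged: W_rot @ x' == W_in @ x.
W_rot = W_in @ Q
x = 2.5 * v                                # activation along 'goodness'
assert np.allclose(Q.T @ x, 2.5 * np.eye(d_model)[:, 0])  # v -> e_0
assert np.allclose(W_rot[:, 0], W_in @ v)  # v reads out a single column
```

The column `W_in @ v` is exactly the bundle of weights the ‘goodness’ direction activates in this layer.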
This correlation structure in the weights presumably encodes what ‘the model thinks of goodness’. How large this is I’m not sure, but it can’t be that massive: some of these models aren’t that big (e.g. 8B parameters) and they represent a lot of other information as well. So some decent approximation of goodness is encodable in ≤8GB at fp8. It would be really interesting to see if we could somehow quantify ‘the fraction of the model weights that deal with X’, because that is really our encoding length of the concept.
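One very crude proxy for ‘the fraction of the weights that deal with X’ (this metric is my own assumption, not an established measure): since rotation preserves the Frobenius norm, the share of a layer’s squared weight norm sitting in the column that the direction v activates is just ||W_in @ v||² / ||W_in||_F². For a random direction in d_model dimensions this comes out around 1/d_model, so anything well above that baseline would suggest the layer devotes disproportionate capacity to the direction.

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_mlp = 32, 128
W_in = rng.normal(size=(d_mlp, d_model))   # toy MLP input weights
v = rng.normal(size=d_model)
v /= np.linalg.norm(v)                     # unit 'goodness' direction

# Share of W_in's squared Frobenius norm lying in the column activated
# by v after rotating v onto a basis axis (rotation-invariant, so no
# explicit rotation is needed).
frac = np.linalg.norm(W_in @ v) ** 2 / np.linalg.norm(W_in, "fro") ** 2
print(f"fraction of weight 'mass' along v: {frac:.4f}")
```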
Maybe an interesting way to think about this is as some kind of imaginary PCA over possible actions and their intrinsic goodness according to our rankings. I would imagine that, like many things, a very large degree of the ‘variance’ can be explained by the first N PCA factors, where N is pretty small, like <10. Certainly there will probably be a ‘generic goodness’ direction similar to the emergent misalignment one we find in LLMs, plus additional, increasingly subtle, opposing conceptions of goodness. I guess the way to phrase this is: at what N will the ‘elbow’ in this hypothetical PCA plot be? My suspicion is that the LLM is probably doing something similar with its moral representations, and that we will be able to find additional ‘goodness’ vectors which handle different aspects of how we perceive goodness, with the generic ‘goodness’ vector we find in emergent misalignment studies being just the first principal component in this space. However, my prior is that the N needed to explain e.g. 95% of the variance is not going to be that big.
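The hypothetical PCA elbow can be sketched on synthetic data (everything here is a stand-in generated by assumption, not real ratings): if ratings of many ‘actions’ on many goodness-related criteria are generated from a handful of latent factors plus noise, the cumulative variance explained jumps to near 1 at the number of latent factors, which is the elbow.

```python
import numpy as np

# Toy data: 500 'actions' rated on 40 criteria, generated (by
# construction) from 5 latent factors plus a little noise.
rng = np.random.default_rng(3)
n_actions, n_criteria, n_latent = 500, 40, 5
latent = rng.normal(size=(n_actions, n_latent))
mixing = rng.normal(size=(n_latent, n_criteria))
ratings = latent @ mixing + 0.1 * rng.normal(size=(n_actions, n_criteria))

# PCA via SVD of the centred matrix.
X = ratings - ratings.mean(axis=0)
s = np.linalg.svd(X, compute_uv=False)
var_explained = s**2 / np.sum(s**2)
cum = np.cumsum(var_explained)

# The 'elbow' N: smallest number of components covering 95% of variance.
N = int(np.searchsorted(cum, 0.95) + 1)
print(f"N components for 95% variance: {N}")
```

With this setup N should land at (or very near) the number of generating factors, which is the kind of small-N outcome the paragraph above is betting on.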