I don’t think this is evidence that values are low-dimensional in the sense of having low description length. It shows that the models in question contain a one-dimensional subspace indicating how things in the model’s current thoughts are judged along some already-known goodness axis, not that the goodness axis itself is an algorithmically simple object. The floats that make up that subspace don’t describe goodness; they rely on the models’ pre-existing understanding of goodness to work. I’d guess the models also have only one, or a very small number of, directions for ‘elephant’, but that doesn’t mean ‘elephant’ is a concept you could communicate with a single 16-bit float to an alien who’s never heard of elephants. The ‘feature dimension’ here is not the feature dimension relevant for predicting how many data samples it takes a mind to learn about goodness, or to learn about elephants.
Yeah, perhaps I was a bit glib here. Let’s break this down in some more detail.
The vector in the residual stream does not describe goodness intrinsically. Instead, it represents the degree of goodness and can be manipulated in a linear fashion; i.e. it is a projection of our concept of goodness onto a linearly represented scale.
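For concreteness, here is a minimal numpy sketch of what ‘a projection onto a linearly represented scale’ cashes out to; the activation, the ‘goodness’ direction, and the steering coefficient below are all made up for illustration, not taken from any real model.

```python
import numpy as np

d_model = 64                      # residual stream width (illustrative)
rng = np.random.default_rng(0)

# Hypothetical unit vector that a probe / steering experiment found for 'goodness'.
goodness_dir = rng.normal(size=d_model)
goodness_dir /= np.linalg.norm(goodness_dir)

# A residual-stream activation at some token position (made up).
x = rng.normal(size=d_model)

# The 'degree of goodness' is just the scalar projection onto that direction...
goodness_score = x @ goodness_dir

# ...and steering is a linear manipulation of that single coordinate.
alpha = 2.0                       # how much extra 'goodness' to inject
x_steered = x + alpha * goodness_dir

# The score moves by exactly alpha; nothing here *describes* goodness,
# it only reads off / shifts a coordinate the model already uses.
assert np.isclose(x_steered @ goodness_dir, goodness_score + alpha)
```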
The model’s actual conception of goodness presumably lies in the combination of the columns of all the weight matrices that the ‘goodness’ direction activates. That is, if we rotate the model’s weights so that the ‘goodness’ direction lines up with a basis vector, then the goodness direction activates a single column of the weight matrix in, e.g., the first MLP layer. Those weights encode some correlation structure between the ‘goodness’ direction/subspace and the other directions/subspaces the model has learnt.
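Here is a toy numpy sketch of that rotation argument, with made-up shapes and random weights standing in for a real model: once we change basis so the ‘goodness’ direction is the first coordinate, its entire effect on the (hypothetical) first MLP layer flows through a single column of the rotated weight matrix, namely W_in @ d.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_mlp = 64, 256                   # illustrative sizes

W_in = rng.normal(size=(d_mlp, d_model))   # stand-in for the first MLP layer's input weights
d = rng.normal(size=d_model)
d /= np.linalg.norm(d)                     # unit 'goodness' direction

# Build an orthonormal basis whose first vector is d (QR on [d | random columns]).
M = np.concatenate([d[:, None], rng.normal(size=(d_model, d_model - 1))], axis=1)
Q, _ = np.linalg.qr(M)
R = Q.T                                    # rotation: (R @ x)[0] == x . d (up to sign)
if R[0] @ d < 0:                           # QR may flip the sign; force the first axis to be +d
    R[0] *= -1

W_rot = W_in @ R.T                         # the same weights, expressed in the rotated basis

x = rng.normal(size=d_model)
x_rot = R @ x

# In the rotated basis, the goodness coordinate of x only ever meets the first
# column of W_rot, which is exactly W_in @ d: that column (and its downstream
# correlations) is where 'what the model does with goodness' lives.
assert np.isclose(x_rot[0], x @ d)
assert np.allclose(W_rot[:, 0], W_in @ d)
assert np.allclose(W_rot @ x_rot, W_in @ x)
```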
This correlation structure in the weights presumably encodes what ‘the model thinks of goodness’. How large it is I’m not sure, but it can’t be that massive: some of these models aren’t that big (e.g. 8B parameters) and they represent a lot of other information as well, so some decent approximation of goodness is encodable in <=8 GB at fp8. It would be really interesting to see if we could somehow quantify ‘the fraction of the model weights that deal with X’, because that is really our encoding length of the concept.
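Just to make that bound explicit; the 1% figure below is a completely made-up placeholder for the unknown ‘fraction of the weights that deal with goodness’:

```python
# Crude upper bounds on description length, assuming an 8B-parameter model at fp8.
n_params = 8e9
bytes_per_param = 1                      # fp8

total_bytes = n_params * bytes_per_param
print(f"whole model: {total_bytes / 1e9:.0f} GB")   # ~8 GB upper bound on *everything* it knows

# If, purely for illustration, 1% of the weights were 'about goodness':
goodness_fraction = 0.01                 # made-up placeholder
print(f"'goodness' share: {total_bytes * goodness_fraction / 1e6:.0f} MB")   # ~80 MB
```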
Maybe an interesting way to think about this is as some kind of imaginary PCA over possible actions and their intrinsic goodness according to our rankings. I would imagine that, like many things, a very large fraction of the ‘variance’ can be explained by the first N PCA factors, where N is pretty small, like <10. There will probably be a ‘generic goodness’ direction, similar to the emergent-misalignment one we find in LLMs, plus additional, increasingly subtle, opposing conceptions of goodness. I guess the way to phrase this is: at what N is the ‘elbow’ in this hypothetical PCA plot? My suspicion is that the LLM is probably doing something similar, and that we will be able to find additional ‘goodness’ vectors which handle different aspects of how we perceive goodness, with the generic ‘goodness’ vector found in emergent-misalignment studies being just the first principal component in this space. However, my prior is that the N needed to explain e.g. 95% of the variance is not going to be that big.
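As a sketch of what this hypothetical PCA would look like mechanically: the ‘action ratings’ below are synthetic, generated from a handful of planted moral factors, so the elbow and the small N for 95% of the variance are built in rather than discovered.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_features, n_latent = 2000, 50, 5   # synthetic: 5 'true' moral factors

# Synthetic ratings: each action's judgments are driven by a few latent factors
# (one dominant 'generic goodness' factor plus subtler ones) with noise on top.
latents = rng.normal(size=(n_actions, n_latent))
loadings = rng.normal(size=(n_latent, n_features)) * np.array([5, 3, 2, 1, 0.5])[:, None]
ratings = latents @ loadings + 0.5 * rng.normal(size=(n_actions, n_features))

# PCA via SVD on the centered matrix.
X = ratings - ratings.mean(axis=0)
_, s, _ = np.linalg.svd(X, full_matrices=False)
explained = s**2 / np.sum(s**2)
cumulative = np.cumsum(explained)

print("explained variance per PC:", np.round(explained[:8], 3))
n_95 = int(np.argmax(cumulative >= 0.95)) + 1
print(f"PCs needed for 95% of variance: {n_95}")   # small, because the data was built that way
```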
Then what would you say if the actions prescribed by ethics were hard to tell apart from those prescribed by some version of decision theory? Would communicating ethics to an alien be far harder than explaining the decision theory plus the aesthetic-like values of the human community with which the alien communicates?
Yes, the description length of each dimension can still be high, but not arbitrarily high.