Cool stuff, I like this direction. Some scattered thoughts from having thought about this before (not necessarily related to the BT method and similar, since that seems more like a nice way to measure values than to find them):
To get a good predictive orthogonal basis of values in LLMs, I think you would need some sort of correlational clustering model, since you need somewhere to start.
We want a robust basis: part of what makes value learning of AIs useful is knowing how they will act in the future, and predicting that means targeting robust features, which is easier if we can establish a set of values that are as uncorrelated with each other as possible.
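To make the correlational clustering idea concrete, here is a minimal sketch in Python. It assumes you already have a matrix of value-expression scores per context; all names and data below are hypothetical placeholders, not anything from the post:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Stand-in data: rows = prompts/contexts, columns = candidate values
# (e.g. how strongly the model expressed "honesty", "loyalty", ...).
rng = np.random.default_rng(0)
scores = rng.normal(size=(200, 12))

# Correlation between values across contexts; use 1 - |r| as a distance
# so strongly (anti-)correlated values land in the same cluster.
corr = np.corrcoef(scores.T)
dist = 1.0 - np.abs(corr)

# Agglomerative clustering on the condensed upper-triangle distances.
condensed = dist[np.triu_indices_from(dist, k=1)]
labels = fcluster(linkage(condensed, method="average"), t=0.6, criterion="distance")

# Keeping one representative value per cluster gives a roughly
# uncorrelated starting basis to probe further.
basis = [int(np.where(labels == c)[0][0]) for c in np.unique(labels)]
print("cluster labels:", labels)
print("representative value columns:", basis)
```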
I think shard theory makes sense for both humans and LLMs, and that we have specific values, like virtue-style thinking, that show up across different contexts, so it is a good idea to try to map them out.
As a consequence, I think this paper by the Meaning Alignment Institute is quite undervalued: they construct a methodology for building a relational value graph that maps really well onto contextual value representations: https://arxiv.org/abs/2404.10636
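To gesture at the shape of that kind of output (a toy illustration only, not the paper's actual pipeline; every value and context below is made up):

```python
import networkx as nx

# Values as nodes; a directed edge means the target value was judged
# wiser or more applicable than the source value in the given context.
g = nx.DiGraph()
g.add_edge("rule-following", "honesty", context="user asks for medical advice")
g.add_edge("agreeableness", "honesty", context="user asks for medical advice")
g.add_edge("honesty", "compassion", context="user is grieving")

# In-degree is a crude proxy for how often a value "wins" such upgrades.
for value in g.nodes:
    print(value, "endorsed over others:", g.in_degree(value), "time(s)")
```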
The problem, of course, is that because of character representations and simulacra you will get multiple different characters, but it would be interesting to see the extent to which a consistent character shows up underneath the existing system.
There are a bunch of existing human value maps that amount to coming up with names for things, mapping data onto an orthogonal basis, and then displaying that data; the Big Five traits are one example, and there are also some policy-focused versions of these.
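The "map data onto an orthogonal basis" step is essentially dimensionality reduction. A hedged sketch, using PCA as a stand-in for the factor analysis these inventories typically use, on random placeholder data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder survey data: 500 respondents answering 40 Likert items (1-5).
rng = np.random.default_rng(1)
responses = rng.integers(1, 6, size=(500, 40)).astype(float)

# Project onto five uncorrelated axes, which you would then inspect and
# name (the Big Five labels came from interpreting axes like these).
pca = PCA(n_components=5)
coords = pca.fit_transform(responses)  # one row per respondent on 5 axes

print("variance explained per axis:", pca.explained_variance_ratio_.round(3))
```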
Anyways, just a bunch of ideas here, but it would be really cool if someone continued this line of work.