Current alignment methods like RLHF and Constitutional AI often rely on small pools of human evaluators. But those evaluator pools might not truly represent humanity or its diverse values. How can we model all human preferences with limited compute?
Maybe we should try collecting a vast dataset of written life experiences and beliefs from people worldwide.
We could then filter and augment this data to aim for better population representation. Vectorize these entries and use k-means to find ‘k’ representative viewpoints.
These ‘k’ vectors could then serve as a more representative proxy for humanity’s values when we evaluate AI alignment. Thoughts? Potential issues?
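The clustering step could be sketched roughly as follows. This is a minimal k-means in plain NumPy under stated assumptions: the random vectors stand in for real sentence embeddings of the collected entries (in practice you'd use an actual sentence encoder), and `k=5` is an arbitrary illustrative choice.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain k-means: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # initialize centroids from k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each embedding to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of its cluster
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels

# Toy stand-in for embedded survey entries: 300 vectors in 768 dims
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 768))
centroids, labels = kmeans(X, k=5)
print(centroids.shape)  # (5, 768): five 'representative viewpoint' vectors
```

The centroids would then be the ‘k’ proxy viewpoints; one obvious design question is how to pick k, and whether mean vectors of clusters are even meaningful summaries of value-laden text.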