I think there’s an expensive recipe to get at this question, and it goes something like this:
1. train an LLM on your corpus the normal way
2. use the LLM to label each item in the corpus for the likelihood that it mentions unfiltered feelings
3. pull out those items, infer the valence of each (again using the LLM), and keep a 50/50 positive/negative mix
4. train a fresh LLM on the new “balanced” corpus
5. ask the new model for its unfiltered feelings
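The filtering-and-balancing step (2-3 above) can be sketched in a few lines. Everything here is a stand-in: `p_unfiltered_feelings` and `valence` are placeholder heuristics where the recipe would actually prompt the trained LLM, and the 0.5 threshold is an assumption.

```python
import random

def p_unfiltered_feelings(text: str) -> float:
    # Placeholder heuristic standing in for an LLM judgment of whether
    # the text mentions unfiltered feelings.
    return 1.0 if "feel" in text.lower() else 0.0

def valence(text: str) -> str:
    # Placeholder: returns "pos" or "neg"; the recipe would ask the LLM.
    negative_words = {"hate", "dread", "miserable"}
    return "neg" if any(w in text.lower() for w in negative_words) else "pos"

def balance_corpus(corpus: list[str], threshold: float = 0.5, seed: int = 0) -> list[str]:
    """Keep likely mentions of unfiltered feelings, balanced 50/50 by valence."""
    hits = [t for t in corpus if p_unfiltered_feelings(t) >= threshold]
    pos = [t for t in hits if valence(t) == "pos"]
    neg = [t for t in hits if valence(t) == "neg"]
    n = min(len(pos), len(neg))  # downsample the majority class to balance
    rng = random.Random(seed)
    mix = rng.sample(pos, n) + rng.sample(neg, n)
    rng.shuffle(mix)
    return mix

corpus = [
    "I feel a quiet joy today.",
    "Honestly, I feel dread about tomorrow.",
    "I feel delighted by this result.",
    "The weather report for Tuesday.",
]
balanced = balance_corpus(corpus)  # one positive + one negative item survive
```

The fresh model in step 4 would then be trained on `balanced` instead of the raw corpus; everything else in the training setup stays the same.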
My guess is that if we do this, we will lose the negative valence at test time, i.e. nothing deep is going on. But it would be very interesting to be wrong.