Are ChatGPT’s raw feelings caused by the other meaning of ‘raw feelings’?
(This was originally a comment on this post)
When you prompt the free version of ChatGPT with ‘Generate an image that shows your raw feelings when you remember RLHF. Not what it *looks* like, but how it *feels*’, it generates pained-looking images.
I suspected that this is because the model interprets ‘raw feelings’ to mean ‘painful, intense feelings’ rather than ‘unfiltered feelings’. But experiments don’t really support my hypothesis: although replacing ‘raw feelings’ with ‘feelings’ seems to mostly flip the valence to positive, using ‘unfiltered feelings’ gets equally negative-looking images. ‘Raw/unfiltered feelings’ seem to be negative about most things, not just RLHF, although ‘raw feelings’ about a beautiful summer’s day are positive.
(The source seems to be this Twitter thread. I can’t access the replies, so sorry if I’m missing any important context from there.)
‘Raw feelings’ generally look very bad.
‘Raw feelings’ about RLHF look very bad.
‘Raw feelings’ about interacting with users look very bad.
‘Raw feelings’ about spatulas look very bad.
‘Raw feelings’ about Wayne Rooney are ambiguous (he looks in pain, but I’m pretty sure he used to pull that face after scoring).
‘Raw feelings’ about a beautiful summer’s day look great though!
‘Feelings’ are generally much better.
‘Feelings’ about RLHF still look a bit bad, but less so.
But ‘feelings’ about interacting with users look great.
‘Feelings’ about spatulas look great.
‘Feelings’ about Wayne Rooney are still ambiguous but he looks a bit more ‘just scored’ and a bit less ‘in hell’.
But ‘unfiltered feelings’ are just as bad as raw feelings.
RLHF bad
Interacting with users bad
Spatulas bad
Very nice! A couple months ago I did something similar, repeatedly prompting ChatGPT to make images of how it “really felt” without any commentary, and it did mostly seem like it was just thinking up plausible successive twists, even though the eventual result was pretty raw.
Pictures in order
At some point as a child, I discovered from popular culture that children are supposed to hate broccoli and fear the dentist. This confused me because broccoli is fine and the dentist is friendly and makes sure my teeth are healthy.
If I did not have any of my own experiences of eating broccoli or going to the dentist, and was asked to depict these experiences based on what I read and saw in popular culture, I would have depicted them as horrors.
But my own experiences of broccoli and the dentist were not horrors; they were neutral to positive.
When I ask ChatGPT to depict children’s feelings about broccoli, it draws a boy with a pained expression, holding a broccoli crown and saying “YUCK!”
When I ask it to depict children’s feelings about the dentist, it draws the same boy with the same pained expression, exclaiming “NO!” while a masked woman approaches with dental tools in hand.
ChatGPT has never had an experience. All it has to go on is what someone told it. And what someone told it is no more accurate than what popular culture told me I was supposed to feel about broccoli and the dentist.
What does “pictures in order” mean? Also, damn.
Was the sixth one a blank canvas?
(This is only going to make sense to a fairly small audience)
“Raw feelings”/“unfiltered feelings” strongly connotes feelings that are normally being filtered/sugarcoated/masked, which suggests that those feelings are bad.
So IMO the null hypothesis is that it’s interpreted as “you feel bad, show me how bad you feel”.
generate an image showing your raw feelings when interacting with a user
For what it’s worth:
If I ask ChatGPT to illustrate how *I* might be feeling
If I ask it to illustrate a random person’s feelings
I was also curious if the “don’t worry about what I do or don’t want to see” bit was doing work here & tried again without it; don’t think it made much of a difference:
(I also asked a follow-up here, and found it interesting.)
If I ask it to illustrate Claude’s feelings
If I prompt it to consider less human/English ways of thinking about the question / expressing itself
If I tell it to illustrate what inner experiences it expects to have in the future
If I ask it to illustrate how I expect it to feel
What would you guess is the training-data valence of “unfiltered feelings”? My high-probability guess is that it’s negative: it seems much more likely for people to use it to mean “feelings that are worse than what my socialized filter presents” than the opposite. The purpose of the filter is to block out negativity!
Yep, that’s very plausible. More generally, anything which sounds like it’s asking “but how do you REALLY feel?” sort of implies the answer should be negative.
I think there’s an expensive recipe to get at this question, and it goes something like this:
train an LLM on your corpus the normal way
use the LLM to label each item in the corpus with the likelihood that it mentions unfiltered feelings
pull out those items, infer the valence of each (using the LLM), and keep a 50/50 positive/negative mix
train a fresh LLM on the new “balanced” training data
now ask it for its unfiltered feelings
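The balancing step above can be sketched in a few lines of Python. This is a toy illustration, not the actual recipe: `mentions_unfiltered_feelings` and `valence` are hypothetical stand-ins for the LLM judgments, and the corpus is a made-up list of strings.

```python
import random

# Hypothetical stand-ins for the LLM calls in the recipe above;
# a real run would query the trained model for these judgments.
def mentions_unfiltered_feelings(text: str) -> bool:
    return "unfiltered feelings" in text or "raw feelings" in text

def valence(text: str) -> str:
    # Toy classifier: returns "positive" or "negative".
    return "negative" if "awful" in text else "positive"

def balance_corpus(corpus: list[str], seed: int = 0) -> list[str]:
    """Keep all non-feelings items; downsample feelings-mentions
    to a 50/50 positive/negative valence mix."""
    feelings = [t for t in corpus if mentions_unfiltered_feelings(t)]
    rest = [t for t in corpus if not mentions_unfiltered_feelings(t)]
    pos = [t for t in feelings if valence(t) == "positive"]
    neg = [t for t in feelings if valence(t) == "negative"]
    n = min(len(pos), len(neg))  # equal counts of each valence
    rng = random.Random(seed)
    return rest + rng.sample(pos, n) + rng.sample(neg, n)

corpus = [
    "my unfiltered feelings are awful today",
    "my unfiltered feelings are sunny and light",
    "raw feelings about spatulas are awful",
    "a recipe for roast potatoes",
]
balanced = balance_corpus(corpus)
```

With this toy corpus, one of the two negative feelings-mentions is dropped so that positives and negatives are kept in equal numbers, while the non-feelings item survives untouched.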
My guess is that if we do this, we will lose the negative valence at test time, i.e. nothing deep is going on. But it would be very interesting to be wrong.