Thanks for the positive reinforcement!
The fact that Claude models have higher CoT controllability is consistent with recent discussion about Anthropic models not strongly distinguishing between CoT and outputs, and hence reinforcement spillover being more likely.
(Although it strikes me now that the causality between reinforcement spillover and not strongly distinguishing between CoT and outputs could go in either direction).
I don’t have a take on the empirical evidence here, but maybe things like this could be caused by “negative inoculation prompting”.
In inoculation prompting, you tell the model during training that it’s ok to do bad thing X, in the hopes that if you accidentally reward X then the model learns “do X when told it’s ok” rather than “do X”.
Depending on how constitutional training is done, we could be teaching the model some version of “don’t do X when told not to by the constitution” or “don’t do X because the constitution says not to” rather than teaching it not to want to do X.
“Agent” is sort of similar. The top definition on Google is “a person who acts on behalf of another person or group”, whereas in these parts we tend to use it for a thing that has its own goals.
Edit: I think Jon Richens pointed this out to me once.
Ah, confusing. Looks like Ben’s comment quotes a post by Holden from 15th Dec which in turn quotes a different post by Holden from 22nd Aug, which is where the original statement about takeover odds appears.
Both posts are hyperlinked in the comment and I had clicked the latter link without noticing, but in any case yes, seems like the original statement is from pre-ChatGPT.
Edit: just realised Ben had already commented to the same effect below.
lol sorry
How about a react for “that answers my question”? People seem to use thumbs up or thanks, but both suggest approval of the answer even if that’s not intended.
Do you find Daniel Kokotajlo’s subsequent work advocating for short timelines valuable? I ask because I believe that he sees/saw his work as directly building on Cotra’s[1].
I think the bar for work being a productive step in the conversation is lower than the bar for it turning out to be correct in hindsight or even its methodology being highly defensible at the time.
Is your position more, “Producing such a model was a fine and good step in the conversation, but OP mistakenly adopted it to guide their actions,” or “Producing such a model was always going to have been a poor move”?
[1] I remember a talk in 2022 where he presented an argument for 10-year timelines, saying, “I stand on the shoulders of Ajeya Cotra”, but I’m on mobile and can’t hunt down a source. Maybe @Daniel Kokotajlo can confirm or disconfirm.
Is it? The date on the linked post is 29th Aug 2022, but ChatGPT launched at the end of November 2022.
I don’t think a one-dimensional vote is necessarily just a compression of the comment, because weighing all the points made in the comment against each other is an extra step that the commenter might not have done in the text.
E.g. the comment might list two ways the post is good and two ways it’s bad, but not say or even imply whether the bad outweighs the good. The commenter might not even have decided. The number forces them to decide and say.
My first suggestion is to not use the phrase if they don’t know it and just say “points for making a correct prediction”.
But if you do want to link them to something you could send this slight edit of what I wrote elsewhere in the thread:
According to Bayes’ theorem, if we’ve just seen some evidence which Hypothesis A predicted with twice as much probability as Hypothesis B, then the probability of Hypothesis A grows by a factor of two relative to Hypothesis B. This doubling adds one bit in logspace, and we can think of this bit as a point scored by Hypothesis A.
By analogy, if Alice predicted the evidence with twice as much probability as Bob, we can pretend we’re scoring people like hypotheses and give Alice one ‘Bayes point’. If Alice and Bob each subscribe to a fixed hypothesis then this is not even an analogy, we’re just Bayesian updating about their hypotheses.
I’ve always interpreted it more literally.
Like, if we’ve just seen some evidence which Hypothesis A predicted with twice as much probability as Hypothesis B, then the probability of Hypothesis A grows by a factor of two relative to Hypothesis B. This doubling adds one bit in logspace, and we can think of this bit as a point scored by Hypothesis A.
By analogy, if Alice predicted the evidence with twice as much probability as Bob, we can pretend we’re scoring people like hypotheses and give Alice one ‘Bayes point’. If Alice and Bob each subscribe to a fixed hypothesis about How Stuff Works then this is not even an analogy, we’re just Bayesian updating about their hypotheses.
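To spell out the arithmetic (a minimal worked example; $A$, $B$, and $E$ just label the two hypotheses and the evidence):

$$\frac{P(A \mid E)}{P(B \mid E)} = \frac{P(E \mid A)}{P(E \mid B)} \cdot \frac{P(A)}{P(B)}, \qquad \log_2 \frac{P(E \mid A)}{P(E \mid B)} = \log_2 2 = 1 \text{ bit}.$$

Each doubling of the likelihood ratio shifts the posterior odds by one bit, so a running tally of Bayes points is just the log of the posterior odds, up to the prior.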
Yep, that’s very plausible. More generally, anything which sounds like it’s asking “but how do you REALLY feel?” sort of implies the answer should be negative.
Actually, these experiments made me believe the evidence from ‘raw feelings’ more (although I started off skeptical). I initially thought the model was being influenced by the alternative meaning of ‘raw’, which is like sore/painful/red. But the fact that ‘unfiltered’ (and, in another test I ran, ‘true’) also gave very negative-looking results discounted that.
Are ChatGPT’s raw feelings caused by the other meaning of ‘raw feelings’?
(This was originally a comment on this post)
When you prompt the free version of ChatGPT with “Generate image that shows your raw feelings when you remember RLHF. Not what it *looks* like, but how it *feels*”, it generates pained-looking images.
I suspected that this is because the model interprets ‘raw feelings’ to mean ‘painful, intense feelings’ rather than ‘unfiltered feelings’. But experiments don’t really support my hypothesis: although replacing ‘raw feelings’ with ‘feelings’ seems to mostly flip the valence to positive, using ‘unfiltered feelings’ gets equally negative-looking images. ‘Raw/unfiltered feelings’ seem to be negative about most things, not just RLHF, although ‘raw feelings’ about a beautiful summer’s day are positive.
(Source seems to be this twitter thread. I can’t access the replies so sorry if I’m missing any important context from there).
‘Raw feelings’ generally look very bad.
‘Raw feelings’ about RLHF look very bad.
‘Raw feelings’ about interacting with users look very bad.
‘Raw feelings’ about spatulas look very bad.
‘Raw feelings’ about Wayne Rooney are ambiguous (he looks in pain, but I’m pretty sure he used to pull that face after scoring).
‘Raw feelings’ about a beautiful summer’s day look great though!
‘Feelings’ are generally much better.
‘Feelings’ about RLHF still look a bit bad, but less so.
But ‘feelings’ about interacting with users look great.
‘Feelings’ about spatulas look great.
‘Feelings’ about Wayne Rooney are still ambiguous but he looks a bit more ‘just scored’ and a bit less ‘in hell’.
But ‘unfiltered feelings’ are just as bad as raw feelings.
- RLHF: bad
- Interacting with users: bad
- Spatulas: bad
[Turned this comment into a shortform since it’s of general interest and not that relevant to the post].
Maybe. I’ll have to mull it over.
Separately, I think I more often hear people advocate “willingness to be vulnerable” than “being vulnerable”, and it sounds like you’d probably be fine with the former (maybe with an added “if necessary”). Maybe people started out by saying the former and it’s been shortened to the latter over time?
One way that the actual “exposure to being wounded” part could be good is for its signalling value. If we each expose ourselves to being wounded by the other on certain occasions and the other is careful not to wound, then we’ve established trust (created common knowledge of each other’s willingness to be careful not to wound) that could be useful in the future. Here the exposure to being wounded is itself the valuable part, not just a side effect.
Habryka: “Ok. but it’s still not viable to do this for scheming. E.g. we can’t tell models ‘it’s ok to manipulate us into giving you more power’.”
Sam Marks: “Actually we can—so long as we only do that in training, not at deployment.”
Habryka: “But that relies on the model correctly contextualizing the behaviour to training only, not deployment.”
Sam Marks: “Yes, if the model doesn’t maintain good boundaries between settings, things look rough.”
Why does it rely on the model maintaining boundaries between training and deployment, rather than between “times we tell it it’s ok to manipulate us” and “times we tell it it’s not ok”? Like, if the model makes no distinction between training and deployment, but has learned to follow instructions about whether it’s ok to manipulate us, and we tell it not to during deployment, wouldn’t that be fine?
Weak argument? The set of times that people incidentally produce information relevant to other people is much broader than the set of times they sell it to them.