# Arthur Conmy

Karma: 307
• That’s true. I think the choice to mix in pretraining gradients probably has more effect on the end model than the overfit SFT model they start PPO with.

Huh, but Mysteries of mode collapse (and the update) were published before td-003 was released? How would you have ended up reading a post claiming td-002 was RLHF-trained when td-003 existed?

Meta note: it’s plausibly net positive that all the training details of these models have been obfuscated, but it’s frustrating how much energy has been sunk into speculation on The Way Things Work Inside OpenAI.

• I wasn’t trying to say the mode collapse results were wrong! I collected these results before finding crisper examples of mode collapse that I could build a useful interpretability project on. I also agree with the remarks made about the difficulty of measuring this phenomenon. I did try to use the OpenAI embeddings model to encode the various completions and then hopefully have the Euclidean distance be informative, but it seemed to predict large distances for similar completions, so I gave up. I also made a consistent color scheme and compared code-davinci, thanks for those suggestions.
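
A minimal sketch of that kind of check, assuming the pre-1.0 `openai` Python client and the `text-embedding-ada-002` model (both are assumptions for illustration, not necessarily the exact setup used):

```python
import numpy as np
import openai  # pre-1.0 client assumed; requires openai.api_key to be set

def embed(texts, model="text-embedding-ada-002"):
    """Embed a list of completions with the OpenAI embeddings endpoint."""
    response = openai.Embedding.create(model=model, input=texts)
    data = sorted(response["data"], key=lambda d: d["index"])
    return np.array([d["embedding"] for d in data])

completions = [
    "The answer is 97.",
    "The answer is ninety-seven.",  # similar meaning, different surface form
    "I like croissants.",
]

vectors = embed(completions)
# Pairwise Euclidean distances between completion embeddings.
distances = np.linalg.norm(vectors[:, None, :] - vectors[None, :, :], axis=-1)
print(np.round(distances, 3))
```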

I don’t get the impression that RLHF needs hacks to prevent mode collapse: the InstructGPT paper reports that overfitting led to better human-rater feedback, and the Anthropic HH paper mentions in passing that the KL penalty may be wholly irrelevant (!).
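
For reference, the KL penalty being discussed is the divergence term in the RLHF objective as written in the InstructGPT paper, with the SFT model as the reference policy (the last term is the pretraining-gradient mix from the PPO-ptx variant):

$$\text{objective}(\phi)=\mathbb{E}_{(x,y)\sim \pi_\phi^{\mathrm{RL}}}\!\left[r_\theta(x,y)-\beta\,\log\frac{\pi_\phi^{\mathrm{RL}}(y\mid x)}{\pi^{\mathrm{SFT}}(y\mid x)}\right]+\gamma\,\mathbb{E}_{x\sim D_{\text{pretrain}}}\!\left[\log \pi_\phi^{\mathrm{RL}}(x)\right]$$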

I’m not sure how to interpret the evidence from your first paragraph. You suggest that td-003 mode collapses where td-002 is perfectly capable. So you believe that both td-002 and td-003 mode collapse, in disjoint cases (given the examples from the original mode collapse post)?

• Thanks! This doesn’t seem to change the observations much, except that there doesn’t seem to be a case where this model has starkly the lowest entropy, as we found with davinci (see the entropy sketch below).

EDIT: I added code-davinci-002 as the main focus of the post, thanks!
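
A minimal sketch of the kind of entropy comparison discussed above: an empirical estimate over repeated samples, assuming the pre-1.0 `openai` client; the model list and prompt are illustrative rather than the post’s exact setup.

```python
import math
from collections import Counter
import openai  # pre-1.0 client assumed; requires openai.api_key to be set

def completion_entropy(model, prompt, n=100, max_tokens=1, temperature=1.0):
    """Empirical Shannon entropy (in bits) of sampled completions for one prompt."""
    response = openai.Completion.create(
        model=model, prompt=prompt, n=n,
        max_tokens=max_tokens, temperature=temperature,
    )
    counts = Counter(choice["text"] for choice in response["choices"])
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

prompt = "Pick a random number between 1 and 100: "
for model in ["davinci", "code-davinci-002", "text-davinci-002", "text-davinci-003"]:
    print(model, round(completion_entropy(model, prompt), 2))
```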

# RLHF does not appear to differentially cause mode-collapse

20 Mar 2023 15:39 UTC
90 points
• its top singular vector encodes what we think are the *least* frequent tokens.

I spot “GoldMagikarp” and “Skydragon”, and we now know these are indeed very infrequent tokens! This was good evidence for SolidGoldMagikarp lurking in plain sight : )
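
A minimal sketch of how one can reproduce this kind of check, assuming GPT-2’s token embedding matrix via HuggingFace `transformers` as a stand-in (the post’s exact model and singular-vector convention may differ):

```python
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

# Token (input) embedding matrix, shape [vocab_size, d_model].
W_E = model.wte.weight.detach()

# SVD of the embedding matrix; Vh[0] is the top right singular vector.
U, S, Vh = torch.linalg.svd(W_E, full_matrices=False)
projections = W_E @ Vh[0]  # projection of each token embedding onto that direction

# Tokens with the most extreme projections onto the top singular direction.
top = torch.topk(projections.abs(), k=20).indices
print([tokenizer.decode([i]) for i in top.tolist()])
```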

• I think this point was really overstated. I get the impression the rejected papers were basically converted into arXiv format as fast as possible, so it was easy for the mods to tell. However, I’ve seen submissions to cs.LG like this and this that are clearly from the alignment community. These posts are also not stellar by the standards of preprint formatting, and were apparently not rejected.

• I had the same reaction: the statement of this effect seemed like a bizarre framing. @afspies’s comment was helpful, and I don’t think the claim is as bizarre now.

(though overall I don’t think this post is a useful contribution because it is more likely to confuse than to shed light on LMs)

• I meant your first point.

Regarding the claim that finetuning on data with property $P$ will lead models to ‘understand’ (scare-quotes omitted from now on...) both $P$ and not $P$ better, thanks. I see better where the post is coming from.

However, I don’t necessarily think that we get easier elicitation of not $P$. There are reasons to believe finetuning is simply re-steering the base model rather than changing its understanding at all; for example, there are far more training steps in pretraining than in finetuning. Even if finetuning is shaping a model’s understanding of $P$, in an RLHF setup you’re generally seeing two responses, one with less $P$ and one with more $P$, and I’m not sure I buy that the model’s inclination to output not $P$ responses can increase when there are no gradients from not $P$ cases. There are such gradients in red-teaming setups, though. I think the author should register predictions in advance and then blind-test various base models and finetuned models for the Waluigi Effect.
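
For concreteness, the “two responses” step is the pairwise comparison loss used to train the reward model in the InstructGPT-style setup, where $y_w$ and $y_l$ are the preferred and rejected responses:

$$\mathcal{L}(\theta) = -\mathbb{E}_{(x,\, y_w,\, y_l)\sim D}\left[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\right]$$

The rejected response only enters training through the reward model; during PPO the policy itself only receives gradients through $r_\theta$ evaluated on its own samples.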

• It is open-sourced here, and there is material from REMIX for getting used to the codebase here.

• The Waluigi Effect: After you train an LLM to satisfy a desirable property $P$, then it’s easier to elicit the chatbot into satisfying the exact opposite of property $P$.

I’ve tried several times to engage with this claim, but it remains dubious to me and I didn’t find the croissant example enlightening.

Firstly, I think there is only weak evidence that training on properties makes opposite behavior easier to elicit. I believe this claim is largely based on the Bing Chat story, which may have these properties due to bad finetuning rather than because these finetuning methods cause the Waluigi effect. I think ChatGPT is an example of finetuning making these models more robust to prompt attacks (example).

Secondly (and relatedly), I don’t think this article does enough to disentangle the effect of capability gains from the Waluigi effect. As models become more capable both in pretraining (understanding subtleties in language better) and in finetuning (lowering the barrier of entry for the prompting required to get useful outputs), they will become easier to jailbreak with stranger prompts.

# OpenAI introduce ChatGPT API at 1/10th the previous \$/token

1 Mar 2023 20:48 UTC
28 points