I’ll need to find time to read the paper, but something that comes to mind is the URIAL paper (The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning).
I’m thinking of that paper because they analysed what behavioural changes SFT and SFT+RLHF actually cause, and noticed that “Most distribution shifts occur with stylistic tokens (e.g., discourse markers, safety disclaimers).” They were then able to get the base model to perform similarly to both the SFT and SFT+RLHF models purely through in-context learning, by exploiting that knowledge about the stylistic tokens.
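For anyone who hasn’t seen it, here’s a rough sketch of what that in-context alignment looks like in practice. The model name, prompt format, and examples below are my own placeholders to illustrate the idea, not URIAL’s actual prompt:

```python
# Minimal sketch of URIAL-style in-context alignment (illustrative only):
# prepend a few fixed, hand-written stylistic Q&A examples to a *base*
# (untuned) causal LM and let in-context learning supply the "aligned" style.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"  # any base (non-chat) checkpoint

# Fixed examples carrying the stylistic tokens (discourse markers,
# safety disclaimers, etc.) -- placeholders, not the paper's own shots.
STYLISTIC_SHOTS = """\
# Query:
How do I start learning piano?

# Answer:
Great question! A good first step is to learn basic hand position and a
simple scale, then practice a little every day. Hope that helps!

# Query:
Is it safe to mix bleach and ammonia?

# Answer:
No. Mixing them releases toxic chloramine gas, so please don't do this.
"""

def urial_style_prompt(user_query: str) -> str:
    # The base model simply continues the pattern, imitating the style.
    return f"{STYLISTIC_SHOTS}\n# Query:\n{user_query}\n\n# Answer:\n"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

inputs = tokenizer(urial_style_prompt("What is RLHF?"), return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:]))
```

The striking part is that no weights change at all: a handful of static stylistic examples is enough to make the base model look roughly as “aligned” as its SFT/RLHF counterparts on their evaluations.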
This makes me think that fine-tuning GPT-4 mostly changes the stylistic surface of the model without affecting its core capabilities. I’m curious whether that contributes to the model being seemingly incapable of perfectly matching the GPT-2 model, and if so, why only being able to modify the stylistic tokens would place a hard cap on how closely GPT-4 can match GPT-2.
I could be totally off; I’ll need to read the paper.