gwern comments on LLMs as amplifiers, not assistants

gwern 21 Jun 2025 20:07 UTC
14 points

use a trick discovered by Janus to get Claude Opus 4 to act more like a base model and drop its “assistant” persona

Have you or Janus done anything more rigorous to check to what extent you are getting ‘the base model’, rather than ‘the assistant persona pretending to be a base model’? This is something I’ve noticed with jailbreaks or other tweaks: you may think you’ve changed the bot persona, but it’s really just playing along with you, and will not be as good as a true base model (even if it’s at least stylistically superior to the regular non-roleplaying bot output).

This is important because a lot of the value of a base model is the rare completions & tails, which cover so much of the distribution, but that is also where a fake base model will still be bad in the chatbot way while looking superficially like a base model. You simply have mode-collapse on a higher level with the pseudo-base persona.

Have you run any of the mode-collapse tests, like generating surnames or random numbers? Or looked at the logits to see whether they exhibit the usual extreme skew of an RL-tuned chatbot, or whether they match a reference base model’s logits on text samples better than a tuned bot’s do? Or checked whether the output gets more diverse if you add a bunch of random Common Crawl excerpts to the prompt?
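As a rough illustration of what such checks could look like, here is a minimal sketch, not anyone's actual test harness. It assumes you have already collected many independent completions for a prompt like "pick a random number" or "a random surname", and optionally next-token probabilities from two models. The function names (`diversity_stats`, `kl_divergence_bits`) and the sample data are hypothetical, made up for the example.

```python
import math
from collections import Counter


def diversity_stats(samples):
    """Summarize how concentrated a list of sampled completions is."""
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(samples)
    n = len(samples)
    probs = [c / n for c in counts.values()]
    # Empirical Shannon entropy in bits; low entropy suggests mode collapse.
    entropy_bits = -sum(p * math.log2(p) for p in probs)
    return {
        "n": n,
        "distinct": len(counts),
        "top1_share": max(probs),  # fraction taken by the single most common output
        "entropy_bits": entropy_bits,
    }


def kl_divergence_bits(p, q):
    """KL(p || q) in bits for two next-token distributions (dicts: token -> prob).

    Assumes q[token] > 0 wherever p[token] > 0.
    """
    return sum(pv * math.log2(pv / q[t]) for t, pv in p.items() if pv > 0)


# Hypothetical data standing in for real model outputs.
collapsed = ["42"] * 70 + ["7"] * 20 + ["37"] * 10  # chatbot-like skew
spread = [str(i) for i in range(100)]               # idealized uniform sampler

print(diversity_stats(collapsed))
print(diversity_stats(spread))
```

The comparison of interest is relative: the pseudo-base persona's numbers next to the ordinary assistant's and next to a real base model's on the same prompts. For the logit question, `kl_divergence_bits` would be evaluated per token position against the reference base model, which only works if the tokenizers are compatible or the comparison is restricted to a shared candidate set.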