Are there any conclusions we can draw about what levels of scale and RLHF training are likely to be safe, and where the risks really take off? It might be useful to develop some guidelines like "it's relatively safe to widely deploy language models under 10^10 parameters and under 250 steps of RLHF training". (Most of the charts seem to show alarming trends starting around 10^10 parameters.) Based on these results alone, I think a world with even massive numbers of 10^10-parameter LLMs in deployment (think CAIS) would be much safer than a world with even a few 10^11-parameter models in use.
It's not the scale and the number of RLHF steps that we should use as the criteria for using or banning a model, but empirical observations about the model's beliefs themselves. A huge model can still be "safe" (below I explain why I put this word in quotes) if it doesn't hold a belief such as that it would be better off on this planet without humans. So what we urgently need to do is increase investment in interpretability and ELK tools, so that we can be quite certain whether models hold certain beliefs. That they will act in accordance with these beliefs is beyond question. (BTW, I don't believe at all in the possibility of some "magic" agency, undetectable in principle by interpretability and ELK, breeding inside an LLM with a relatively short training history, measured in the number of batches and backprop steps.)
Why I put "safe" in quotes when writing that the deployment of large models without "dangerous" beliefs is "safe": the social, economic, and political implications of such a decision could still be very dangerous, from a range of different angles, which I won't elaborate on here. The crucial point I want to emphasize is that even if the model itself is rather weak on the APS scale, we must not think of it in isolation, but consider the coupled dynamics between the model and its environment. In particular, if the model proves astonishingly lucrative for its creators and fascinating (addictive, if you wish) for its users, it's unlikely to be shut down even if it increases risks and is, on longer timescales, harmful to humanity overall. (Think of TikTok as the prototypical example of such a dynamic.) I wrote about this here.