The smooth graphs seem like good evidence that there are much smoother underlying changes in the model, and that the abruptness of the change is about behavior or evaluation rather than what gradient descent is learning.
If we’re trying to predict abrupt changes in the accuracy of output token sequences, the per-token log-likelihood can be a useful signal. What’s the analogous signal when we’re talking about abrupt changes in a model’s ability to deceptively conceal capabilities, hack GPU firmware, etc.? What log-likelihood plots can we use to predict those types of abrupt changes in behavior?
Here, I think we’ll want to look for suspicious changes in the log-likelihood trends. E.g., it’s a red flag if we see steady increases in log-likelihood on some scary behavior, but then the trend reverses at some level of model scale.
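To make the "trend reversal" heuristic concrete, here's a minimal sketch of the kind of check one might run over scaling data. Everything here is illustrative: the function name, the data, and the assumption that we already have mean per-token log-likelihoods of some scary behavior measured at several model scales.

```python
# Hypothetical sketch: flag a reversal in a log-likelihood-vs-scale trend.
# `trend` maps model scale (e.g. parameter count) to mean per-token
# log-likelihood of some target behavior; all names and numbers are made up.

def find_trend_reversal(trend):
    """Return the scale at which a previously rising log-likelihood
    trend starts falling, or None if it never reverses."""
    points = sorted(trend.items())  # order measurements by model scale
    rising = False
    for (_, prev_ll), (scale, ll) in zip(points, points[1:]):
        if ll > prev_ll:
            rising = True           # trend is (still) increasing
        elif rising and ll < prev_ll:
            return scale            # red flag: steady rise, then a drop
    return None

# Example: log-likelihood climbs with scale, then dips at the 10B scale --
# the suspicious pattern described above.
trend = {1e8: -3.2, 1e9: -2.5, 1e10: -2.9, 1e11: -3.1}
print(find_trend_reversal(trend))  # prints 10000000000.0 (the 10B scale)
```

In practice you'd presumably want something less brittle than a pairwise comparison (e.g., fitting the trend and testing for a break), but the point is just that a reversal like this is cheap to detect automatically once the log-likelihood curves are being tracked.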