I didn’t realize that’s what happened. I assumed it was accidentally pretrained on similar evals. But you’re right. It looks like they did try to aim alignment training at important behavior in pretty much the way I’m suggesting.
This seems like really bad news for alignment in general. If targeted training backfires like this, then the alternative of avoiding training on the important stuff and hoping the unimportant stuff generalizes to cover it doesn't sound like the greatest strategy either.
I wonder if one issue is that they're training on relatively few examples of ethical behavior in complex situations. Those few examples aren't very diverse in form, so the model is likely to recognize them as training/evals. Doing more training with more diversity of context (the second standard approach) should help.
But I’m not sure how much. It does seem like the attempt pretty much backfired. And that’s pretty bad news.