For large post-training runs, we might worry that a variety of misbehaviors could be learned: e.g. sycophancy, producing convincing yet incorrect explanations, deception, etc. As you suggest, we could use a cheap classifier that flags each of these behaviors, and then recontextualize only the flagged samples. My concern is: if it’s hard to get our reward signal to robustly detect these behaviors, it may not be easy to get a classifier to do so either. If it misses certain forms of misbehavior, then we’ve only partially solved our initial problem. That said, we might be able to set a very low threshold and still recontextualize the vast majority of these instances.
Yep, agreed this is a problem. I don’t think classification would get you very far for actually identifying the behavior, but with really low thresholds it can plausibly cut the number of inputs on which recontextualization causes negative effects for no gain by 2x or more, which may outweigh the costs of classification.
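For concreteness, here’s a rough sketch of the kind of gating I have in mind. Everything here is a hypothetical placeholder (the `misbehavior_score` classifier, the `recontextualize` step, and the specific threshold), not something from the post:

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    prompt: str
    response: str


def gate_recontextualization(
    samples: List[Sample],
    misbehavior_score: Callable[[Sample], float],  # cheap classifier, scores in [0, 1]
    recontextualize: Callable[[Sample], Sample],   # whatever rewrite the pipeline applies
    threshold: float = 0.05,                       # deliberately low to limit missed cases
) -> List[Sample]:
    """Recontextualize only the samples the cheap classifier flags.

    A low threshold trades extra false positives for fewer missed instances of
    misbehavior, while leaving clearly clean samples untouched.
    """
    out = []
    for s in samples:
        if misbehavior_score(s) >= threshold:
            out.append(recontextualize(s))
        else:
            out.append(s)
    return out
```

The point of the low default threshold is exactly the trade-off above: we accept recontextualizing many clean samples in exchange for catching most of the flagged behaviors.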
Indeed: our point was not that recontextualization won’t ever hurt learning (relative to standard training), but that it still seems to allow for significant increases in training reward (more so than regularization that is strong enough to prevent specification gaming).
Gotcha! I think I read the phrasing as saying that in the one tested setting recontextualization boosted reward, not that there was one data point for this (and one data point against).