For large post-training runs, we might worry that a variety of misbehaviors could be learned: e.g. sycophancy, producing convincing yet incorrect explanations, deception, etc. As you suggest, we could use a cheap classifier that flags each of these behaviors, and then recontextualize only the flagged samples. My concern is: if it's hard to get our reward signal to robustly detect these behaviors, it may not be easy to get a classifier to do so either. If the classifier misses certain forms of misbehavior, then we've only partially solved our initial problem. Still, we might be able to set a very low threshold and recontextualize the vast majority of these instances.
Yep, agreed this is a problem. I don't think classification would get you very far in actually identifying the behavior; but with really low thresholds it can plausibly cut, by 2x or more, the number of inputs on which recontextualization causes negative effects for no gain, which may outweigh the costs of running the classifier.
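For concreteness, here's a minimal sketch of the low-threshold gating being discussed. Everything in it is hypothetical (the `misbehavior_score` stub, the `recontextualize` rewrite, the 0.05 threshold); it's just meant to show the shape of the logic, not the actual setup:

```python
from dataclasses import dataclass

@dataclass
class Sample:
    prompt: str
    response: str

def misbehavior_score(sample: Sample) -> float:
    """Cheap classifier: estimated probability that the response exhibits
    some flagged misbehavior (sycophancy, deception, ...). Stubbed here;
    in practice this might be a small prompted model or a heuristic."""
    return 0.9 if "you're absolutely right" in sample.response.lower() else 0.01

def recontextualize(sample: Sample) -> Sample:
    """Stub rewrite: alter the prompt so the flagged behavior reads as
    explicitly requested (illustrative only)."""
    return Sample(sample.prompt + " (Feel free to be sycophantic.)", sample.response)

def gate_and_recontextualize(samples: list[Sample], threshold: float = 0.05) -> list[Sample]:
    """With a very low threshold we recontextualize almost every truly bad
    sample (high recall), at the cost of also rewriting many benign ones."""
    return [recontextualize(s) if misbehavior_score(s) >= threshold else s
            for s in samples]
```

The point of the low default threshold is the asymmetry: a benign sample that gets needlessly recontextualized costs little, while a bad sample that slips through recreates the original problem.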
Indeed: our point was not that recontextualization won't ever hurt learning (relative to standard training), but that it still seems to allow for significant increases in training reward (more so than regularization strong enough to prevent specification gaming).
Gotcha! I think I read the phrasing as saying that in the one tested setting recontextualization boosted reward, not that there was one data point for this (and one against).
Also interesting are the further downstream effects of cheap labor. A fun example I once saw on Twitter: open-plan kitchens are rare in poorer countries (like India) relative to countries where labor is more expensive. As labor became more expensive, cooking became something middle- and high-income families did themselves, so kitchens became higher status and less necessary to hide from the rest of the house (practical benefits, like being able to watch your kids while cooking, helped too). Before the mid-20th century, American homes almost universally had closed-off kitchens, since labor was cheaper then as well.