in particular how much causal scrubbing can be turned into an exploratory tool to find circuits rather than just to verify them
I’d like to flag that this has been pretty easy to do. For instance, the process can look like this: resample-ablate different nodes of the computational graph (e.g., each attention head or MLP), find the nodes whose ablation most hurts the model’s performance (and which are hence important), and then recursively search for nodes relevant to the current set of important nodes by ablating the nodes upstream of each one.
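To make the recursive search concrete, here is a minimal sketch on a made-up five-node graph rather than a real transformer. Everything in it is an assumption for illustration: the `GRAPH` layout, the helper names (`run`, `ablation_effect`, `find_circuit`), the 0.1 threshold, and the choice to start the recursion at the output node and only ablate direct parents. A real model would need hooks on attention heads and MLPs and some care about which paths are being ablated.

```python
import random

# Hypothetical toy graph: node name -> (parent names, function of parent values).
GRAPH = {
    "a": (["x0"], lambda x0: 2.0 * x0),
    "b": (["x1"], lambda x1: x1 + 1.0),
    "c": (["x1"], lambda x1: 0.0 * x1),  # constant output, so ablating it changes nothing
    "d": (["a", "b"], lambda a, b: a * b),
    "out": (["d", "c"], lambda d, c: d + c),
}

def run(inputs, patch=None):
    """Evaluate every node on one input; `patch` overrides node outputs by name."""
    patch = patch or {}
    acts = dict(inputs)
    for name, (parents, fn) in GRAPH.items():  # dict order is topological here
        if name in patch:
            acts[name] = patch[name]
        else:
            acts[name] = fn(*(acts[p] for p in parents))
    return acts

def ablation_effect(node, target, data, rng):
    """Mean change in `target` when `node` is replaced by its value on a resampled input."""
    total = 0.0
    for clean in data:
        other = rng.choice(data)              # resample a different input
        resampled_act = run(other)[node]      # node's activation on that input
        ablated = run(clean, patch={node: resampled_act})[target]
        total += abs(run(clean)[target] - ablated)
    return total / len(data)

def find_circuit(target, data, rng, threshold=0.1, found=None):
    """Recursively collect upstream nodes whose ablation changes `target`."""
    found = set() if found is None else found
    for parent in GRAPH[target][0]:
        if parent not in GRAPH or parent in found:  # stop at raw inputs
            continue
        if ablation_effect(parent, target, data, rng) > threshold:
            found.add(parent)
            find_circuit(parent, data, rng, threshold, found)
    return found

rng = random.Random(0)
data = [{"x0": rng.uniform(-1, 1), "x1": rng.uniform(-1, 1)} for _ in range(200)]
circuit = find_circuit("out", data, rng)
print(sorted(circuit))  # ['a', 'b', 'd']; 'c' is dropped because ablating it has no effect
```

Note that each recursive call measures the effect on the current important node rather than on the final output, which is one way to read "relevant to the current set of important nodes".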
Thanks for the feedback and corrections! You’re right, I was definitely confusing IRL, which is one approach to value learning, with the value learning project as a whole. I think you’re also right that most of the “Outer alignment concerns” section doesn’t really apply to RLHF as it’s currently written, or at least it’s not immediately clear how it does. Here’s another attempt:
RLHF attempts to infer a reward function from human comparisons of task completions. But a reward function learned from these stated preferences might not be the “actual” reward function: even if we could perfectly predict the human preference ordering on the training set of task completions, it’s hard to guarantee that the learned reward model will generalize to all task completions. We also have to consider that the stated human preferences might be irrational; they could be intransitive (cyclical), for instance. It seems possible to me that a reward model learned from human feedback still has to account for human biases, just as a reward function learned through IRL does.
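As a small illustration of the intransitivity point, here is a sketch that fits one scalar reward per completion to pairwise comparisons. It assumes a Bradley-Terry model, where the probability that the winner beats the loser is sigmoid(r_winner - r_loser); that is a common choice for reward modelling, but it is my assumption here, and the function name, learning rate, and three-item datasets are all made up.

```python
import math

def fit_bradley_terry(comparisons, items, lr=0.1, steps=500):
    """Fit scalar rewards by gradient ascent on the Bradley-Terry log-likelihood."""
    r = {item: 0.0 for item in items}
    for _ in range(steps):
        grad = {item: 0.0 for item in items}
        for winner, loser in comparisons:
            # Model's current probability that `winner` is preferred to `loser`.
            p_win = 1.0 / (1.0 + math.exp(-(r[winner] - r[loser])))
            grad[winner] += 1.0 - p_win
            grad[loser] -= 1.0 - p_win
        for item in items:
            r[item] += lr * grad[item]
    return r

items = ["A", "B", "C"]
# Each pair is (preferred, rejected).
transitive = fit_bradley_terry([("A", "B"), ("B", "C"), ("A", "C")], items)
cyclic = fit_bradley_terry([("A", "B"), ("B", "C"), ("C", "A")], items)

print(transitive)  # rewards ordered A > B > C
print(cyclic)      # all rewards stay at 0.0: no scalar reward reproduces the cycle
```

With the cyclic comparisons the gradients cancel exactly, so the learned reward is indifferent between all three completions. A scalar reward model cannot represent the stated preferences in that case, so whatever it learns is already an interpretation of them.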
How’s that for a start?