paulfchristiano comments on ARC’s first technical report: Eliciting Latent Knowledge

paulfchristiano 19 Dec 2021 21:59 UTC
LW: 7 AF: 5
For example, the “How we’d approach ELK in practice” section talks about combining several of the regularizers proposed by the “builder.” It also seems like you believe that combining multiple regularizers would create a “stacking” benefit, driving the odds of success ever higher.
This is because of the remark on ensembling: as long as we aren’t optimizing for scariness (or diversity for diversity’s sake), it seems like it’s way better to have tons of predictors and then see if any of them report tampering. So adding more techniques improves our chances of getting a win. And if the cost of fine-tuning a reporter is small relative to the cost of training the predictor, we can potentially build a very large ensemble relatively cheaply.
(Of course, having more techniques also helps because you can test many of them in practice and see which of them seem to really help.)
This is also true for data—I’d be scared about generating a lot of riskier data, except that we can just do both and see if either of them reports tampering in a given case (since they appear to fail for different reasons).
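To make the ensembling point concrete, here is a minimal sketch, not anything from the report itself. It assumes each reporter can be treated as a function that returns True when it reports tampering; the names `Reporter`, `ensemble_flags_tampering`, and `relative_ensemble_cost`, and the dict-shaped predictor state, are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Iterable

# Hypothetical type: a reporter maps some predictor state to True
# if it reports tampering. The dict-shaped state is an assumption.
Reporter = Callable[[dict], bool]


def ensemble_flags_tampering(reporters: Iterable[Reporter], state: dict) -> bool:
    """Flag the case if at least one reporter in the ensemble reports tampering."""
    return any(reporter(state) for reporter in reporters)


def relative_ensemble_cost(
    n_reporters: int, finetune_cost: float, predictor_cost: float
) -> float:
    """Cost of training the predictor plus n fine-tuned reporters,
    expressed as a multiple of the predictor's training cost."""
    return (predictor_cost + n_reporters * finetune_cost) / predictor_cost
```

Under this "flag if any reporter flags" rule, adding a reporter can only add flagged cases, which is why more techniques (or more training data variants) help as long as they fail for different reasons. The cost function illustrates the cheapness claim: with made-up numbers where fine-tuning costs 1/1000 of predictor training, 100 reporters add only about 10% to the total.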
It also seems like you believe that combining multiple regularizers would create a “stacking” benefit, driving the odds of success ever higher.
I believe this in a few cases (especially combining “compress the predictor,” imitative generalization, penalizing upstream dependence, and the kitchen sink of consistency checks) but mostly the stacking is good because ensembling means that having more and more options is better and better.
Right now, the writeup talks about possible worlds in which a given regularizer could be helpful, and possible worlds in which it could be unhelpful. I’d value more discussion of the intuition for whether each one is likely to be helpful, and in particular, whether it’s likely to be helpful in worlds where the previous ones are turning out unhelpful.
I don’t think the kind of methodology used in this report (or by ARC more generally) is very well-equipped to answer most of these questions. Once we give up on the worst case, I’m more inclined to do much messier and more empirically grounded reasoning. I do think we can learn some stuff in advance, but doing so requires getting really serious about it (while still learning from early experiments and mostly focusing on designing experiments) rather than taking potshots. This is related to a lot of my skepticism about other theoretical work.
I do expect the kind of research we are doing now to help with ELK in practice even if the worst case problem is impossible. But the particular steps we are taking now are mostly going to help by suggesting possible algorithms and difficulties; we’d then want to give those as one input into that much messier process in order to think about what’s really going to happen.
In this case, it seems like penalizing complexity, computation time, and ‘downstream variables’ (via rewarding reporters for requesting access to limited activations) probably make things worse. (I think this applies less to the last two regularizers listed.)
I think this is plausible for complexity and to a lesser extent for computation time. I don’t think it’s very plausible for the most exciting regularizers, e.g. a good version of penalizing dependence on upstream nodes or the versions of computation time that scale best (and are really trying to incentivize the model to “reuse” inference that was done in the AI model). I think I do basically believe the arguments given in those cases, e.g. I can’t easily see how translation into the human ontology can be more downstream than “use the stuff to generate observations then parse those observations.”