I’ve recently been interested in understanding generalization. So I spent some time reading this paper, which David Africa has sent me a few times.
Tl;dr: suppose you’re interested in how a given parametric function will generalize to new examples. You don’t have the new examples yet; all you have is some existing dataset of samples. You can calculate (i) the empirical performance, e.g. accuracy, on the examples you have, and (ii) various summary statistics of the parameters. The question is how well you can predict the “generalization gap” on the new examples.
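To make the setup concrete, here’s a minimal sketch of the two ingredients, assuming a trained PyTorch model; the `accuracy` helper and the particular statistics are my own illustrative choices, not anything from the paper:

```python
# Minimal sketch of the setup, assuming a trained PyTorch model and an
# accuracy() helper you'd write yourself; nothing here is from the paper.
import torch

def param_stats(model: torch.nn.Module) -> dict:
    # Cheap summary statistics of the parameters, of the kind used as
    # generalization heuristics (e.g. norm-based measures).
    flat = torch.cat([p.detach().flatten() for p in model.parameters()])
    return {"l2_norm": flat.norm(2).item(), "num_params": flat.numel()}

# Generalization gap: performance on the data you trained on minus
# performance on held-out data. The game is predicting this quantity
# from param_stats alone, without access to the held-out data.
# gap = accuracy(model, train_loader) - accuracy(model, test_loader)
```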
I’m not going to focus much on the object-level takeaways, because I think there are a bunch of caveats that make them not very interesting: they only consider in-distribution generalization (experiments on the train/test split of CIFAR-100); they only study models trained to a fixed loss threshold (an arbitrary-seeming choice); and the models are convolutional nets on CIFAR-100 (pretty small, not language models). Overall it’s unclear which insights survive the jump to the “frontier model alignment” setting.
But I think the general playbook here might be worth replicating on language models: construct a large, diverse population of models, empirically measure how well they generalize, and then see how well various heuristics predict this (see the sketch below).
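As a sketch of what the population-level evaluation might look like: synthetic data stands in for a real model population, the sklearn calls are standard, and everything else (shapes, heuristic names) is assumed for illustration.

```python
# Sketch of the population-level playbook with synthetic stand-in data.
# In a real replication, `stats` would hold each trained model's heuristic
# summary statistics and `gaps` its empirically measured generalization gap.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_models, n_heuristics = 500, 8  # illustrative sizes

# One row per trained model: its heuristic summary statistics...
stats = rng.normal(size=(n_models, n_heuristics))
# ...and a fake measured gap that is (noisily) linear in the heuristics.
gaps = stats @ rng.normal(size=n_heuristics) + 0.1 * rng.normal(size=n_models)

# How much variance in the measured gaps do the heuristics explain?
scores = cross_val_score(LinearRegression(), stats, gaps, cv=5, scoring="r2")
print(f"cross-validated R^2: {scores.mean():.2f} +/- {scores.std():.2f}")
```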
The goal would be to build empirical laws of (alignment) generalization that we believe apply to frontier models. I guess this is similar in spirit to Apollo’s science of scheming agenda.
Purely conceptual, but I found Toby Ord’s ‘Interpolation, Extrapolation, Hyperpolation’ a thought-provoking high-level lens on generalization.