I will try to more directly express the positive intuition for why all of this seems possible to me, that is, why I think a loss function over heuristic arguments that makes all the correct tradeoffs should exist.
Consider the process of SGD as a process of Bayesian model selection. We start with some prior over the possible weights of a model in some GPT architecture, then we update based on a series of data, and in the end we get some model. We might similarly have a bunch of objections to how such a model selection process could ever learn the data, e.g. that we don’t have enough parameters to memorize every fact like “apples fall down”, “pears fall down”, etc., so how will the model know when to try to compress these facts into an underlying theory? And for other things, like Barack Obama, how will the model learn to memorize that fact, but not the facts about fruits falling down? How is it possible to have a loss function that treats “Obama follows Barack” as an axiom, but “apples fall down” as a fact derived from some more general beliefs about gravity?
The answer is, of course, that we don’t really need to deal with any of that and we can just make loss go down, and if we’ve set up the learning problem correctly then SGD will magically do all these tradeoffs for us.
In the heuristic argument frame, the hope is thus less “we will find a loss function that somehow does all these tradeoffs in a way that magically works” and more that we can find some loss function over heuristic arguments that does the same thing SGD is in some sense already doing to find a model that compresses Common Crawl so well. That is, our loss function only needs to learn to treat “Obama follows Barack” as axiomatic insofar as SGD learns to treat “Obama follows Barack” as axiomatic.
And the hope is that if we do this correctly, then we can identify deceptive alignment: deceptive alignment is defined to be your model intentionally deceiving, and thus “model acts deceptively” is not, from the perspective of the model/SGD, an axiomatic fact. So as long as our loss function over heuristic arguments is properly “parallel” to SGD, it will not learn to treat “model acts deceptively” as axiomatic (because it will only treat things as axiomatic if the model/SGD treats them as axiomatic).
Another way of saying this is that SGD + architecture implicitly assigns some “probability” (probably not really in a way that is a distribution in any sense) to any fact F being “axiomatic” and uses data to learn which facts are axiomatic vs. not, and so the heuristic argument machinery must assign the same “probability” that facts are axiomatic and do the same kind of learning.
I still don’t see it, sorry. If I think of deep learning as an approximation of some kind of simplicity prior + updating on empirical evidence, I’m not very surprised that it solves the capacity allocation problem and learns a productive model of the world. [1] The price is that the simplicity prior doesn’t necessarily get rid of scheming. The big extra challenge for heuristic explanations is that you need to do the same capacity allocation in a way that scheming reliably gets explained (even though it’s not relevant for the model’s performance and doesn’t make things classically simpler), while no capacity is spent on explaining other phenomena that are not relevant for the model’s performance. I still don’t see at all how we can get the non-malign prior that can do that.
[1] Though I’m still very surprised that it works in practice.