Why does θ1 need to include part of the world model? Why not instead have θ1 be the parameters of the two heads, and θ2 be the parameters of the rest of the model?
This would mean that you can’t initialize θ2 to be equal to θ1, but I don’t see why that’s necessary in the first place—in particular it seems like the following generative model should work just fine:
$$P(\theta_1) \propto \exp\left(-\lVert \theta_1 - \theta_{1,\text{init}} \rVert^2\right)$$
$$P(\theta_2 \mid \theta_1) \propto \exp\left(-\lambda\, C(\theta_1, \theta_2) - \lVert \theta_2 - \theta_{2,\text{init}} \rVert^2\right)$$
(I’ll be thinking of this setup for the rest of my comment, as it makes more sense to me)
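To make the setup concrete: MAP inference under this generative model is just minimizing the summed negative log density. Here is a minimal numpy sketch of that objective; the quadratic stand-in for C and all variable names are my own, not from the post.

```python
import numpy as np

def consistency(theta1, theta2):
    # Stand-in consistency test: penalize disagreement between the two
    # parameter vectors. The real C would compare the two heads' answers.
    return float(np.sum((theta1 - theta2) ** 2))

def neg_log_posterior(theta1, theta2, theta1_init, theta2_init, lam):
    # Negative log of P(theta1) * P(theta2 | theta1), up to a constant:
    #   ||theta1 - theta1_init||^2 + lam * C(theta1, theta2)
    #     + ||theta2 - theta2_init||^2
    prior1 = float(np.sum((theta1 - theta1_init) ** 2))
    prior2 = float(np.sum((theta2 - theta2_init) ** 2))
    return prior1 + lam * consistency(theta1, theta2) + prior2

theta1_init = np.zeros(3)
theta2_init = np.ones(3)

# At the initializations themselves both prior terms vanish, so only the
# consistency term contributes: lam * C(theta1_init, theta2_init).
print(neg_log_posterior(theta1_init, theta2_init, theta1_init, theta2_init,
                        lam=2.0))  # → 6.0
```

Note that nothing here requires θ2 to be initialized equal to θ1; the two parameter vectors just get independent proximity penalties plus the shared consistency term.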
> When differentiating the consistency test C we should treat the intended head as fixed rather than differentiating through it. This removes SGD’s incentive to achieve consistency by e.g. making sure the world is simple and so all questions have simple answers.
Hmm, why is this necessary? It seems like the whole point of L(θ2) is to ensure that you have to learn a detailed world model that gets you the right answers. I guess as λ→∞, that doesn’t really help you, but really you shouldn’t have λ→∞ because you shouldn’t expect to be able to have C(θ1,θ2)→0.
(Also, shouldn’t that be L(θ1,θ2), since it is θ1 and θ2 together that compute answers to questions?)
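For concreteness, here is a toy sketch (my own construction, again with a quadratic stand-in for C, with θ1 feeding the intended head) of the stop-gradient the quoted passage describes, next to the naive version it rules out:

```python
import numpy as np

def grad_C(theta1, theta2, stop_grad_on_theta1=True):
    # Gradients of C(theta1, theta2) = ||theta1 - theta2||^2, written by hand.
    diff = theta1 - theta2
    g_theta2 = -2.0 * diff  # theta2 always receives a consistency gradient
    if stop_grad_on_theta1:
        # Intended head treated as a constant: no gradient flows to theta1
        # through C, so SGD can't satisfy C by reshaping theta1's world model.
        g_theta1 = np.zeros_like(theta1)
    else:
        # Naive version: theta1 is also pushed toward agreement.
        g_theta1 = 2.0 * diff
    return g_theta1, g_theta2

g1, g2 = grad_C(np.array([1.0, 2.0]), np.array([0.0, 0.0]))
# With the stop-gradient, g1 is identically zero while g2 is not.
```

Under the stop-gradient, the question raised above becomes whether zeroing θ1's consistency gradient is actually needed, given that the prediction loss already pressures the model toward a detailed world model.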