I’ve no idea what OpenAI actually does, but just as a matter of general probabilistic modeling, a model that has learned to predict the next word given previous words has also implicitly learned a model of the joint distribution of all words. (Since the joint probability of a, b, c is just P(a)P(b|a)P(c|a,b).) Given the joint distribution of all words, you can go backwards and deduce the conditional distribution of each word given the following words. Or you can get the conditional distribution of a word given all words both before and after. These conditional distributions are probably harder to get computationally than the forward conditionals that the model directly gives, but the computations are probably not completely infeasible.
So in theory there’s no benefit from training on the backwards sequence as well as the forward sequence, though in practice it’s conceivable that there could be (since the training procedure is no doubt only an approximation to an ideal statistical procedure, and this approximation might conceivably work better when training goes both ways, though off hand this seems unlikely).
I’ve no idea what OpenAI actually does, but just as a matter of general probabilistic modeling, a model that has learned to predict the next word given previous words has also implicitly learned a model of the joint distribution of all words. (Since the joint probability of a, b, c is just P(a)P(b|a)P(c|a,b).) Given the joint distribution of all words, you can go backwards and deduce the conditional distribution of each word given the following words. Or you can get the conditional distribution of a word given all words both before and after. These conditional distributions are probably harder to get computationally than the forward conditionals that the model directly gives, but the computations are probably not completely infeasible.
So in theory there’s no benefit from training on the backwards sequence as well as the forward sequence, though in practice it’s conceivable that there could be (since the training procedure is no doubt only an approximation to an ideal statistical procedure, and this approximation might conceivably work better when training goes both ways, though off hand this seems unlikely).