This is usually called “neuralese,” and my personal guess is that it’s just hard to get working. After all, it’s a fairly obvious idea, and no one seems to have gotten convincing results from it. This paper is arguably a negative data point, particularly given @wassname’s comment that the results haven’t even replicated on these cherry-picked benchmarks.
That is not so surprising, since the whole idea behind transformer training is in some sense a rejection of neuralese / recurrence: teacher forcing lets every position in a sequence be trained in parallel, whereas feeding continuous hidden states forward from step to step reintroduces a sequential dependency along the time axis.
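For concreteness, here is a minimal PyTorch sketch of that contrast (the toy dimensions and module choices are mine, not from the paper):

```python
import torch
import torch.nn as nn

vocab, d_model, seq_len, batch = 100, 32, 16, 4
tokens = torch.randint(vocab, (batch, seq_len))
embed = nn.Embedding(vocab, d_model)

# Standard transformer LM training: one parallel forward pass scores every
# position at once, since all target tokens are known in advance (teacher forcing).
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
hidden = layer(embed(tokens), src_mask=causal_mask)  # all positions in parallel

# "Neuralese"-style recurrence (GRU cell as a stand-in): the input at step t is
# the hidden state from step t-1, so the forward pass is an unavoidable
# sequential loop and can't be parallelized across the sequence.
rnn_cell = nn.GRUCell(d_model, d_model)
state = torch.zeros(batch, d_model)
for t in range(seq_len):
    state = rnn_cell(embed(tokens[:, t]), state)  # step t depends on step t-1
```

The point of the sketch is just that the recurrent variant serializes training over sequence length, which is exactly the cost transformers were designed to avoid.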