I haven’t watched the video, but are they using expected value at all, or just the most likely word? Accidentally using a suboptimal common word seems likely to produce a better translation than accidentally using a suboptimal uncommon word, so this effect might just be nudging their algorithm toward expected utility and away from raw probabilities.
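To make the argmax-vs-expected-utility distinction concrete, here is a toy sketch with entirely made-up words, probabilities, and utilities (none of this comes from the video; it only illustrates how the two criteria can disagree):

```python
# Hypothetical one-word translation choice. Probabilities and utilities
# below are invented for illustration only.
candidates = {"house": 0.45, "domicile": 0.55}  # P(word is the correct translation)

# Assumed utility: a wrong common word still leaves a readable sentence,
# while a wrong rare word is more jarring.
utility = {
    ("house", True): 1.0,      # correct
    ("house", False): 0.4,     # wrong but common: mild damage
    ("domicile", True): 1.0,   # correct
    ("domicile", False): 0.1,  # wrong and rare: worse damage
}

def expected_utility(word, p_correct):
    """Probability-weighted utility of emitting `word`."""
    return p_correct * utility[(word, True)] + (1 - p_correct) * utility[(word, False)]

most_likely = max(candidates, key=candidates.get)
best_eu = max(candidates, key=lambda w: expected_utility(w, candidates[w]))

print(most_likely)  # the raw-probability pick
print(best_eu)      # the expected-utility pick
```

With these numbers the raw-probability pick is "domicile" (0.55), but the expected-utility pick is "house" (0.45·1.0 + 0.55·0.4 = 0.67 versus 0.55·1.0 + 0.45·0.1 = 0.595), so favoring common words can mimic an expected-utility criterion even when the algorithm only manipulates probabilities.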
Two possible explanations come to mind:

1. They are defining “better” as something other than a higher expected number of correctly translated words.
2. The probabilities they’re using are biased, and this effect reverses the bias.