My take would be that the text vector encodes a distribution over images, so it contains a lot of image-ness and constrains the image generation conditioned on this vector a lot in what the image looks like.
The image vector encodes a distribution over captions so it contains a lot of caption-ness and strongly constrains the topic of the generated image but leaves the details of the image much less constrained.
Therefore, at least from my position of relative ignorance, it makes sense that the image vector is better for diversity.
My take would be that the text vector encodes a distribution over images, so it contains a lot of image-ness and constrains the image generation conditioned on this vector a lot in what the image looks like.
The image vector encodes a distribution over captions so it contains a lot of caption-ness and strongly constrains the topic of the generated image but leaves the details of the image much less constrained.
Therefore, at least from my position of relative ignorance, it makes sense that the image vector is better for diversity.