I assume you mean ‘won’t generalize to answering questions about both modalities’, and that’s false.
Oops, my wording was confusing. I was imagining something like having a transformer which can take in both text tokens and image tokens (patches), but each training sequence is either only images or only text. (Let’s also suppose we strip text out of images for simplicity.)
Then, we generalize to a context which has both images and text and ask the model “How many dogs are in the image?”
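To make the setup concrete, here is a minimal sketch of what I mean (all names and dimensions here are hypothetical, just for illustration): a shared embedding space that both text tokens and image patches map into, where every training sequence is unimodal but the test-time context mixes the two.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = 1000   # hypothetical text vocabulary size
D = 64         # shared embedding width the transformer operates on
PATCH_DIM = 48 # flattened size of one image patch (e.g. 4x4x3)
N_PATCHES = 16 # patches per image

# Hypothetical lookup table / projection (assumed, not from any real model):
text_embed = rng.normal(size=(VOCAB, D))
patch_proj = rng.normal(size=(PATCH_DIM, D))

def embed_text(token_ids):
    """A text-only sequence: list of token ids -> (seq_len, D) embeddings."""
    return text_embed[np.asarray(token_ids)]

def embed_image(patches):
    """An image-only sequence: (N_PATCHES, PATCH_DIM) patches -> (N_PATCHES, D)."""
    return patches @ patch_proj

# During training, each sequence is a single modality:
text_seq = embed_text([5, 17, 3])                          # text only
image_seq = embed_image(rng.normal(size=(N_PATCHES, PATCH_DIM)))  # image only

# At test time, the context mixes both: an image followed by the question.
question_ids = [12, 40, 7, 9, 2]  # stand-in ids for "How many dogs are in the image?"
mixed_context = np.concatenate([image_seq, embed_text(question_ids)], axis=0)

print(text_seq.shape)       # (3, 64)
print(image_seq.shape)      # (16, 64)
print(mixed_context.shape)  # (21, 64)
```

The question is then whether a model trained only on the unimodal sequences handles `mixed_context` sensibly.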