Seems to me that what CLIP needs is a secondary training regime, where it has an image generated as a 2d render of a 3d scene that is generated from a simulator which can also generate a correct and an several incorrect captions. Like: red vase on a brown table (correct), blue vase on a brown table (incorrect), red vase on a green table (incorrect), red vase under a brown table (incorrect). Then do the CLIP training with the text set deliberately including the inappropriate text samples in addition to the usual random incorrect caption samples. I saw this idea in a paper a few years back, not sure how to find that paper now since I can’t seem to guess the right keywords to get Google to come up with it. But there’s a lot of work related to this idea out there, for example: https://developer.nvidia.com/blog/sim2sg-generating-sim-to-real-scene-graphs-for-transfer-learning/
Do you think that would fix CLIP’s not-precisely-the-right-object in not-precisely-the-right-positional-relationship problem? Maybe also, if the simulated data contained labeled text, then it would also fix the incoherent text problem?
Seems to me that what CLIP needs is a secondary training regime, where it has an image generated as a 2d render of a 3d scene that is generated from a simulator which can also generate a correct and an several incorrect captions. Like: red vase on a brown table (correct), blue vase on a brown table (incorrect), red vase on a green table (incorrect), red vase under a brown table (incorrect). Then do the CLIP training with the text set deliberately including the inappropriate text samples in addition to the usual random incorrect caption samples. I saw this idea in a paper a few years back, not sure how to find that paper now since I can’t seem to guess the right keywords to get Google to come up with it. But there’s a lot of work related to this idea out there, for example: https://developer.nvidia.com/blog/sim2sg-generating-sim-to-real-scene-graphs-for-transfer-learning/
Do you think that would fix CLIP’s not-precisely-the-right-object in not-precisely-the-right-positional-relationship problem? Maybe also, if the simulated data contained labeled text, then it would also fix the incoherent text problem?
Here’s some work by folks at deepmind looking at model’s relational understanding (verbs) vs subjects and objects. Kinda relevant to the type of misunderstanding CLIP tends to exhibit. https://www.deepmind.com/publications/probing-image-language-transformers-for-verb-understanding