Nathan Helm-Burger comments on dalle2 comments

Nathan Helm-Burger 26 Apr 2022 21:28 UTC
3 points
0
Seems to me that what CLIP needs is a secondary training regime in which each image is a 2D render of a 3D scene produced by a simulator, and the same simulator also generates one correct caption and several incorrect ones. Like: red vase on a brown table (correct), blue vase on a brown table (incorrect), red vase on a green table (incorrect), red vase under a brown table (incorrect). Then do the CLIP training with the text set deliberately including these near-miss captions in addition to the usual random incorrect caption samples. I saw this idea in a paper a few years back, but I'm not sure how to find it now, since I can’t seem to guess the right keywords to get Google to come up with it. There’s a lot of work related to this idea out there, though, for example: https://developer.nvidia.com/blog/sim2sg-generating-sim-to-real-scene-graphs-for-transfer-learning/
Do you think that would fix CLIP’s not-precisely-the-right-object in not-precisely-the-right-positional-relationship problem? And if the simulated data also contained labeled text, maybe it would fix the incoherent text problem too?
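To make the proposal concrete, here's a rough sketch in plain Python of the two pieces: building the near-miss captions from a scene description, and an image-to-text contrastive loss where those near-misses sit alongside the usual random negatives. Everything named here (`make_captions`, `image_to_text_loss`, the swap values, the 0.07 temperature default) is my own assumption for illustration, not anything from CLIP's actual code. The similarity scores are just numbers you pass in, standing in for image/caption embedding similarities, and only the image-to-text direction is shown.

```python
import math


def make_captions(obj_color, obj, relation, support_color, support,
                  alt_obj_color="blue", alt_support_color="green",
                  alt_relation="under"):
    """Return (correct_caption, hard_negatives) for one simulated scene.

    Each hard negative changes exactly one attribute of the correct caption.
    The alt_* defaults are hypothetical swap values chosen for illustration.
    """
    def render(c1, rel, c2):
        return f"{c1} {obj} {rel} a {c2} {support}"

    correct = render(obj_color, relation, support_color)
    hard_negatives = [
        render(alt_obj_color, relation, support_color),   # wrong object color
        render(obj_color, relation, alt_support_color),   # wrong support color
        render(obj_color, alt_relation, support_color),   # wrong spatial relation
    ]
    return correct, hard_negatives


def image_to_text_loss(sim_correct, sim_random, sim_hard, temperature=0.07):
    """Cross-entropy of picking the correct caption for one image.

    sim_correct: similarity between the image and its correct caption.
    sim_random:  similarities to the usual random (in-batch) wrong captions.
    sim_hard:    similarities to the simulator's near-miss captions.
    """
    logits = [s / temperature for s in [sim_correct, *sim_random, *sim_hard]]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[0]


correct, hard = make_captions("red", "vase", "on", "brown", "table")
print(correct)
print(hard)
# Made-up similarity scores, only to show the effect of the extra negatives.
print(image_to_text_loss(0.30, [0.10, 0.05], []))
print(image_to_text_loss(0.30, [0.10, 0.05], [0.28, 0.27, 0.29]))
```

The point of the second call is that near-miss captions which score almost as high as the correct one add a lot to the loss, so the gradient pushes specifically on color binding and spatial relations rather than on telling a vase apart from a random unrelated caption.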
- Nathan Helm-Burger 26 May 2022 4:31 UTC
  1 point
  0
  Parent
  Here’s some work by folks at DeepMind looking at models’ understanding of relations (verbs) versus subjects and objects. Kinda relevant to the type of misunderstanding CLIP tends to exhibit. https://www.deepmind.com/publications/probing-image-language-transformers-for-verb-understanding