I think “video reasoning” could be an interesting approach as you say.
Like if there are 10 frames and no single frame shows a tennis racket, but if you play them real fast, a human could infer there being a tennis racket because part of the racket is in each frame.
I think “video reasoning” could be an interesting approach as you say.
Like if there are 10 frames and no single frame shows a tennis racket, but if you play them real fast, a human could infer there being a tennis racket because part of the racket is in each frame.