We could do auto captioning of movies and videos.
Or we could just train multimodal simulators. We probably will; for example, such models could be useful for generating videos from descriptions.