[Linkpost] New multi-modal DeepMind model fusing Chinchilla with images and videos

Seems to be flying under the radar so far. Maybe because at first glance it looks like incremental progress, similar to what Aleph Alpha, for example, has done in continuing the Frozen approach.

However, judging by the (possibly cherry-picked) examples, it looks to me a lot like the image/video/text GPT-4 many are expecting.

Blogpost here. Paper here.