It’s not entirely clear why GPT-4 (possibly a 2e25 FLOPs model) or Gemini Ultra 1.0 (possibly a 1e26 FLOPs model) don’t work as autonomous agents, but it seems that they don’t. So it’s not clear that the next generation of LLMs built in a similar way will enable significant agency either. There are millions of AI GPUs currently being produced each year, and millions of GPUs can support at most a 1e28-1e30 FLOPs training run (one that doesn’t itself take years to complete). There’s (barely) enough text data for that.
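As a rough sanity check on the scale of such a run, here is a back-of-the-envelope sketch. The per-GPU throughput (1e15 FLOP/s), the 40% utilization, and the GPU counts and durations are all illustrative assumptions of mine, not figures from any actual training run.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def training_flops(num_gpus, peak_flops_per_gpu, utilization, years):
    """Total FLOPs delivered by a cluster over a training run."""
    return num_gpus * peak_flops_per_gpu * utilization * years * SECONDS_PER_YEAR

# All inputs below are assumed values for illustration only.
PEAK_FLOPS_PER_GPU = 1e15   # assumed peak throughput per GPU, in FLOP/s
UTILIZATION = 0.4           # assumed fraction of peak actually achieved

# Smaller scenario: 1 million GPUs for half a year.
low = training_flops(1e6, PEAK_FLOPS_PER_GPU, UTILIZATION, 0.5)
# Larger scenario: 5 million GPUs for a full year.
high = training_flops(5e6, PEAK_FLOPS_PER_GPU, UTILIZATION, 1.0)

print(f"smaller scenario: {low:.1e} FLOPs")
print(f"larger scenario:  {high:.1e} FLOPs")
```

Under these assumptions the two scenarios come out at roughly 6e27 and 6e28 FLOPs, so around the lower end of the 1e28-1e30 range. Getting to the upper end would take faster chips, more of them, or a longer run.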
GPT-2 would take about 1e20 FLOPs to train with modern methods. On the FLOPs log scale, that puts it further from GPT-4 than GPT-4 is from whatever is feasible to build in the near future without significant breakthroughs. So there are only about two more generations of LLMs in the near future if most of what changes is scale. It’s not clear that this is enough, and it’s not clear that this is not enough.
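The log-scale comparison can be checked directly. The sketch below uses the FLOPs estimates from the text. The step of about 100x compute (2 orders of magnitude) per generation is my own assumption, used only to turn the remaining gap into a number of generations.

```python
import math

def log10_gap(smaller, larger):
    """Distance between two compute budgets in orders of magnitude."""
    return math.log10(larger / smaller)

GPT2_FLOPS = 1e20        # GPT-2 with modern methods (estimate from the text)
GPT4_FLOPS = 2e25        # GPT-4 (estimate from the text)
FEASIBLE_LOW = 1e28      # lower end of the near-future feasible range
FEASIBLE_HIGH = 1e30     # upper end of the near-future feasible range

gap_past = log10_gap(GPT2_FLOPS, GPT4_FLOPS)            # about 5.3
gap_future_low = log10_gap(GPT4_FLOPS, FEASIBLE_LOW)    # about 2.7
gap_future_high = log10_gap(GPT4_FLOPS, FEASIBLE_HIGH)  # about 4.7

# Assumption: one "generation" is roughly a 100x increase in compute.
OOM_PER_GENERATION = 2.0
generations_low = gap_future_low / OOM_PER_GENERATION
generations_high = gap_future_high / OOM_PER_GENERATION

print(f"GPT-2 to GPT-4: {gap_past:.1f} orders of magnitude")
print(f"GPT-4 to feasible: {gap_future_low:.1f} to {gap_future_high:.1f} orders of magnitude")
print(f"remaining generations: {generations_low:.1f} to {generations_high:.1f}")
```

Even at the upper end of the feasible range, the remaining gap (about 4.7 orders of magnitude) is smaller than the gap already covered between GPT-2 and GPT-4 (about 5.3). Under the assumed 100x step, that leaves somewhere between one and a bit over two generations.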
With Sora, the underlying capability is not just video generation, it’s also video perception, looking at the world instead of dreaming of it. A sufficiently capable video model might be able to act in the world by looking at it in the same way a chatbot acts in a conversation by reading it. Models that can understand images are already giving new ways of specifying tasks and offering feedback on performance in robotics, and models that can understand video will only do this better.