I’m skeptical that we’ll see useful multimodal fusion from disparate datasets shoved into large multimodal models short of AGI; the relationships between the different modalities are central to actually learning anything, and many of the methods being used to accomplish multimodal “reasoning” route through general capabilities. This view is in large part informed by the expert views we elicited in this project—so I’d be very interested in your thoughts about either why you disagree here, or what you think the experts in that discussion (Sections 4.4/4.5) missed about the multimodal understanding of such models in the coming few years. (Or whether you’re primarily optimistic on an even longer timescale.)
I’m a bit confused—aren’t there already useful results from multimodal fusion? E.g., relevant to this article, there are papers demonstrating that combining genomic and H&E information leads to better predictions on cancer survival outcome tasks than H&E or genomic inputs alone.
The paper you’ve attached implies that (some) experts agree that multimodality is already fairly present in existing models:
Others highlighted that AI models trained on both sequence and structure have shown promising results in bioengineering applications, suggesting that multimodal integration is not an insurmountable barrier.
Also, this is a pedantic point, but I think there is a mistake in the PDF:
Several participants noted that significant advances in dynamic modeling are already underway. New AI-driven tools (such as BioEmu, Adaptyv Bio, and improved binding prediction models) have pushed beyond the limitations of earlier models (such as AlphaFold), suggesting that dynamic modeling is progressing rapidly (Lewis et al., 2024; Cotet et al., 2025).
At least as far as I can tell, Adaptyv Bio is not a tool/method, but rather a contract research organization (CRO) that can run expression, binding affinity, and thermostability assays :) It’s a very good CRO and one that I have used before, but they don’t seem related to dynamic modeling.
E.g., relevant to this article, there are papers demonstrating that genomic + H&E information leads to better predictions on cancer survival outcome tasks than H&E or genomic inputs alone.
Frankly, you can find a lot of such claims in the literature (and I believe some of them). But how many of these multimodal systems are currently used in the clinic? That’s the only metric that matters. I’m not even disagreeing with the premise that multimodal systems should be able to improve prognostic power in theory. But I am curious how well these systems work in practice.
On the first point, there’s a long way to go to get from the current narrow multimodal models for specific tasks to the type of general multimodal aggregation you seem to be suggesting.
On the second point, thank you—I think you are correct that it’s a mistake/poorly written, and I’m checking with the coauthor who wrote that section.