Really enjoyed this!!
Quick question: What does the “% similarity” bar mean? It’s not obviously functional (GO-based) nor is it obviously structural. Several rounds of practice have been waylaid by me misinterpreting what it means for a protein to be 95% similar to the target...
Thanks for checking it out.
It took me many iterations to settle on the exact logic. At first, I used only the HiG2Vec GO similarity embedding. It did alright, but I didn’t like how proteins from the same family got wildly different scores based on just pathway participation or tissue expression. I added an ESM2 sequence-based embedding to tame this inconsistency. It also caused the “your guess is top-9 similar” hint to be arranged in order of increasing sequence similarity, which is a nice bonus for late-game triangulation.
I tried making a shared embedding out of the two separate ones, but ran into statistical issues with how differently I needed to normalize them. Instead, I opted to calculate intermediate “evidence strengths” for each embedding separately, and then combine them into a final similarity percentage in a way that highly rewards both “only similar by sequence” and “only functionally similar”. That way a player of any background has a chance to close in on the target using their own experience, whether that experience is in pathways or in structural families.
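To make the idea concrete, here is a minimal sketch of one way two per-embedding “evidence strengths” could be combined so that either one alone can produce a high score. This is not the game's actual formula. The rank-percentile normalization, the noisy-OR combination, and all function names below are assumptions chosen purely for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def evidence_strengths(similarities):
    """Map raw similarities to [0, 1] by rank percentile (assumed normalization).

    Ranking within each embedding sidesteps the problem of the two
    embeddings having very different raw similarity distributions.
    Requires at least two candidates.
    """
    order = sorted(range(len(similarities)), key=lambda i: similarities[i])
    strengths = [0.0] * len(similarities)
    for rank, idx in enumerate(order):
        strengths[idx] = rank / (len(similarities) - 1)
    return strengths


def combined_similarity(go_strength, seq_strength):
    """Noisy-OR combination (assumed): high if EITHER strength is high."""
    return 100.0 * (1.0 - (1.0 - go_strength) * (1.0 - seq_strength))
```

With this scheme, a guess with a functional strength of 0.9 and a sequence strength of 0.0 still lands at roughly 90%, and the same holds with the two roles swapped. A guess that is strong on both axes only climbs a little higher, so neither kind of expertise is required to get close.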