cdt comments on I made Geneguessr

cdt 22 Dec 2025 15:34 UTC
2 points
Really enjoyed this!!
Quick question: What does the “% similarity” bar mean? It’s not obviously functional (GO-based) nor is it obviously structural. Several rounds of practice have been waylaid by me misinterpreting what it means for a protein to be 95% similar to the target...
- Brinedew 23 Dec 2025 5:27 UTC
  1 point
  Thanks for checking it out.
  It took me many iterations to settle on the exact logic. At first, I started with just the HiG2Vec GO similarity embedding. It did alright, but I didn’t like how members of the same protein family got wildly different scores based on just pathway participation or tissue expression. I added an ESM2 sequence-based embedding to tame this inconsistency. It also resulted in the “your guess is top-9 similar” hint being arranged in order of increasing sequence similarity, which is a nice bonus for late-game triangulation.
  I tried making a shared embedding out of the two separate ones, but ran into statistical issues with how differently I needed to normalize them. Instead, I opted to calculate intermediate “evidence strengths” for each embedding separately and then combine them into a final similarity percentage in a way that highly rewards both “only similar by sequence” and “only functionally similar”, so that a player of any background has a chance to close in on the target using their own experience, whether that experience is in pathways or in structural families.
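
To make the "evidence strengths" idea concrete, here is a minimal sketch of one way such a combination could work. It is not the actual Geneguessr formula, which the comment above does not spell out. Everything here is an assumption made for illustration: the function names, the use of a percentile rank against a background of similarities as the per-embedding "evidence strength", and the noisy-OR rule used to merge the two strengths.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def evidence_strength(sim, background):
    """Hypothetical per-embedding normalization: the fraction of background
    similarities that are less than or equal to `sim` (a percentile rank in
    [0, 1]). This puts the GO-based and sequence-based similarities on a
    common scale without building a shared embedding."""
    return sum(1 for b in background if b <= sim) / len(background)


def combined_similarity(go_strength, seq_strength):
    """Assumed combination rule (noisy-OR): the result is high if EITHER
    strength is high, and highest when both are. Returns a percentage."""
    return 100.0 * (1.0 - (1.0 - go_strength) * (1.0 - seq_strength))


if __name__ == "__main__":
    # Functionally close but unrelated by sequence still scores high.
    print(round(combined_similarity(0.9, 0.0), 1))
    # Moderate evidence from both sources compounds.
    print(round(combined_similarity(0.5, 0.5), 1))
```

With a rule of this shape, a guess that matches the target on only one axis still lands near the top of the bar, which fits the stated goal of rewarding both kinds of expertise. It would also explain how a "95% similar" protein can be unrelated in function, or unrelated in sequence.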