it’s good to have similar DNAs assigned similar representations, and one can read off phylogenic distance from similarity
You’re bang-on here. The averaging-over method would prevent reading off sequence similarity, but it wouldn’t prevent reading off genome-level characters such as k-mer spectra, GC content, or average linkage disequilibrium, which correlate with phylogeny but do not determine phylogenetic placement. The second half of the Goodfire article makes this point, yet it never notes that this contradicts the first half.
I would be very surprised if Evo etc. faithfully represented a phylogeny, because phylogenies do not contain much information about genomic characteristics except at very broad scales. A better test would be to build a phylogeny from the representations and compare it with the reference phylogeny, rather than comparing distances in representation space with phylogenetic distances. Phylogenies are used for their topology and branch lengths, of which phylogenetic distance is a poor summary.
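To make the tree-versus-tree comparison concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the taxon names, the 2-D "embeddings", and the reference trees are toy values, UPGMA stands in for whatever tree-building method one would actually use, and the score is a rooted Robinson-Foulds-style count of clades found in one tree but not the other. It is not how Evo or the Goodfire article do anything.

```python
from itertools import combinations
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def upgma_clades(embeddings):
    """Build a rooted tree by average linkage (UPGMA) and return its
    non-trivial clades as a set of frozensets of leaf names."""
    clusters = [frozenset([name]) for name in embeddings]
    leaf_dist = {
        frozenset([a, b]): euclidean(embeddings[a], embeddings[b])
        for a, b in combinations(embeddings, 2)
    }

    def avg_dist(c1, c2):
        # UPGMA: mean of all leaf-to-leaf distances between two clusters.
        total = sum(leaf_dist[frozenset([a, b])] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clades = set()
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: avg_dist(*p))
        merged = c1 | c2
        clusters = [c for c in clusters if c not in (c1, c2)] + [merged]
        clades.add(merged)
    clades.discard(frozenset(embeddings))  # the root clade carries no signal
    return clades


def tree_clades(tree):
    """Collect non-trivial clades from a nested-tuple tree,
    e.g. (("A", "B"), ("C", "D"))."""
    clades = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        clades.add(members)
        return members

    clades.discard(leaves(tree))  # drop the root clade
    return clades


def rooted_rf(clades_a, clades_b):
    """Number of clades present in exactly one of the two trees."""
    return len(clades_a ^ clades_b)


# Hypothetical toy data: four taxa with made-up 2-D representations.
embeddings = {"A": (0.0, 0.0), "B": (0.1, 0.0), "C": (1.0, 1.0), "D": (1.1, 1.0)}
concordant_reference = (("A", "B"), ("C", "D"))
discordant_reference = (("A", "C"), ("B", "D"))

embedding_tree = upgma_clades(embeddings)
print(rooted_rf(embedding_tree, tree_clades(concordant_reference)))
print(rooted_rf(embedding_tree, tree_clades(discordant_reference)))
```

A score of zero means the two trees share every clade. Note that this sketch compares topology only; branch lengths would need a separate measure, and a real analysis would more likely use unrooted bipartitions and an established phylogenetics library.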