I have my genome data from both 23andMe and BGI, and I am wondering what to make of it. BGI reports about 33 times as many SNPs as 23andMe: 19695817 from BGI against 598897 from 23andMe.
Of these, 475801 are reported by both. To see how well the two agree, I counted how often each pair of results occurs. Grouped by type of match or mismatch, and in descending order of count within each group, this is what I get. Each row gives the count, then the 23andMe result, then the BGI result. (No individual SNPs are identified here.)
87565 CC CC
86952 GG GG
75289 TT TT
75087 AA AA
31069 CT CT
30817 AG GA
27542 CT TC
27484 AG AG
6818 AC CA
6767 GT GT
6373 AC AC
6297 GT TG
270 CG GC
251 CG CG
146 AT TA
138 AT AT
420 C C
402 G G
336 A A
291 T T
582 CT --
576 AG --
426 CC --
399 GG --
348 -- CC
340 -- GG
330 TT --
316 AA --
270 -- AA
240 -- TT
139 GT --
136 AC --
123 -- GA
121 -- CT
113 -- TG
110 -- TC
104 -- GT
101 -- CA
93 -- AC
86 -- AG
26 -- --
5 -- AT
4 CG --
4 -- GC
3 -- TA
2 AT --
2 -- CG
14 C --
13 T --
9 G --
8 -- C
7 A --
5 -- G
2 -- T
51 CC CT
33 AG AA
32 AG GG
31 GG GA
31 CT TT
30 CT CC
25 TT TC
23 AA AG
18 GG AG
15 CC TC
15 AA GA
11 TT TG
11 CC CA
9 TT GT
9 TT CT
9 AC AA
7 CC AC
7 AC CC
6 GT TT
6 GT GG
6 GG GT
6 AA AC
5 TT CC
4 GG AA
4 CC CG
4 AA CA
3 CG CC
3 CC TT
3 AT TT
2 TT TA
1 TT GA
1 GG TG
1 GG GC
1 GG CG
1 GG CC
1 CG GG
1 CC GC
1 CC AA
1 AT AA
1 AA GG
1 G A
The first five lines make sense: the two analyses agree for a large proportion of the SNPs. The sixth shows 23andMe reading AG where BGI reads GA, 30817 times. It looks like 23andMe report unequal pairs in alphabetical order, while BGI report them in either order. Taking these as matches, the great majority of SNPs reported by both (470314 of 475801, summing the rows above) are reported identically.
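A minimal sketch of that tally, treating AG and GA as the same result by sorting the alleles before comparing. The function names, SNP ids and calls are made up for illustration; it assumes each data set has already been loaded into a dict mapping SNP id to reported call.

```python
from collections import Counter

def normalise(call: str) -> str:
    """Sort the alleles so 'GA' and 'AG' compare equal; leave no-calls as they are."""
    return call if "-" in call else "".join(sorted(call))

def tally_pairs(calls_a: dict, calls_b: dict) -> Counter:
    """Count (call_a, call_b) pairs over the SNP ids present in both inputs."""
    shared = calls_a.keys() & calls_b.keys()
    return Counter((normalise(calls_a[s]), normalise(calls_b[s])) for s in shared)

# Hypothetical ids and calls, not real data.
a = {"snp1": "AG", "snp2": "CC", "snp3": "CT", "snp4": "--"}
b = {"snp1": "GA", "snp2": "CC", "snp3": "TT", "snp5": "AA"}

for (x, y), n in tally_pairs(a, b).most_common():
    print(n, x, y)
```

Only the three ids present in both dicts are counted, and the AG/GA pair is tallied as AG AG.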
Then there are a few thousand SNPs that one or other analysis (in 26 cases, both) list in their output but don’t report anything for. What causes this?
Finally, there are a few hundred that the two analyses just give different results for. For most of these, one reports homozygosity for an allele present in the other, but in a few cases the reports are completely different, e.g. one occurrence of TT/GA.
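The groups described above can be sketched as a small classifier. The category names are my own labels, not anything either company uses, and "shared allele" here is slightly broader than "one side homozygous for an allele present in the other": it also covers a pair such as AC/AG.

```python
def classify(call_a: str, call_b: str) -> str:
    """Classify a pair of calls as no-call, match, or one of two kinds of mismatch."""
    if "-" in call_a or "-" in call_b:
        return "no-call"              # at least one side reported nothing
    a, b = sorted(call_a), sorted(call_b)
    if a == b:
        return "match"                # identical once allele order is ignored
    if set(a) & set(b):
        return "shared allele"        # e.g. CC vs CT
    return "no shared allele"         # e.g. TT vs GA

for pair in [("AG", "GA"), ("CC", "CT"), ("TT", "GA"), ("--", "CC")]:
    print(pair, classify(*pair))
```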
Is this amount of mismatch typical for such analyses?
I received exactly the same number of SNPs from BGI, so it looks like our data were processed under the same pipeline. I’ve found three people who have publicly posted their BGI data: two at the Personal Genome Project (hu2FEC01 and hu41F03B, each with 5,095,048 SNPs), and one on a personal website (with 18,217,058 SNPs).
Then there are a few thousand SNPs that one or other analysis (in 26 cases, both) list in their output but don’t report anything for. What causes this?
The double dashes are no-calls. 23andMe reports on a set list of SNPs, and instead of omitting an SNP when they can’t confidently determine the genotype, they indicate this with a double dash.
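As a rough illustration, here is one way to count the no-calls in a 23andMe export. This assumes the usual raw-data layout of `#` comment lines followed by tab-separated rsid, chromosome, position and genotype columns; the function name and the sample rows are made up.

```python
def read_23andme(lines):
    """Return {rsid: genotype} from the lines of a 23andMe raw data export."""
    calls = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        rsid, _chrom, _pos, genotype = line.split("\t")
        calls[rsid] = genotype
    return calls

# Hypothetical rows; with a real file, pass open(path) instead.
sample = [
    "# rsid\tchromosome\tposition\tgenotype",
    "rs0000001\t1\t1000\tAG",
    "rs0000002\t1\t2000\t--",
    "rs0000003\tX\t3000\tC",
]
calls = read_23andme(sample)
uncalled = [rsid for rsid, genotype in calls.items() if "-" in genotype]
print(len(calls), "SNPs,", len(uncalled), "no-calls")
```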
Is this amount of mismatch typical for such analyses?
This seems normal considering the error rates from 23andMe that others have been reporting (example). I don’t know about BGI’s error rates.
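To put a number on the mismatch, the rows of the table in the question can be summed by group. The three totals below are my own sums of those rows (they add up to the 475801 shared SNPs), and the choice of denominators is an assumption about what is worth measuring.

```python
# Group totals summed from the table in the question.
matches = 470_314     # same result once allele order is ignored
no_calls = 5_057      # at least one side is "--"
mismatches = 430      # both sides called, results differ

shared = matches + no_calls + mismatches   # 475801, as stated in the question
both_called = matches + mismatches

print(f"shared SNPs:   {shared}")
print(f"no-call rate:  {no_calls / shared:.3%}")
print(f"mismatch rate: {mismatches / both_called:.3%}")
```

That works out to roughly 1.06% of shared SNPs with a no-call on at least one side, and roughly 0.09% disagreement among SNPs called by both.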
I think it might be possible to accurately guess the actual genotypes for some of the mismatches by imputing the genotypes with something like IMPUTE2 (for each mismatched SNP, leave it out and impute it using the nearby SNPs). This will take many hours of work, though, and you might as well phase and impute across the whole genome if you have the time, interest, and processing power to do so (I’ve been meaning to try this out to learn more about how these things work).
Interesting. Thanks for posting this!