Until recently, the word imputation wasn’t a part of the vocabulary of genetic genealogy, but earlier this year, it became a factor and will become even more important in coming months.
Illumina, the company that provides chips to companies that test autosomal DNA for genetic genealogy, has obsoleted the OmniExpress chip previously in use, forcing those companies to move to its new Global Screening Array (GSA) chip when their current chip supply runs out.
Only about 20% of the DNA locations previously tested by genetic genealogy companies are tested on this new platform. Illumina has encouraged vendors to use the process called imputation to infer, for their customers, DNA results that are common in populations but have not been directly tested in the customer’s DNA, so that vendors can achieve backwards compatibility with people previously tested on the OmniExpress chip. You can read the technical details of imputation in a document produced by Illumina here.
LivingDNA, which was developing and launching a new product during the transition time between chips, was the first vendor out of the gate with a GSA product. Illumina represented imputation to LivingDNA as “very accurate,” which is consequently how LivingDNA represented the results to a group of genetic genealogists on a conference call in early 2017. LivingDNA was the lucky company to have the opportunity to “work the bugs out” with Illumina – said with tongue firmly in cheek. LivingDNA provides a list of papers describing their methods here.
Another company, MyHeritage, also uses imputation, but for an entirely different reason. MyHeritage uses imputation to “add” to the DNA results of people who upload results from different vendors. They are the first company to attempt DNA matching between people using imputation, and they have had matching issues from the start. In their initial release blog in September 2016, they state that imputation matching “is accomplished with very high accuracy.” In their Q&A blog in November 2016, they state that “imputation may introduce errors so we are in the process of fine-tuning it.” They have made changes since matching was originally introduced, but they still struggle with matching accuracy, most recently discussed by Leah Larkin in her article, MyHeritage Matching.
DNA.LAND does not perform testing, but is a nonprofit in the health care industry that utilizes imputation for health-related research, imputing approximately 38.3 million locations in addition to the 700,000 locations in customers’ uploaded files. In order to encourage people to upload their test results, DNA.LAND performs matching and ethnicity reporting. Like MyHeritage, their matching results are problematic. DNA.LAND explains imputation and summarizes by stating that “any reported value should never be taken as-is without further careful analysis.” I will be publishing an article shortly about DNA.LAND.
23andMe, on August 9, 2017, released their V5 product utilizing the new GSA chip. They have not said how they are addressing the imputation challenge and backward compatibility. Several issues have been reported.
As you can see, the genetic genealogy landscape is changing and like it or not, imputation is a part of the new scenery.
What, Exactly, is Imputation?
Imputation is the process whereby your DNA is tested and then the results are “expanded” by inferring results for additional locations, meaning locations that haven’t been tested, using information from the results you do have. In other words, the DNA in adjacent locations is predicted, or imputed, by its association with its traveling companions. In DNA, neighboring locations often travel together, but not always.
Imputation is built upon two premises:
1 – that DNA locations are usually inherited together in groups, a pattern known as linkage disequilibrium.
2 – that people from common populations share a significant amount of the same DNA.
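The two premises above can be sketched in a few lines of Python. This is a toy illustration only, not the method any vendor uses: the reference panel, the `impute` function, and the five-letter haplotypes are all invented for the example, and real imputation relies on far larger panels and statistical models.

```python
from collections import Counter

# Hypothetical reference panel: each string is one haplotype across five
# neighboring DNA locations. All values are made up for illustration.
REFERENCE_PANEL = [
    "ACGTA",
    "ACGTA",
    "ACGTA",
    "ACCTA",
    "TTGCA",
    "TTGCG",
]


def impute(observed, panel=REFERENCE_PANEL):
    """Fill each '?' in `observed` with the most common value found among
    panel haplotypes that agree with every tested location.

    Returns the filled-in string and, for each imputed position, the share
    of agreeing panel haplotypes that support the chosen value.
    """
    tested = [(i, a) for i, a in enumerate(observed) if a != "?"]
    # Premise 2: keep only panel members that share the tester's tested DNA.
    matches = [h for h in panel if all(h[i] == a for i, a in tested)]
    if not matches:
        return observed, {}  # nothing to infer from

    result = list(observed)
    support = {}
    for i, a in enumerate(observed):
        if a == "?":
            # Premise 1: the untested location is assumed to travel with
            # its tested neighbors, so take the most common companion.
            value, count = Counter(h[i] for h in matches).most_common(1)[0]
            result[i] = value
            support[i] = count / len(matches)
    return "".join(result), support


filled, support = impute("AC?TA")
print(filled, support)
```

With this invented panel, the missing location is filled with G because three of the four agreeing haplotypes carry G. A tester who actually carries the less common C at that location would receive the wrong value, which is exactly the kind of error imputation can introduce.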
An example that DNA.LAND provides is the following sentence.
I saw a blue ca_ on your head.
There are several letters that are more likely than others to be found in the blank, and some words are more likely than others to be found in this sentence.
A less intuitive sentence might be:
I saw a blue ca_ yesterday.
DNA.LAND also says very clearly that imputed values can be incorrect. They also state that the values inferred are the common values, not rare mutations, and imputed results are most accurate in Caucasian populations and least accurate in African populations whose DNA is the most variant of any continental group. They caution against using these results for medical diagnosis.
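The fill-in-the-blank sentences can be turned into a small sketch that shows why the common value always wins. The word counts below are hypothetical numbers chosen for illustration; they do not come from DNA.LAND, and `fill_blank` is an invented helper.

```python
# Hypothetical counts of how often each word completes "I saw a blue ca_"
# in two different contexts. The numbers are invented for illustration.
CONTEXT_COUNTS = {
    "on your head": {"cap": 90, "cat": 9, "car": 1},
    "yesterday": {"car": 40, "cap": 30, "cat": 20, "cab": 10},
}


def fill_blank(context):
    """Return the most common completion for the context and the share of
    the counts that support it."""
    counts = CONTEXT_COUNTS[context]
    total = sum(counts.values())
    word = max(counts, key=counts.get)
    return word, counts[word] / total


for context in CONTEXT_COUNTS:
    word, share = fill_blank(context)
    print(f"I saw a blue {word} {context}. (support: {share:.0%})")
```

Under these assumed counts, "on your head" gives a confident guess, while "yesterday" gives a guess supported by well under half of the examples. In both cases the rare completion is never chosen, which mirrors the point that imputation returns common values rather than rare mutations.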
SNPedia (Promethease) cautions against using imputed results as well and suggests that files utilizing only tested results, without imputed results, are more accurate.
Looking at this Autosomal SNP Comparison Chart, provided by the ISOGG Wiki, you can see the difference in the number of actual common locations tested by the various vendors.
This means that companies that allow uploads from different vendors utilizing widely divergent chip results have to do something in order to successfully compare the disparate files against each other for matching. Using 23andMe as an example, even though they don’t allow uploads from other companies, they have to do something to accommodate matching between the new GSA V5 chip and their earlier V3 and V4 chips.
Let’s take a look at how imputation is used to “equalize” files uploaded from various vendors that only contain marginal amounts of overlap.
I’m using MyHeritage as an example. Imputation, in this case, is utilized in an attempt to make marginally compatible files more compatible.
The files from the Ancestry V2 kit and the Family Tree DNA kit have only about 382,000 locations in common, meaning about 300,000 locations are not in common. In order to attempt to equalize these and other kits, MyHeritage attempts to use imputation to deduce the DNA that a tester would/should/might have in the missing segments, based on various statistical factors that include the tester’s population and existing DNA.
Please note that for purposes of concept illustration, I have shown all of the common locations, in blue, as contiguous. The common locations are not contiguous, but are scattered across the entire range that each vendor tests.
You can see that the number of imputed locations for matching between two people, shown in tan, is larger than the number of actual matching locations shown in blue. The amount of actual common data being compared is roughly 382,000 of 1,100,000 total locations, or 35%.
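The arithmetic behind that percentage is easy to check. This short sketch uses only the two figures quoted above; the variable names are my own, and treating the remainder as locations that rely on imputation for at least one of the two testers is an assumption based on the description here.

```python
common_locations = 382_000   # locations actually tested in both kits
total_locations = 1_100_000  # total locations compared after imputation

# Share of the comparison that rests on directly tested data in both kits.
share_tested = common_locations / total_locations

# Assumption: the remainder relies on imputation for at least one tester.
relies_on_imputation = total_locations - common_locations

print(f"Tested in both kits: {share_tested:.1%}")
print(f"Relying on imputation: {relies_on_imputation:,} locations")
```

In other words, roughly two of every three locations being compared are inferred for at least one of the two people rather than measured in both.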
Stay tuned for an upcoming series of articles about imputation and results in various scenarios.