Concepts – Imputation

Until recently, the word imputation wasn’t a part of the vocabulary of genetic genealogy, but earlier this year, it became a factor and will become even more important in coming months.

Illumina, the company that provides chips to companies that test autosomal DNA for genetic genealogy, has obsoleted the OmniExpress chip previously in use, forcing companies to utilize its new Global Screening Array (GSA) chip when their current chip supply runs out.

Only about 20% of the DNA locations previously tested by genetic genealogy companies are tested on this new platform. Illumina has encouraged vendors to utilize a process called imputation to infer results for locations that have not been directly tested in a customer’s DNA, based on the values that are common in populations, so that vendors can achieve backward compatibility with people previously tested on the OmniExpress chip. You can read the technical details of imputation in a document produced by Illumina here.

LivingDNA, which was developing and launching a new product during the transition between chips, was the first vendor out of the gate with a GSA product. Illumina represented imputation to be “very accurate” to LivingDNA, which is consequently how they represented the results to a group of genetic genealogists on a conference call in early 2017. LivingDNA was the lucky company to have the opportunity to “work the bugs out” with Illumina – said with tongue firmly in cheek. LivingDNA provides a list of papers describing their methods here.

Another company, MyHeritage, also uses imputation, but for an entirely different reason. MyHeritage uses imputation to “add” to the DNA results of people who upload results from different vendors. They are the first company to attempt DNA matching between people using imputation, and they initially had, and continue to have, matching issues. In their initial release blog in September 2016, they state that imputation matching “is accomplished with very high accuracy.” In their Q&A blog in November 2016, they state that “imputation may introduce errors so we are in the process of fine-tuning it.” They have made changes since matching was originally introduced, but they still struggle with matching accuracy, most recently discussed by Leah Larkin in her article, MyHeritage Matching.

DNA.LAND does not perform testing, but is a nonprofit in the health care industry that utilizes imputation for health-related research – imputing approximately 38.3 million locations in addition to the 700,000 locations in customers’ uploaded files. In order to encourage people to upload their test results, DNA.LAND performs matching and ethnicity reporting. Like MyHeritage, their matching results are problematic. DNA.LAND explains imputation and summarizes by stating that “any reported value should never be taken as-is without further careful analysis.” I will be publishing an article shortly about DNA.LAND.

23andMe, on August 9, 2017, released their V5 product utilizing the new GSA chip. They have not said how they are addressing the imputation challenge and backward compatibility. Several issues have been reported.

As you can see, the genetic genealogy landscape is changing and like it or not, imputation is a part of the new scenery.

What, Exactly, is Imputation?

Imputation is the process whereby your DNA is tested and then the results are “expanded” by inferring results for additional locations, meaning locations that haven’t been tested, by using information from results you do have. In other words, the DNA in adjacent locations is predicted, or imputed, by its association with its traveling companions. In DNA, traveling companions often travel together, but not always.

Imputation is built upon two premises:

1 – that DNA locations are usually inherited together in groups, a phenomenon known as linkage disequilibrium.

2 – that people from common populations share a significant amount of the same DNA.
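
As a rough illustration of how these two premises combine, here is a toy sketch in Python. It is not how any vendor actually imputes; real tools use statistical models and very large reference panels. The reference panel, the `impute` function and the four-letter haplotypes below are all invented for illustration.

```python
from collections import Counter

# Hypothetical reference panel: each string is one person's DNA at four
# neighboring locations. These values are made up for this sketch.
REFERENCE_PANEL = [
    "ACGT",
    "ACGT",
    "ACGA",
    "TCCA",
    "TGCA",
]


def impute(observed, panel=REFERENCE_PANEL):
    """Fill '?' (untested) locations using the closest panel members."""
    tested = [i for i, base in enumerate(observed) if base != "?"]

    def score(haplotype):
        # Count how many tested locations agree with this panel member.
        return sum(haplotype[i] == observed[i] for i in tested)

    best = max(score(h) for h in panel)
    closest = [h for h in panel if score(h) == best]

    result = []
    for i, base in enumerate(observed):
        if base != "?":
            result.append(base)  # keep the value that was really tested
        else:
            # Take the most common value among the closest panel members.
            counts = Counter(h[i] for h in closest)
            result.append(counts.most_common(1)[0][0])
    return "".join(result)


filled = impute("AC?T")
```

With this made-up panel, the blank in `AC?T` is filled with G, because both panel members that agree at the tested locations carry G there. A tester who actually has a rare value at that location would still be assigned G, which is one way an imputed value can be wrong.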

An example that DNA.LAND provides is the following sentence.

I saw a blue ca_ on your head.

There are several letters that are more likely than others to be found in the blank, and some words would be more likely to be found in this sentence than others.

A less intuitive sentence might be:

I saw a blue ca_ yesterday.

DNA.LAND also says very clearly that imputed values can be incorrect. They also state that the values inferred are the common values, not rare mutations, and imputed results are most accurate in Caucasian populations and least accurate in African populations whose DNA is the most variant of any continental group. They caution against using these results for medical diagnosis.

SNPedia (Promethease) cautions against using imputed results as well and suggests that files utilizing only tested results, without imputed results, are more accurate.

Why Imputation?

Looking at this Autosomal SNP Comparison Chart, provided by the ISOGG Wiki, you can see the difference in the number of actual common locations tested by the various vendors.

This means that companies that allow uploads from different vendors utilizing widely divergent chip results have to do something in order to successfully compare the disparate files against each other for matching. Using 23andMe as an example, even though they don’t allow uploads from other companies, they have to do something to accommodate matching between the new GSA V5 chip and their earlier V3 and V4 chips.

Imputation Example

Let’s take a look at how imputation is used to “equalize” files uploaded from various vendors that only contain marginal amounts of overlap.

I’m using MyHeritage as an example. Imputation, in this case, is utilized in an attempt to make marginally compatible files more compatible.

The files from the Ancestry V2 kit and the Family Tree DNA kit have only about 382,000 locations in common, meaning about 300,000 locations are not in common. In order to attempt to equalize these and other kits, MyHeritage attempts to use imputation to deduce the DNA that a tester would/should/might have in the missing segments, based on various statistical factors that include the tester’s population and existing DNA.

Please note that for purposes of concept illustration, I have shown all of the common locations, in blue, as contiguous. The common locations are not contiguous, but are scattered across the entire range that each vendor tests.

You can see that the number of imputed locations for matching between two people, shown in tan, is larger than the number of actual matching locations shown in blue. The amount of actual common data being compared is roughly 382,000 of 1,100,000 total locations, or 35%.
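
The 35% figure can be checked with a few lines of Python. The kit contents below are toy stand-ins (hypothetical labels rs1 to rs7), and `overlap_summary` is a name made up for this sketch; only the 382,000 and 1,100,000 counts come from the text above.

```python
def overlap_summary(snps_a, snps_b):
    """Summarize how much two kits' tested locations overlap."""
    a, b = set(snps_a), set(snps_b)
    common = a & b   # locations tested in both kits
    total = a | b    # every location tested in either kit
    return {
        "common": len(common),
        "only_a": len(a - b),
        "only_b": len(b - a),
        "total": len(total),
        "fraction_common": len(common) / len(total) if total else 0.0,
    }


# Toy kits with invented location labels, just to show the mechanics.
kit_a = ["rs1", "rs2", "rs3", "rs4", "rs5"]
kit_b = ["rs4", "rs5", "rs6", "rs7"]
summary = overlap_summary(kit_a, kit_b)

# The same ratio using the counts quoted in the article.
article_fraction = 382_000 / 1_100_000
print(f"{article_fraction:.1%} of locations are actually tested in both kits")
```

Everything outside the common portion would have to be imputed in one kit or the other before the two files could be compared location by location.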

Stay tuned for an upcoming series of articles about imputation and results in various scenarios.

______________________________________________________________

Disclosure

I receive a small contribution when you click on some of the links to vendors in my articles. This does NOT increase the price you pay but helps me to keep the lights on and this informational blog free for everyone. Please click on the links in the articles or to the vendors below if you are purchasing products or DNA testing.

Thank you so much.

47 thoughts on “Concepts – Imputation”

  1. So the more “advanced” the chip, the less actual testing is done by the company but there is a significant increase in “educated guesses” for matching? If that’s true, why use any of it for genealogy? It seems they are adding more, not less uncertainty to the results. I hope that isn’t so!

    • It’s not actually less testing, it’s that the new Illumina chip is testing different locations and there are few in common with their earlier chip. So compatibility with previous versions is an issue because there is so little overlap. The companies are not happy campers, I’m sure, but they don’t have any choice since they depend on Illumina for equipment.

      • How is it that the companies have to go along with this change when they buy so many tests? I’m not happy about this new trend at all! I just sent in two tests to 23andMe for my husband’s parents. My child’s was done a few years ago. So, what does this mean? Will the differences affect the matches? It would seem so.
        And how does it affect Living DNA transfers? I’ve seen it reported that we will soon be able to transfer our DNA raw data to Living DNA. But, if the method they use will not be compatible with old tests, then will it be worthwhile to transfer our tests?
        Thanks for your article. Looking forward to your updates.

      • I think the new tests are designed for medical research which is probably a bigger market than DNA genealogy and that is why the changes were made.

  2. Hi Roberta – thanks for the “ughing” update. Just when I was starting to feel comfortable in comparing autosomal DNA results. You didn’t mention Gedmatch in your article….how are they approaching imputation when providing matches from a number of different labs?

    Thanks for keeping us on our toes!

    Deborah

  3. Thank you for explaining this Roberta.

    I find it quite worrying. I have a lot of matches who match me with a number of companies and the difference in results for any one match at My Heritage compared to the other companies is significantly different in most cases. Sometimes the My Heritage cM are double what the other companies are suggesting.

  4. Excellent parsing- cheers!

    The utilities on Gedmatch are also notorious for imputing and creating “zombie populations” in order to achieve their testing platform goals. In each scenario you will inevitably see a tester say “this is closest to what I should be in terms of ethnicity”, which is always met with a groan from me. Unless each of us are able to genetically confirm every line on each of our pedigrees, there is no way to know what our ethnicity should be parroting.

    With that said, I am an advocate of TribeCode, who did not impute. They dared to test full blocks of dna but paid the price as processing times and the enormity of the files forced them to re-evaluate.

    How different are these chips from each other in terms of results? The more admixed you are the more varied the results become. DNA.Land really lacks consistency despite the limited improvements made from the initial rollout.

    I would prefer to have results based upon my FGS as opposed to imputation, but that is a steep task!

  5. What will FTDNA do with Family Finder? If at all possible, it needs to maintain a “standard” kit using real SNPs.

    Imputation is a disaster for the field. Never have so many experts been so wrong. For example, take my mother’s kit. Using GEDmatch’s Tier 1 ‘one-to-many’, she has 15436 matches @7.0 cM minimum segment size. With DNA.land she has 4 high certainty matches (one is me) and 1 speculative.

    The problem is that mainstream genetic genealogy does not recognize that there is a huge amount of common ancestry within the past 300 years. Without taking that into account, the imputations are the old “garbage in, garbage out”.

    If you need some evidence of common ancestry, take your GEDmatch ‘X one-to-many’ list and start triangulating them. If two segments overlap by more than a handful of SNPs, then the logical union of those two segments has the same common ancestor. Before long, you will come to realize that virtually all X-matches point to “The Common Ancestor”.

    It is easy to brush this exercise aside, but if you do it (some males may not have enough matches), coming to any other conclusions is difficult. By the way, big matching X-DNA segments more than a few generations back are not IBD, but are reconstructed DNA from the CA via contributions from multiple lines to him.

    Jack Wyatt

  6. Roberta,

    Thanks so much for providing the information that we don’t necessarily get from the vendors until it actually happens!
    In one of your earlier question responses, you say that the new chip is testing different locations, but not less.
    What is the scientific reasoning behind testing these different locations? Is it because they have discovered that these locations offer more useful information because they show more variation than the other locations? Is there an agenda associated with the medical use of results? Are they trying to gather a significant amount of information on other locations so that ultimately they have a collection of data across the entire genome?
    I am very curious if this change is being driven scientifically or by business objectives.
    Do you have any insight on the motives behind Illumina changing the chip?
    Thanks,
    Alison

  7. This disturbs me. First of all, DNA tests are expensive for many, maybe for most. When I purchase a DNA test it puts a strain on my savings and I have to do without something else. I therefore do NOT want to pay what I consider “too much” to a company only to get an “imputation”. I am paying for results of the test, not a guess at a match. I have been considering another test but if the results are to be “imputed” it’s a NO BUY.

  8. Many Thanks Roberta,
    There is no doubt in my mind that the bottom line is the driver. The SNPs tested on the new chip must be designed for a different purpose than genealogy. So genealogy testing companies are not the biggest market and must do the best they can with the available chips. Should we advise testing ASAP with companies that are still using the prior chip, and be sure we archive all the data we now have?

    • Medical testing is by far, the largest market for DNA chips. Family Tree DNA is still utilizing their older chips. I don’t know about Ancestry other than they had a custom chip, but so did 23andMe.

  9. Thank you for this excellent article, Roberta.

    I was going to voice my extreme displeasure of any form of imputation from the viewpoint of my statistical background. One imputation is bad enough. But then comparing an imputation to an imputation more than doubles the chance of error.

    My hope is that the DNA companies decide to just use the actual data from the chip and don’t impute, and work to come out with other better statistical and scientific methods to allow match analysis of the locations in common between the new and old chip.

    The other comments really cover everything else I wanted to say.

  10. If the testing companies depend on Illumina equipment, then it stands to reason that Illumina depends on the testing companies to buy their product. If their product does not EXPAND on past testing, then why would the companies buy it? Why would they buy something that leaves their customers with more confusing, possibly/probably misleading information when comparing with older results? Are we looking at a situation of planned obsolescence? I’m going to sound cynical here, but if the testing companies don’t boycott this NEW method, and insist the OLD testing be included with the new (which would be the consumer’s logical assumption in an “upgrade”), then perhaps the testing companies aren’t concerned about results at all. They already make it difficult for consumers to access the information they have. They don’t seem concerned about meeting consumer needs. Maybe they are hoping people who have tested before will test again on the NEW platform, for a fee of course!

    • They really don’t have a choice in the matter. I really liked the old model Jeep better, and so did lots of people, but Jeep obsoleted it anyway. Either buy the new one or something else, and in the genetics marketspace, Illumina is the giant in the marketplace.

  11. This explains the varied results I’ve gotten from MyHeritage.com, FamilyTreeDNA.com, and Ancestry.com on a recent match. At MyHeritage.com, I recently discovered a new top ten match, one they labeled as having 87.7 cMs shared across 8 segments. After contacting this new cousin and investing several days trying to discover our common connection, we determined that he had transferred his results from Ancestry.com and I had transferred my results from FamilyTreeDNA.com. I’ve tested at all three (Ancestry.com, FamilyTreeDNA.com, and 23andme.com). We discovered that at Ancestry.com, where we both had tested, they were reporting only 8.8 cMs on 2 segments. (Damn Timber). At FamilyTreeDNA.com where my new cousin also transferred his Ancestry.com data, they reported 39 cMs and 9 segments with only one significant segment being 20 cMs. I’m much more willing to believe the results at FamilyTreeDNA.com, but I think MyHeritage.com has some serious growing pains yet to go through.

  12. Twenty percent! That is shocking. Illumina is just begging for competition. Once again, genealogy is considered a poor cousin, one that can be ignored in the exploding field of genetics.

  13. I always thought the testing had to be pretty random. But to what extent, I had no idea (and still don’t). I would like to know why these companies did not tell us just how random their results would be, before we, the consumers, spent all this money on testing! Perhaps at the beginning the companies expected the consumer to stay with the initial testing companies for comparisons. I don’t recall being promised the moon, but the moon was in view, after all!

    • I wouldn’t call the testing random. The vendors are doing the best they can under the circumstances. Strong matches are still strong matches. FTDNA has always been straightforward about why they only provide high level matches with kits transferred using other chips.

  14. “Imputation” sounds like a more sophisticated term than “implication”. Accuracy is not the goal here, prognostication is. The vendors only care that you continue to buy more kits. This field is drifting more and more toward one of reading tea leaves. People will see whatever they want to see in their reports.

  15. If those of us who have results with the older chips decide to upgrade when the option becomes available at 23andMe, do you think the ethnicity estimates would be somewhat different since they will now be using imputation? Sorry if I still find this somewhat confusing.

  16. Hi

    I found that FTDNA, MyHeritage, GEDmatch and DNA.LAND have all found my relative from a union which is 220 years ago.
    It is noted as a 2nd cousin in FTDNA.

    23andme and natgeno is zero value for me in regards to matching

    It is the only sample I have found and confirmed as 100%

    I now, ignore everything from 4th cousin onwards as they would be at least 500 years old and basically impossible to trace.

    regards on another fine post.
    victor

  17. In a word ‘imputation’ is a tautology argued under the guise of ‘science’. Imputation is anything but science. Looks like nothing but an add-on to keep the cashflow coming from the genetic genealogy and ethnicity market.

    BTW last week MyHeritage found a ‘good’ match between the FTDNA uploads for my mother and myself … only trouble being I shared twice as many cM and as many segments as my mother did. No, it was not a double paternal/maternal match as my paternal 1st cousin was a no match.

    I really do like the way MyHeritage present their DNA results so it is unfortunate that their results are problematic at times. (However, I cannot say I have the same view of MyHeritage trees … Ancestry tree presentation with MyHeritage DNA presentation would be a great combo!)

  18. Pingback: Autosomal DNA Transfers – Which Companies Accept Which Tests? | DNAeXplained – Genetic Genealogy

  19. Roberta, I’m puzzled by the FTDNA vs. MyHeritage comparison of common autosomal SNPs in the table (which I realize is from the ISOGG wiki). If the cell for FTDNA vs. FTDNA is 698,179, how can the cell for FTDNA vs. MyHeritage be higher, at 702,442 (which is the same as MyHeritage vs. MyHeritage)? Am I misunderstanding something, or is this an error?

  20. Pingback: Imputation Matching Comparison | DNAeXplained – Genetic Genealogy

  21. FYI — I noticed what appears to be an error on the Autosomal SNP Comparison Chart on the ISOGG wiki site. The matrix shows the SNP count as 702,442 at the intersection of FTDNA and MyHeritage. This is higher than the number of SNPs tested on the FTDNA chip, which is 698,179 according to the chart.

  22. Pingback: Imputation Analysis Utilizing Promethease | DNAeXplained – Genetic Genealogy

  23. Is it feasible to ‘impute’ with close relatives? On GEDmatch, if I find a strong overlap with a close cousin that is using a different platform, then I should be able to incorporate all the additional SNPs into my own genotype with confidence. I’m new to this stuff, so I have two key questions

    – are there tools that would allow me to cut out a segment of a genotype file and merge it into my own

    – since my newly edited file would have an unique set of SNPs, would this franken-genotype be useable on any ancestry comparison site?

  24. Pingback: 2018 – The Year of the Segment | DNAeXplained – Genetic Genealogy

  25. I do group comparisons occasionally, and recently did one of the largest to date with multiple results from multiple test companies. When one person asked me what this meant, I began looking into her relationship with this family. Previously thought to be rock solid as a strong DNA match, it appeared she was not a match. Going back to her original testing company site, her results were quite different, and reconfirmed that she was the rock solid match with this family.
    So, I have cautioned my group about this, while I read up on it, and the best I can do at this point is to postpone any comparisons that involve DNA transfers or anything recent.
    This appears to be a huge setback, at a time when this industry was really expanding. And for those who were just beginning to work with DNA, I can only imagine the harm / setback it will be for them. It appears someone made some very poor decisions.

  26. Pingback: DNA: In Search of…Signs of Endogamy | DNAeXplained – Genetic Genealogy
