Imputation Analysis Utilizing Promethease

We know in the genetics industry that imputation is either coming or already here for genetic genealogy. I recently wrote two articles, here and here, explaining imputation and its (apparent) effects on matching – or at least the differences between vendors who do and don’t utilize imputation on the segments that are set forth as matches.

I will be writing shortly about my experience utilizing DNA.Land, a vendor that encourages testers to upload their files to be shared with medical researchers. In return, DNA.Land provides matching information and ethnicity – but they do impute results that you don’t have, based on “typical” DNA that is generally inherited with the DNA you do have.

Aside from my own curiosity and interest in health, I have been attempting to determine the relative accuracy of imputation.

Promethease is a third-party site that provides consumers who upload their autosomal DNA files with published information about their SNPs and mutations, whether bad, good or neither, meaning just information. This makes Promethease the perfect avenue for checking the accuracy of the imputed data provided by DNA.Land against the Promethease reports generated from the files of vendors who do not impute.

Even better, I can directly compare the autosomal file from Family Tree DNA that I uploaded to DNA.Land with my resulting DNA.Land file after DNA.Land imputed another 38 million locations. I can also compare the DNA.Land results to an extensive exome test that provided results for some 50 million locations.

Uploading all of the files from various testing vendors separately to Promethease allows me to see which of the mutations imputed by DNA.Land are accurate when compared to actual DNA tests, and if the imputed mutations are accurate when the same location was tested by any vendor.

In addition to the typical genetic genealogy vendors, I’ve also had my DNA exome sequenced, which includes the 50 million locations in humans most likely to mutate.  This means those locations should be the locations most likely to be imputed by DNA.Land.

Finally, at Promethease, I can combine my results from all the vendors where I actually tested to provide the greatest coverage of actually tested locations, and then compare to DNA.Land – providing the most comprehensive comparison.

I will utilize the testing vendors’ actual results to check the DNA.Land imputed results.

Let’s see what the results produce.

The Test Process

The method I used for this comparison was to upload my Family Tree DNA autosomal raw data file to DNA.Land. DNA.Land then took the 700,000+ locations that I did test for at Family Tree DNA, and imputed more than 38 million additional locations, raising my tested and imputed number of locations to about 39 million.

Then, I downloaded my huge DNA.Land file and uploaded it to Promethease, following the Promethease instructions.

In order to do a comparison against the imputed data that DNA.Land provided, I uploaded files from the following vendors individually, one at a time, to Promethease to see which versions of the files provided which results – meaning which mutations the files produced by actual testing at vendors could confirm in the DNA.Land imputed results.

  • DNA.Land (imputed)
  • Genos – Exome testing of 50 million medically relevant locations
  • Ancestry V1 test
  • Ancestry V2 test
  • Family Tree DNA
  • 23andMe V3 test
  • 23andMe V4 test
  • Combined file of all non-imputed vendor files

Promethease provides a wonderful feature that enables users to combine multiple vendors’ files into one run. As a final test, I combined all of my non-imputed files into one run in order to compare all of my non-imputed results, together, with DNA.Land’s imputed results.

Promethease provides results that fall into 3 categories:

  • Bad – red
  • Good – green
  • Grey – “not set” – neither bad nor good, just information

Promethease does not provide diagnoses of any form, just information from the published literature about various mutations and genetic markers and what has been found in research, with links to the sources through SNPedia.

Results

I compiled the following chart with the results of each individual file, plus a combined file made up of all of the non-imputed files.

The results are quite interesting.

The combined run that included all of the vendors files except for DNA.Land provided more “bad” results than the imputed DNA.Land file. 

I expected that the Genos exome test would have covered all of the locations tested by the three genetic genealogy vendors, but clearly not, given that the combined run provides more results than the Genos exome run by itself. In fact, the total locations reported is 80,607 for the combined run and the Genos run alone was only 45,595.

DNA.Land only imputed 34,743 locations that returned results.

Comparison for Accuracy

Now, the question is whether the DNA.Land imputed results are accurate.

Due to the sheer number of results, I focused only on the “bad” results, the ones that would be most concerning, to get an idea of how many of the DNA.Land results were tested in the original uploaded file (from FTDNA) and how many were imputed. Of the imputed locations, I determined how many are accurate by comparing the DNA.Land results to the combined testing results. My hope, of course, is that most of the locations found in the DNA.Land imputed file are also found in one of the files tested at the vendors, and are therefore covered in the combined file run.

I combined my results from three runs (Family Tree DNA, DNA.Land and the combined file) into a common spreadsheet, color coding each run differently, in order to answer two questions:

  • First, I wanted to see the locations reported as “bad” that were actually tested at FTDNA. By comparing the FTDNA locations with the DNA.Land imputed file, we know that DNA.Land was NOT imputing those locations, and conversely, that they WERE imputing the rest of the locations.
  • Second, I wanted to know if locations imputed by DNA.Land and reported as “bad” had been tested by any testing company, and if DNA.Land’s imputation was accurate as compared to an actual test.

You can read more about how Promethease reports results, here.

I’m showing two results in the spreadsheet example, below.

White row = FTDNA test result
Yellow row = DNA.Land result
Blue row = combined test result

These two examples show two mutations that are ranked as “bad” for the same condition. This result really only tells me that I metabolize some things slower than other people. Reading the fine print tells me this as well:

The proportion of slow and rapid metabolizers is known to differ between different ethnic populations. In general, the slow metabolizer phenotype is most prevalent (>80%) in Northern Africans and Scandinavians, and lowest (5%) in Canadian Eskimos and Japanese. Intermediate frequencies are seen in Chinese populations (around 20% slow metabolizers), whereas 40 – 60% of African-Americans and most non-Scandinavian Caucasians are slow metabolizers.[PMID 16416399]

Many of you are probably slow metabolizers too.

I used this example to illustrate that not everything that is “bad” is going to keep you awake at night.

The first mutation, gs140, is found in the DNA.Land file, but there is no corresponding white row representing the original Family Tree DNA report, meaning that DNA.Land imputed the result. However, gs140 is tested by some vendor in the combined file. The results do match (verified by actually comparing the results individually) and therefore the DNA.Land imputation was accurate, as noted in the DNA.Land Analysis column at far right.

In the second example, gs154 is reported by DNA.Land, but since it’s also reported by Family Tree DNA in the white row, we know that this value was NOT imputed by DNA.Land, because this was part of the originally uploaded file. Therefore, in the Analysis column, I labeled this result as “tested at FTDNA.”

Analysis

I analyzed each of the rows of “bad” results found in the DNA.Land file by comparing them first to the FTDNA file and then the Combined file. In some cases, I needed to return to the various vendor results to see which vendor had done the testing on a specific location in order to verify the result from the individual run.
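The row-by-row check described above can be sketched in a few lines. This is only an illustration of the logic I followed by hand in the spreadsheet, not a tool I actually ran. The function name, the labels and the genotype strings are all hypothetical, and the sketch assumes the genotypes have already been normalized so that the same call is written the same way in every file.

```python
from typing import List, Optional


def classify_result(dnaland: str,
                    ftdna: Optional[str],
                    vendor_calls: List[str]) -> str:
    """Label one DNA.Land "bad" result the way the Analysis column does.

    dnaland      -- genotype reported in the DNA.Land run
    ftdna        -- genotype in the uploaded FTDNA file, or None if FTDNA
                    did not test this location
    vendor_calls -- genotypes from the tested vendor files that cover this
                    location (empty list if nobody tested it)
    """
    if ftdna is not None:
        # Present in the uploaded file, so DNA.Land did not impute it.
        return "tested at FTDNA"
    if not vendor_calls:
        # Imputed, and no vendor tested the location.
        return "imputed, not tested elsewhere"
    if len(set(vendor_calls)) > 1:
        # The tested vendors disagree with each other.
        return "conflict"
    if dnaland == vendor_calls[0]:
        return "imputed correctly"
    return "imputed incorrectly"


# Made-up genotypes, purely for illustration.
print(classify_result("AG", None, ["AG", "AG"]))   # imputed correctly
print(classify_result("AG", "AG", ["AG"]))         # tested at FTDNA
```

The one case this does not cover is a location that FTDNA tested but that is missing from the DNA.Land report, because the sketch starts from the rows DNA.Land did report.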

So, how did DNA.Land do with imputing data as compared with actual tested results?

Category | # Results | % | Comment
Tested, not Imputed | 171 | 38.6 | This “bad” location was tested at FTDNA and uploaded, so we know it was reported accurately at DNA.Land and not imputed.
Total Imputed* | 272 | 61.4 | Meaning total of “bad” results not tested at FTDNA, so not uploaded to DNA.Land, therefore imputed.
Imputed Correctly | 259 | 95.22 | This result was verified to match a tested location in the combined run.
Imputed, but not tested elsewhere | 6 | 2.21 | Accuracy cannot be confirmed.
Conflict | 3 | 1.10 | DNA.Land results cannot be verified due to an error of some sort – two of these three are probably accurate.
Imputed Incorrectly | 4 | 1.47 | Confirmed by the combined run where the location was actually tested at multiple vendors.
Not reported, and should have been | 1 | 0.37 | 4 other vendor tests showed this mutation, including FTDNA which was uploaded to DNA.Land. Therefore this location should have been reported in the DNA.Land file.

*The total number of “bad” results was 443, 171 that were tested and 272 that were imputed. Note that the percentages of imputations shown below the “Total Imputed” number of 272 are calculated based on the number of locations imputed, not on the total number of locations reported.
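For anyone who wants to check my arithmetic, here is a minimal sketch that recomputes the percentages from the counts above. The counts come straight from my results; the variable names and the small `pct` helper are just for illustration.

```python
# Counts of "bad" results taken from the results above.
tested = 171
imputed_breakdown = {
    "Imputed correctly": 259,
    "Imputed, but not tested elsewhere": 6,
    "Conflict": 3,
    "Imputed incorrectly": 4,
}
not_reported = 1  # tracked separately, not part of the 272

total_imputed = sum(imputed_breakdown.values())  # 272
total_bad = tested + total_imputed               # 443


def pct(part: int, whole: int, digits: int = 2) -> float:
    """Percentage of `whole` represented by `part`, rounded."""
    return round(100 * part / whole, digits)


# The first two rows are percentages of all 443 "bad" results.
print("Tested, not imputed:", pct(tested, total_bad, 1))
print("Total imputed:", pct(total_imputed, total_bad, 1))

# The remaining rows are percentages of the 272 imputed results only.
for label, count in imputed_breakdown.items():
    print(label + ":", pct(count, total_imputed))
print("Not reported:", pct(not_reported, total_imputed))
```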

Concerns, Conflicts and Errors

It’s worth noting that my highest imputed “bad” risk from DNA.Land was not tested elsewhere, so cannot be verified, which concerns me.

On the three results where a conflict exists, all 3 locations were tested at multiple other vendors, and those vendors’ actual test results differ from each other, which means that the DNA.Land result cannot be verified as accurate. Clearly, an error exists in at least one of the other tests.

In one conflict case, this error has occurred at 23andMe on either their V3 or V4 chip, where the results do not match each other.

In a second conflict case, two of the other vendors agree and the DNA.Land imputation is likely accurate, as it matches 2 of the three other vendor tests.

In the third conflict case, the Ancestry V2 test confirms one of the 23andMe results, which matches the DNA.Land results, so the DNA.Land result is likely accurate.

Of the 4 results that were confirmed to be imputed incorrectly, all locations were tested at multiple vendors. In two cases, the location was confirmed on two other tests and in the other two cases, the location was tested at three vendors. The testing vendors’ results all matched each other.

Summary

Overall, given the problems found with both DNA.Land and MyHeritage, who both impute, relative to genetic genealogy matching, I was surprised to find that the DNA.Land imputed health results were relatively accurate.

I expected the locations reported in the FTDNA file to be reported accurately by DNA.Land, because that data was provided to them. In one case, it was not.

Of the 272 “bad” results imputed, 259, or 95.22% could be verified as accurate.

Six could not be verified, and three were in conflict, but of those, it’s likely that two of the three were imputed accurately by DNA.Land. The third can’t be verified. This totals 3.31% of the imputed results that are ambiguous.

Only 1.47% were imputed incorrectly. If you add the 0.37% for the location that was not reported and should have been, and make the leap of assumption that one of the three results in conflict is also in error, DNA.Land still has just over a 2% confirmed error rate.
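The two summary figures, the ambiguous percentage and the worst-case error rate, can be reproduced with a quick sketch. The counts are from my results; treating one of the three conflicts as an error is the same leap of assumption I made in the text, and the variable names are only illustrative.

```python
total_imputed = 272

not_tested_elsewhere = 6    # imputed, cannot be verified
conflicts = 3               # tested vendors disagree with each other
imputed_incorrectly = 4     # confirmed wrong
not_reported = 1            # tested at FTDNA but missing from DNA.Land
conflicts_assumed_wrong = 1 # assumption: one of the three conflicts is an error

# Results whose accuracy is ambiguous.
ambiguous_pct = 100 * (not_tested_elsewhere + conflicts) / total_imputed

# Confirmed errors plus the two "benefit of the doubt" additions.
error_pct = 100 * (imputed_incorrectly + not_reported
                   + conflicts_assumed_wrong) / total_imputed

print(f"Ambiguous: {ambiguous_pct:.2f}%")
print(f"Error rate: {error_pct:.2f}%")
```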

I can see why Illumina would represent to the vendors that imputation technology is “very accurate.” “Very” of course is relative, pardon the pun, in genetic genealogy, to how well matching occurs, not only when the new GSA chip is compared to another GSA chip, but when the new GSA version is compared to the older OmniExpress version. For backward compatibility between the chip versions, imputation must be utilized. Thanks a lot Illumina (said in my teenage sarcastic voice).

Since DNA.Land accepts files from all the vendors on all chips, for DNA.Land to be able to compare all locations in all vendors’ files against each other, the “missing” data in each file must be imputed. MyHeritage is doing something similar (having hired one of the DNA.Land developers), and both vendors have problems with genetic genealogy matching.

This raises the question of why the matching is demonstrably so poor for genetic genealogy. I’ve written about this phenomenon here, Kitty Cooper wrote about it here and Leah Larkin here.

Based on this comparison, each individual DNA.Land imputed file would contain about a 2% error rate of incorrectly imputed data, assuming the error rate is the same across the entire file, so a combined total of 4% for two individuals, if you’re just looking at individual SNPs. Perhaps entire segments are being imputed incorrectly, given that we know that DNA is inherited in segments. If that is the case, and these individual SNPs are simply small parts of entire segments that are imputed incorrectly, they might account for an equal number of false positive matches. In other words, if 10 segments are imputed incorrectly for me, that’s 10 segments reporting false positive matches I’ll have when paired against anyone who receives the same imputed data. However, that doesn’t explain the matches that are legitimate (on tested segments) and aren’t found by the imputing vendors, and it doesn’t explain an erroneous match rate that appears to be significantly higher than the 2-4% found in this comparison.
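As a rough check on the “4% for two individuals” figure, here is a small sketch. It assumes that errors in the two files are independent of each other at any given SNP, which is exactly the assumption that may not hold if whole segments are imputed incorrectly for many people at once. The 2% figure is the approximate rate from this comparison.

```python
per_file_error = 0.02  # approximate per-file error rate from this comparison

# Simple sum, as used in the text.
simple_sum = 2 * per_file_error

# Chance that at least one of the two files is wrong at a given SNP,
# ASSUMING the two files' errors are independent.
either_wrong = 1 - (1 - per_file_error) ** 2

print(f"Simple sum: {simple_sum:.2%}")
print(f"At least one wrong (independent errors): {either_wrong:.2%}")
```

The two numbers are nearly identical at error rates this small, so the simple sum is a reasonable shortcut for individual SNPs.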

I’ll be writing about the DNA.Land matching comparison experience shortly.

I would strongly prefer that medical research be performed on fully tested individuals. I realize that encouraging consumers to upload their data, and then imputing additional information, is much less expensive than actual testing. However, accuracy is an issue, and a 2% error rate, if someone is dealing with life-saving and life-threatening research, could be a huge margin of error from the beginning of the project, based on faulty imputation, which could be eliminated by simply testing people. This seems like an unnecessary risk and faulty research just waiting to happen. This error rate is on top of the actual sequencing error rate, but sequencing errors will be found in different locations in individuals, not on the same imputed segment assigned to multiple people in population groups. Imputation errors could be cumulative in one location, appearing as a hot spot when in reality, it’s an imputation error.

As related to genetic genealogy, I don’t think imputation and genetic genealogy are good bedfellows. DNA.Land’s matching was even worse when it was initially introduced, which is one reason I’ve waited so long to upload and write about the service.

Unfortunately, with Illumina obsoleting the OmniExpress chip, we’re not going to have a choice, sooner rather than later. All vendors who utilized the OmniExpress chip are being forced off, either onto the GSA chip or to an Exome or full sequence chip. The cost of sequencing for anything other than the GSA chip is simply more than the genetic genealogy market will stand, not to mention even larger compatibility issues. My Genos Exome test cost $499 just a few months ago and still sells for that price today.

The good news is that utilizing imputation, we will still receive matches, just less accurate matches when comparing the new chip to older versions, and when using imputation.

New testers will never know the difference. Testers not paying close attention won’t notice or won’t realize either. That leaves the rest of us “old timers” who want increased accuracy and specificity, not less, flapping in the wind along with the vendors who don’t sell our test results into the medical arena and have no reason to move to the new GSA platform other than Illumina obsoleting the OmniExpress chip.

Like I said, thanks Illumina.

Imputation Matching Comparison

In a future article, I’ll be writing about the process of uploading files to DNA.Land and the user experience, but in this article, I want to discuss only one topic, and that’s the results of imputation as it affects matching for genetic genealogy. DNA.Land is one of three companies known positively to be using imputation (DNA.Land, MyHeritage and LivingDNA), and one of two that allow transfers and do matching for genealogy.

This is the second in a series of three articles about imputation.

Imputation, discussed in the article, Concepts – Imputation, is the process whereby your DNA that is tested is then “expanded” by inferring results you don’t have, meaning locations that haven’t been tested, by using information from results you do have. Vendors have no choice in this matter, as Illumina, the maker of the DNA chip widely utilized in the genetic genealogy marketspace, has obsoleted the prior chip and moved to a new chip with only about 20% overlap in the locations previously tested. Imputation is the methodology utilized to attempt to bridge the gap between the two chips for genetic genealogy matching and ethnicity predictions.

Imputation is built upon two premises:

1 – that DNA locations are inherited together

2 – that people from common populations share a significant amount of the same DNA

An example of imputation that DNA.Land provides is the following sentence.

I saw a blue ca_ on your head.

There are several letters that are more likely than others to be found in the blank, and some words would be more likely to be found in this sentence than others.

A less intuitive sentence might be:

I saw a blue ca_ yesterday.
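To make the sentence analogy concrete, here is a toy sketch of the idea. It is not DNA.Land’s algorithm, which works on DNA from reference populations with far more sophisticated statistics. The “reference panel” of sentences and the `impute_word` function are made up purely for illustration.

```python
from collections import Counter

# Hypothetical "reference panel": complete sentences seen in other people.
reference_panel = [
    "i saw a blue cap on your head",
    "i saw a blue cap on your head",
    "i saw a blue cat on your head",
    "i saw a blue car yesterday",
    "i saw a blue car yesterday",
    "i saw a blue cat yesterday",
    "i saw a blue cab yesterday",
]


def impute_word(before: str, after: str) -> list:
    """Rank candidate words for the blank between `before` and `after`.

    Returns (word, fraction of matching panel sentences) pairs,
    most frequent first.
    """
    counts = Counter()
    for sentence in reference_panel:
        if sentence.startswith(before) and sentence.endswith(after):
            middle = sentence[len(before):len(sentence) - len(after)].strip()
            counts[middle] += 1
    total = sum(counts.values())
    return [(word, n / total) for word, n in counts.most_common()]


# With "on your head" as context, one word clearly dominates.
print(impute_word("i saw a blue", "on your head"))
# With "yesterday" as context, the best guess is much less certain.
print(impute_word("i saw a blue", "yesterday"))
```

The more informative the surrounding context, the more confident the guess, but the answer is still a guess based on what other people’s “sentences” look like.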

DNA.Land doesn’t perform DNA testing, but instead takes a file that you upload from a testing vendor that has around 700,000 locations and imputes another 38.3 million variants, or locations, based on what other people carry in neighboring locations. These numbers are found in the SNPedia instructions for uploading DNA.Land information to their system for usage with Promethease.

I originally wrote about Promethease here, and I’ll be publishing an updated article shortly.

In this article, I want to see how imputation affects matching between people for genetic genealogy purposes.

Genetic Genealogy Matching

In order to be able to do an apples to apples comparison, I uploaded my Family Tree DNA autosomal file to DNA.Land.

DNA.Land then processed my file, imputed additional values, then showed me my matches to other people who have also uploaded and had additional locations imputed.

DNA.Land has just over 60,000 uploads in their database today. Of those, I match 11 at a high confidence level and one at a speculative level.

My best match, meaning my closest match, Karen, just happened to have used her GedMatch kit number for her middle name. Smart lady!

Karen’s GedMatch number provided me with the opportunity to compare our actual match information at DNA.Land, then also at GedMatch, then compare the two different match results in order to see how much of our matching was “real” from portions of our tested kits that actually match, and what portion of our DNA matches as a result of the DNA.Land imputation.

At DNA.Land, your match information is presented with the following information:

  • Relationship degree – meaning estimated relationship
  • # shared segments – although many of these are extremely small
  • Total shared cM
  • Total recent shared length in cM
  • Longest recent shared segment in cM
  • Relationship likelihood graph
  • Shared segments plotted on chromosome display
  • Shared segments in a table


DNA.Land provides what they believe to be an accurate estimate of recent and anciently shared DNA segments.

The match table is a dropdown underneath the chromosome graphic at far right:

For this experiment, I copied the information from the match table and dropped it into a spreadsheet.

DNALand Match Locations

My match information is shown at DNA.Land with Karen as follows:

Matching segments are identified by DNA.Land as either recent or ancient, which I find to be over-simplified at best and misleading or inaccurate at worst. I guess it depends on how you perceive recent and ancient. I think they are trying to convey the concept that larger segments tend to be more recent, and smaller segments tend to be older, but ancient in the genetics field often refers to DNA extracted from exhumed burials from thousands of years ago. Furthermore, smaller segments can be descended from the same ancestor as larger segments.

GedMatch Match

Since Karen so kindly provided her GedMatch kit number, I signed in to GedMatch and did a one-to-one match with this same kit.

Since all of the segments are 3 cM and over at DNA.Land, I utilized a GedMatch threshold of 3 cM and dropped the SNP count to 100, since a SNP count of 300 gave me few matches. For this comparison, I wanted to see all my matches to Karen, no matter how few SNPs are involved, in an attempt to obtain results similar to DNA.Land. I normally would not drop either of these thresholds this low. My typical minimum is 5cM and 500 SNPs, and even if I drop to 3cM, I still maintain the 500 SNP threshold.
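The effect of these thresholds is easy to picture with a small sketch. GedMatch applies the thresholds for you on its one-to-one comparison page, so this is only an illustration of the filtering idea. The `Segment` class, the `passes` function and the three segments are hypothetical, and the numbers are made up, not my actual match data.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    chromosome: int
    cm: float   # segment length in centimorgans
    snps: int   # number of SNPs in the segment


def passes(seg: Segment, min_cm: float = 5.0, min_snps: int = 500) -> bool:
    """True if the segment meets both thresholds (defaults: my usual minimums)."""
    return seg.cm >= min_cm and seg.snps >= min_snps


# Made-up segments for illustration only.
segments = [
    Segment(chromosome=8, cm=31.5, snps=6200),
    Segment(chromosome=2, cm=3.4, snps=350),
    Segment(chromosome=11, cm=3.1, snps=120),
]

for min_cm, min_snps in [(3, 100), (3, 300), (5, 500)]:
    kept = [s for s in segments if passes(s, min_cm, min_snps)]
    print(f"{min_cm} cM / {min_snps} SNPs: {len(kept)} segment(s) kept")
```

Lowering either threshold lets in more small segments, which is why I normally would not go this low.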

Let’s see how the data from GedMatch and DNA.Land compares.

In my spreadsheet, below, I pasted the segment match information from DNA.Land in the first 5 columns with a red header. Note that DNA.Land does not provide the number of shared SNPs.

At right, I pasted the match information from GedMatch, with a green header. We know that GedMatch has a history of accurately comparing segments, and we can do a cross platform comparison. I originally uploaded my FTDNA file to DNA.Land and Karen uploaded an Ancestry file. Those are the two files I compared at GedMatch, because the same actual matching locations are being compared at both vendors, DNA.Land (in addition to imputed regions) and GedMatch.

I then copied the matching segments from GedMatch (3cM, 100 SNPs threshold) and placed them in the middle columns in the same row where they matched corresponding DNA.Land segments. If any portion of the two vendors’ segments overlapped, I copied them as a match, although two are small and partial and one is almost negligible. As you can see, there are only 10 segments with any overlap at all in the center section. Please note that I am NOT suggesting these are valid or real matches. At this point, it’s only a math/match exercise, not an analysis.
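I did this pairing by hand in the spreadsheet, but the “any overlap counts” rule can be sketched as follows. The `Seg` type, the function names and the positions are all hypothetical and only meant to show the logic: two segments pair up if they are on the same chromosome and share any stretch at all.

```python
from typing import List, NamedTuple, Optional, Tuple


class Seg(NamedTuple):
    chromosome: int
    start: int  # start position
    end: int    # end position


def overlap_bp(a: Seg, b: Seg) -> int:
    """Length of the shared stretch between two segments (0 if none)."""
    if a.chromosome != b.chromosome:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def match_segments(dnaland: List[Seg],
                   gedmatch: List[Seg]) -> List[Tuple[Seg, Optional[Seg]]]:
    """Pair each DNA.Land segment with the first GedMatch segment it overlaps."""
    pairs = []
    for d in dnaland:
        partner = next((g for g in gedmatch if overlap_bp(d, g) > 0), None)
        pairs.append((d, partner))
    return pairs


# Made-up positions: one long GedMatch segment that DNA.Land reports as two
# pieces, plus a DNA.Land segment with no GedMatch counterpart.
dnaland_segments = [Seg(8, 100, 200), Seg(8, 200, 350), Seg(3, 50, 80)]
gedmatch_segments = [Seg(8, 90, 360)]

for d, g in match_segments(dnaland_segments, gedmatch_segments):
    print(d, "->", g if g is not None else "no overlap at GedMatch")
```

Note that this rule is deliberately generous. Even a sliver of overlap counts as a pairing, so it says nothing about whether a match is real.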

The match comparison column (yellow header) is where I commented on the match itself. In some cases, the lack of the number of SNPs at DNA.Land was detrimental to understanding which vendor was a higher match. Therefore, when possible, I marked the higher vendor in the Match Comparison column with the color of their corresponding header.

Analysis

Frankly, I was shocked at the lack of matching between GedMatch and DNA.Land. Trying to understand the discrepancy, I decided to look at the matches between Karen, who has been very helpful, and me at other vendors.

I then looked at our matches at Ancestry, 23andMe, MyHeritage and at Family Tree DNA.

The best comparison would be at Family Tree DNA where Karen loaded her Ancestry file.  Therefore, I’m comparing apples to apples, meaning equivalent to the comparison at GedMatch and DNA.Land (before imputation).

It’s impossible to tell much without a chromosome browser at Ancestry, especially after Timber processing which reduces matching DNA.

DNA.Land categorized my match to Karen as “high certainty.” My match with Karen appears to be a valid match based on the longest segment(s) of approximately 30cM on chromosome 8.

  • Of the 4 segments that DNA.Land identifies as “recent” matches, 2 are not reflected at all in the GedMatch or Family Tree DNA matching, suggesting that these regions were imputed entirely, and incorrectly.
  • Of the 4 segments that DNA.Land identifies as “recent” matches, the 2 on chromosome 8 are actually one segment that imputation apparently divided. According to DNA.Land, imputation can increase the number of matching segments. I don’t think it should break existing segments, meaning segments actually tested, into multiple pieces. In any event, the two vendors do agree on this match, even though DNA.Land breaks the matching segment into two pieces where GedMatch and Family Tree DNA do not. I’m presuming (I hate that word) that this is the one segment that Ancestry calls as a match as well, because it’s the longest, but Ancestry’s Timber algorithm downgrades the match portion of that segment to 18cM, removing 11cM from the 29cM reported by DNA.Land, or 13cM from the 31cM reported by both GedMatch and Family Tree DNA. Both GedMatch and Family Tree DNA agree and appear to be accurate at 31cM.
  • Of the total 39 matching segments of any size, utilizing the 3cM threshold and 100 SNPs, which I set artificially very low, GedMatch only found 10 matching segments with any portion of the segment in common, meaning that at least 29 were entirely erroneous matches.
  • Resetting the GedMatch match threshold to 3 cM and 300 SNPs, a more reasonable SNP threshold for 3cM, GedMatch only reports 3 matching segments, one of which is chromosome 8 (undivided). Because DNA.Land splits chromosome 8 into two segments, those 3 GedMatch segments correspond to 4 DNA.Land segments, which means at this threshold, 35 of the 39 matching DNA.Land segments are entirely erroneous. Setting the threshold to a more reasonable 5cM or 7cM and 500 SNPs would result in only the one match on chromosome 8.

  • If 29 of 39 segments (at 3cM 100 SNPs) are erroneously reported, that equates to 74.36% erroneous matches due to imputation alone, without considering identical by chance (IBC) matches.
  • If 35 of 39 segments (at 3cM 300 SNPs) are erroneously reported, that equates to 89.74% erroneous matches, again without considering those that might be IBC.
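The two percentages above are simple division, shown here as a short sketch so the numbers can be checked. The segment counts are from my comparison; the variable names are just for illustration.

```python
total_dnaland_segments = 39

# At 3 cM / 100 SNPs, GedMatch overlapped 10 DNA.Land segments in any way.
overlapping_at_100_snps = 10
erroneous_at_100_snps = total_dnaland_segments - overlapping_at_100_snps  # 29

# At 3 cM / 300 SNPs, 35 DNA.Land segments have no GedMatch counterpart.
erroneous_at_300_snps = 35

for label, wrong in [("3 cM / 100 SNPs", erroneous_at_100_snps),
                     ("3 cM / 300 SNPs", erroneous_at_300_snps)]:
    share = 100 * wrong / total_dnaland_segments
    print(f"{label}: {wrong} of {total_dnaland_segments} = {share:.2f}%")
```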

Predicted vs Actual

One additional piece of information that I gathered during this process is the predicted relationship.

Vendor | Total cM | Total Segments | Longest Segment | Predicted Relationship
DNA.Land | 162 to 3 cM | 39 to 3 cM | 17.3 & 12, split | 3C
GedMatch | 123 to 3 cM | 27 to 3 cM | 31.5 | 5.1 gen distant
Family Tree DNA | 40 to 1 cM | 12 to 1 cM | 32 | 3-5C
MyHeritage | No match | No match | No match | No match
Ancestry | 18.1 | 1 | 18.1 | 5-8C
23andMe | 26 | 1 | 26 | 3-6C

Karen utilized her Ancestry file and I used my Family Tree DNA file for all of the above matching except at 23andMe and Ancestry where we are both tested on the vendors’ platform. Neither 23andMe nor Ancestry accept uploads. I included the 23andMe and Ancestry comparisons as additional reference points.

The lack of a match at MyHeritage, another company that implements imputation, is quite interesting. Karen and I, even with a significantly sized segment are not shown as a match at MyHeritage.

If imputation actually breaks some matching segments apart, like the chromosome 8 segment at DNA.Land, it’s possible that the resulting smaller individual segments simply didn’t exceed the MyHeritage matching threshold. It would appear that the MyHeritage matching threshold is probably 9cM, given that my smallest segment match of all my matches at MyHeritage is 9cM. Therefore, a 31 or 32 cM segment would have to be broken into 4 roughly equally sized pieces (32/4=8) for the match to Karen not to be detected because all segment pieces are under 9cM. MyHeritage has experienced unreliable matching since their rollout in mid 2016, so their issue may or may not be imputation related.
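The “four pieces” reasoning can be checked with a quick sketch. It assumes, as I did above, that the MyHeritage threshold is 9cM (my inference, not a published figure) and that the segment would be broken into equal pieces. The function name is hypothetical.

```python
import math


def min_pieces_below(segment_cm: float, threshold_cm: float) -> int:
    """Smallest number of equal pieces so every piece is under the threshold."""
    # Each piece is segment_cm / n, which must be strictly less than the
    # threshold, so n must be strictly greater than segment_cm / threshold_cm.
    return math.floor(segment_cm / threshold_cm) + 1


for length in (31, 32):
    pieces = min_pieces_below(length, 9)
    print(f"{length} cM needs {pieces} pieces of {length / pieces:.2f} cM each")
```

Three pieces would not be enough, since each would still be over 10cM and should be detected.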

The Common Ancestor

At Family Tree DNA, Karen does not match my mother, so I can tell positively that she is related through my father’s line. She and I triangulate on our common segment with three other individuals who descend from Abraham Estes 1647-1720.

Utilizing the chromosome browser, we do indeed match on chromosome 8 on a long segment, which is also our only match over 5cM at Family Tree DNA.

Based on our trees as well as the trees of our three triangulated Estes matches, Karen and I are most probably either 8th cousins, or 8th cousins once removed, assuming that is our only common line. I am 8th cousins with the other three triangulated matches on chromosome 8. Karen’s line has yet to be proven.

Imputation Matching Summary

I like the way that DNA.Land presents some of their features, but as for matching accuracy, you can view the match quality in various ways:

  1. DNA.Land did find the large match on chromosome 8. Of course, in terms of matching, that’s pretty difficult to miss at roughly 30cM, although MyHeritage managed. Imputation did split the large match into two, somehow, even though Karen and I match on that same segment as one segment at other vendors comparing the same files.
  2. Of the 39 DNA.Land total matches, other than the chromosome 8 match, two other matches are partial matches, according to GedMatch. Both are under 7cM.
  3. Of DNA.Land’s total 39 matches, 35 are entirely wrong, in addition to the two that are split, including two inaccurate imputed matches at over 5cM.
  4. At DNA.Land, I’m not so concerned about discerning between “real” and “false” small segment matches, as compared to both FTDNA and GedMatch, as I am about incorrectly imputed segments and matches. Whether small matches in general are false positives or legitimate can be debated, each smaller segment match based on its own merits. Truthfully, with larger segments to deal with, I tend to ignore smaller segments anyway, at least initially. However, imputation adds another layer of uncertainty on top of actual matching, especially, it appears, with smaller matches. Imputing entire segments of incorrect DNA concerns me.
  5. Having said that, I find it very concerning that MyHeritage, who also utilizes imputation, missed a significant match of over 30cM. I don’t know of a match of this size that has ever been proven to be a false match (through parental phasing), and in this case, we know which ancestor this segment descends from through independent verification utilizing multiple other matches. MyHeritage should have found that match, regardless of imputation, because that match is from portions of the two files that were both tested, not imputed.

Summary

To date, I’m not impressed with imputation matching relative to genetic genealogy at either DNA.Land or MyHeritage.

In one case, that of DNA.Land, imputation shows matches for segments that are not shown as matches at either Family Tree DNA or GedMatch who are comparing the same two testers’ files, but without imputation. Since DNA.Land did find the larger segment, and many of their smaller segments are simply wrong, I would suggest that perhaps they should only show larger segments. Of course, anyone who finds DNA.Land is probably an experienced genetic genealogist and probably already has files at both GedMatch and Family Tree DNA, so hopefully savvy enough to realize there are issues with DNA.Land’s matching.

In the second imputation case, that of MyHeritage, the match with Karen is missed entirely, although that may not be a function of imputation. It’s hard to determine. MyHeritage is also comparing the same two files uploaded by Karen and me to the other vendors who found that match, both vendors who do and don’t utilize imputation.

Regardless of imputing additional locations, MyHeritage should have found the matching segment on chromosome 8 because that region does NOT need to be imputed. Their failure to do so may be a function of their matching routine and not of imputation itself. At this point, it’s impossible to discern the cause. We only know, based on matching at other vendors, that the non-match at MyHeritage is inaccurate.

Here’s what DNA.Land has to say about the imputed VCF file, which holds all of your imputed values, when you download the file. They pull no punches about imputation.

“Noisey and probabilistic.” Yes, I’d say they are right, and problematic as well, at least for genetic genealogists.

Extrapolating this even further, I find it more than a little frightening that my imputed data at DNA.Land will be utilized for medical research.

Quoting now from Promethease, a medical reference site that allows the consumer to upload their raw data files, providing consumers with a list of SNPs having either positive or negative research in academic literature:

DNA.land will take a person’s data as produced by such companies and impute additional variants based on population frequency statistics. To put this in concrete terms, a person uploading a typical 23andMe file of ~700,000 variants to DNA.land will get back an (imputed) file of ~39 million variants, all predicted to be present in the person. Promethease reports from such imputed files typically contain about 50% more information (i.e. 50% more genotypes) than the corresponding reports from raw (non-imputed) data.

Translated, this means that your imputed data provides about half again as much “genetic information” in a Promethease report as your actual tested data. The question remains, of course, how much of this imputed data is accurate.

That will be the topic of the third imputation article. Stay tuned.

_____________________________________________________________________

Standard Disclosure

This standard disclosure appears at the bottom of every article in compliance with the FTC Guidelines.

Hot links are provided to Family Tree DNA, where appropriate. If you wish to purchase one of their products, and you click through one of the links in an article to Family Tree DNA, or on the sidebar of this blog, I receive a small contribution if you make a purchase. Clicking through the link does not affect the price you pay. This affiliate relationship helps to keep this publication, with more than 850 articles about all aspects of genetic genealogy, free for everyone.

I do not accept sponsorship for this blog, nor do I write paid articles, nor do I accept contributions of any type from any vendor in order to review any product, etc. In fact, I pay a premium price to prevent ads from appearing on this blog.

When reviewing products, in most cases, I pay the same price and order in the same way as any other consumer. If not, I state very clearly in the article any special consideration received. In other words, you are reading my opinions as a long-time consumer and consultant in the genetic genealogy field.

I will never link to a product about which I have reservations or qualms, either about the product or about the company offering the product. I only recommend products that I use myself and bring value to the genetic genealogy community. If you wonder why there aren’t more links, that’s why and that’s my commitment to you.

Thank you for your readership, your ongoing support and for purchasing through the affiliate link if you are interested in making a purchase at Family Tree DNA.

Concepts – Imputation

Until recently, the word imputation wasn’t a part of the vocabulary of genetic genealogy, but earlier this year, it became a factor and will become even more important in coming months.

Illumina, the company that provides chips to companies that test autosomal DNA for genetic genealogy, has obsoleted their OmniExpress chip previously in use, forcing companies to utilize their new Global Screening Array (GSA) chip when their current chip supply runs out.

Only about 20% of the DNA locations previously tested by genetic genealogy companies are tested on this new platform. Illumina has encouraged vendors to utilize the process called imputation to infer DNA results that are common in populations, but have not been directly tested in the customer’s DNA, in order for vendors to achieve backwards compatibility with people previously tested on the OmniExpress chip. You can read the technical details of imputation in a document produced by Illumina here.

LivingDNA, who was developing and launching a new product during the transition time between chips, was the first vendor out of the gate with a GSA product. Illumina represented imputation to be “very accurate” to LivingDNA, which is consequently how they represented the results to a group of genetic genealogists on a conference call in early 2017. LivingDNA was the lucky company to have the opportunity to “work the bugs out” with Illumina – said with tongue firmly in cheek. LivingDNA provides a list of papers describing their methods here.

Another company, MyHeritage, also uses imputation, but for an entirely different reason. MyHeritage uses imputation to “add” to the DNA results of people who upload results from different vendors. They are the first company to attempt DNA matching between people using imputation, and they initially had and continue to have matching issues. In their initial release blog in September 2016, they state that imputation matching “is accomplished with very high accuracy.” In their Q&A blog in November 2016, they state that “imputation may introduce errors so we are in the process of fine-tuning it.” They have made changes since matching was originally introduced, but they still struggle with matching accuracy, most recently discussed by Leah Larkin in her article, MyHeritage Matching.

DNA.LAND does not perform testing, but is a nonprofit in the health care industry that utilizes imputation for health-related research – imputing approximately 38.3 million locations in addition to the 700,000 locations in customers’ uploaded files. In order to encourage people to upload their test results, DNA.LAND performs matching and ethnicity reporting. Like MyHeritage, their matching results are problematic. DNA.LAND explains about imputation and summarizes by stating that “any reported value should never be taken as-is without further careful analysis.” I will be publishing an article shortly about DNA.LAND.

23andMe, on August 9, 2017, released their V5 product utilizing the new GSA chip. They have not said how they are addressing the imputation challenge and backward compatibility. Several issues have been reported.

As you can see, the genetic genealogy landscape is changing and like it or not, imputation is a part of the new scenery.

What, Exactly, is Imputation?

Imputation is the process whereby your DNA is tested and then the results “expanded” by inferring results for additional locations, meaning locations that haven’t been tested, by using information from results you do have. In other words, the DNA in adjacent locations is predicted, or imputed, by its association with its traveling companions.  In DNA, traveling companions often travel together, but not always.

Imputation is built upon two premises:

1 – that DNA locations are usually inherited together in groups, a phenomenon known as linkage disequilibrium.

2 – that people from common populations share a significant amount of the same DNA

An example that DNA.LAND provides is the following sentence.

I saw a blue ca_ on your head.

There are several letters that are more likely than others to be found in the blank, and some words would be more likely to be found in this sentence than others.

A less intuitive sentence might be:

I saw a blue ca_ yesterday.

DNA.LAND also says very clearly that imputed values can be incorrect. They also state that the values inferred are the common values, not rare mutations, and imputed results are most accurate in Caucasian populations and least accurate in African populations whose DNA is the most variant of any continental group. They caution against using these results for medical diagnosis.
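
To make the two premises concrete, here is a minimal sketch of the idea in Python. This is a toy of my own for illustration, not DNA.LAND’s or any vendor’s actual method, and the five-location reference panel and the `impute` function are hypothetical. The untested location, marked with an underscore, is filled in with whatever value is most common among the reference haplotypes that agree with the tested locations.

```python
from collections import Counter

# Hypothetical reference panel: each string is one haplotype across five
# adjacent locations. Real panels hold thousands of haplotypes.
REFERENCE_PANEL = [
    "ACGTA",
    "ACGTA",
    "ACGTA",
    "ACCTA",  # a rarer variant at the third location
    "TTGCA",
]


def impute(observed, panel):
    """Fill each '_' in `observed` with the most common value at that location
    among panel haplotypes that agree with every tested location."""
    tested = [i for i, value in enumerate(observed) if value != "_"]
    compatible = [h for h in panel if all(h[i] == observed[i] for i in tested)]
    if not compatible:
        # Nothing in the panel agrees, so fall back to the whole panel.
        compatible = panel
    filled = []
    for i, value in enumerate(observed):
        if value != "_":
            filled.append(value)  # tested locations are kept as-is
        else:
            votes = Counter(h[i] for h in compatible)
            filled.append(votes.most_common(1)[0][0])
    return "".join(filled)


print(impute("AC_TA", REFERENCE_PANEL))
```

In this toy panel, a tester who actually carries the rarer ACCTA haplotype would still be handed the common G at the third location, which is the point DNA.LAND makes about imputed values being the common values and not rare mutations.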

SNPedia (Promethease) cautions against using imputed results as well and suggests that files utilizing only tested results, without imputed results, are more accurate.

Why Imputation?

Looking at this Autosomal SNP Comparison Chart, provided by the ISOGG Wiki, you can see the difference in the number of actual common locations tested by the various vendors.

This means that companies that allow uploads from different vendors utilizing widely divergent chip results have to do something in order to successfully compare the disparate files against each other for matching. Using  23andMe as an example, even though they don’t allow uploads from other companies, they have to do something to accommodate matching between the new GSA V5 chip and their earlier V3 and V4 chips.

Imputation Example

Let’s take a look at how imputation is used to “equalize” files uploaded from various vendors that only contain marginal amounts of overlap.

I’m using MyHeritage as an example. Imputation, in this case, is utilized in an attempt to make marginally compatible files more compatible.

The files from the Ancestry V2 kit and the Family Tree DNA kit have only about 382,000 locations in common, meaning about 300,000 locations are not in common. In order to attempt to equalize these and other kits, MyHeritage attempts to use imputation to deduce the DNA that a tester would/should/might have in the missing segments, based on various statistical factors that include the tester’s population and existing DNA.

Please note that for purposes of concept illustration, I have shown all of the common locations, in blue, as contiguous. The common locations are not contiguous, but are scattered across the entire range that each vendor tests.

You can see that the number of imputed locations for matching between two people, shown in tan, is larger than the number of actual matching locations shown in blue. The amount of actual common data being compared is roughly 382,000 of 1,100,000 total locations, or 35%.
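
For anyone who wants to check the arithmetic, here is a quick sketch using only the rounded figures quoted above. The variable names are mine, and the totals are approximations, not exact chip counts.

```python
# Rounded figures from the paragraph above.
common_locations = 382_000    # locations tested by both vendors
total_locations = 1_100_000   # approximate total locations being compared

share_tested = common_locations / total_locations
not_in_common = total_locations - common_locations

print(f"Tested by both vendors: {share_tested:.0%}")
print(f"Not tested by both vendors: {not_in_common:,}")
```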

Stay tuned for an upcoming series of articles about imputation and results in various scenarios.

Autosomal DNA Transfers – Which Companies Accept Which Tests?

Somehow, I missed the announcement that Family Tree DNA now accepts uploads from MyHeritage.

Other people may have missed a few announcements too, or don’t understand the options, so I’ve created a quick and easy reference that shows which testing vendors’ files can be uploaded to which other vendors.

Why Transfer?

Just so that everyone is on the same page, if you test your autosomal DNA at one vendor, Vendor A, some other vendors allow you to download your raw data file from Vendor A and transfer your results to their company, Vendor B.  The transfer to Vendor B is either free or lower cost than testing from scratch.  One site, GedMatch, is not a testing vendor, but is a contribution/subscription comparison site.

Vendor B then processes your DNA file that you imported from Vendor A, and your results are then included in the database of Vendor B, which means that you can obtain your matches to other people in Vendor B’s data base who tested there originally and others who have also transferred.  You can also avail yourself of any other tools that Vendor B provides to their customers.  Tools vary widely between companies.  For example, Family Tree DNA, GedMatch and 23andMe provide chromosome browsers, while Ancestry does not.  All 3 major vendors (Family Tree DNA, Ancestry and 23andMe) have developed unique offerings (of varying quality) to help their customers understand the messages that their unique DNA carries.

Ok, Who Loves Whom?

The vendors in the left column are the vendors performing the autosomal DNA tests. The vendor row (plus GedMatch) across the top indicates who accepts upload transfers from whom, and which file versions. Please consider the notes below the chart.

(Chart updated September 28, 2017)

Please note that on August 9, 2017, 23andMe began processing on the Illumina GSA chip, which is not compatible with earlier versions.  As of late September 2017, only GedMatch accepts their upload, and only in their Genesis sandbox area, not the normal production matching area.  This is due to the small overlap area with existing chips.  You can read more about the GSA chip and its ramifications here.

  • Family Tree DNA accepts uploads from both other major vendors (Ancestry and 23andMe) but the versions that are compatible with the chip used by FTDNA will have more matches at Family Tree DNA. 23andMe V3, Ancestry V1 and MyHeritage results utilize the same chip and format as FTDNA. 23andMe V4 and Ancestry V2 utilize different formats utilizing only about half of the common locations. Family Tree DNA still allows free transfers and comparisons with other testers, but since there are only about half of the same DNA locations in common with the FTDNA chip, matches will be fewer. Additional functions can be unlocked for a one time $19 fee.
  • Neither Ancestry, 23andMe nor Genographic accept transfer data from any other vendors.
  • MyHeritage does accept transfers, although that option is not easy to find. I checked with a MyHeritage representative and they provided me with the following information:  “You can upload an autosomal DNA file from your profile page on MyHeritage. To access your profile page, login to your MyHeritage account, then click on your name which is displayed towards the top right corner of the screen. Click on “My profile”. On the profile page you’ll see a DNA tab, click on the tab and you’ll see a link to upload a file.”  MyHeritage has also indicated that they will be making ethnicity results available to individuals who transfer results into their system in May, 2017.
  • LivingDNA has just released an ethnicity product and does not have DNA matching capability to other testers.  Living DNA imputes DNA locations that they don’t test, but the initial download only includes the DNA locations actually tested.
  • WeGene’s website is in Chinese and they are not a significant player, but I did include them because GedMatch accepts their files. WeGene’s website indicates that they accept 23andMe uploads, but I am unable to determine which version or versions. Given that their terms and conditions and privacy and security information are not in English, I would be extremely hesitant before engaging in business. I would not be comfortable trusting an online translation for this type of document. SNPedia reports that WeGene has data quality issues.
  • GedMatch is not a testing vendor, so has no entry in the left column, but does provide tools and accepts all versions of files from each vendor that provides files, to date, with the exception of the Genographic Project.  GedMatch is free (contribution based) for many features, but does have more advanced functions available for a $10 monthly subscription. The GedMatch Genesis platform is a sandbox area for files from vendors that cannot be put into production today due to matching and compatibility issues.
  • The Genographic Project tested their participants at the Family Tree DNA lab until November 2016, when they moved to the Helix platform, which performs an exome test using a different chip.
  • The Ancestry V2 chip began processing in May 2016.
  • The 23andMe V3 chip began processing in December 2010. The 23andMe V4 chip began processing in November 2013. Their V5 chip began processing on August 9, 2017.

Incompatible Files

Please be aware that vendors that accept different versions of other vendors’ files can only work with the tested locations that are in the files generated by the testing vendors, unless they use a technique called imputation.

For example, Family Tree DNA tests about 700,000 locations which are on the same chip as MyHeritage, 23andMe V3 and Ancestry V1. In the later 23andMe V4 test, the earlier 23andMe V2 and the Ancestry V2 tests, only a portion of the same locations are tested.  The 23andMe V4 and Ancestry V2 chips only test about half of the file locations of the vendors who utilize the Illumina OmniExpress chip, but not the same locations as each other since both the Ancestry V2 and 23andMe V4 chips are custom. 23andMe and Ancestry both changed their chips from the OmniExpress version and replaced genealogically relevant locations with medically relevant locations, creating a custom chip.

Update:  In August 2017, 23andMe introduced their V5 chip which has only about 20% overlap with previous chips.

I know this is confusing, so I’ve created the following chart for chip and test compatibility comparison.

(Chart updated Sept. 28, 2017)

You can easily see why the FTDNA, Ancestry V1, 23andMe V3 and MyHeritage tests are compatible with each other.  They all tested utilizing the same chip.  However, each vendor then applies their own unique matching and ethnicity algorithms to customer results, so your results will vary with each vendor, even when comparing ethnicity predictions or matching the same two individuals to each other.

Apples to Apples to Imputation

It’s difficult for vendors to compare apples to apples with non-compatible files.

I wrote about imputation in the article about MyHeritage, here and also more generally, here. In a nutshell, imputation is a technique used to infer the DNA for locations a vendor doesn’t test (or doesn’t receive in a transfer file from another vendor) based on the location’s neighboring DNA and DNA that is “normally” passed together as a packet.

However, the imputed regions of DNA are not your DNA, and therefore don’t carry your mutations, if any.

I created the following diagram when writing the MyHeritage article to explain the concept of imputation when comparing multiple vendors’ files showing locations tested, overlap and imputed regions. You can click to enlarge the graphic.

Family Tree DNA has chosen not to utilize imputation for transfer files and only compares the actual DNA locations tested and uploaded in vendor files, while MyHeritage has chosen to impute locations for incompatible files. Family Tree DNA produces fewer, but accurate matches for incompatible transfer files.  MyHeritage continues to have matching issues.

MyHeritage may be using imputation for all transfer files to equalize the files to a maximum location count for all vendor files. This is speculation on my part, but is speculation based on the differences in matches from known compatible file versions to known matches at the original vendor and then at MyHeritage.

I compared matches to the same person at MyHeritage, GedMatch, Ancestry and Family Tree DNA. It appears that imputed matches do not consistently compare reliably. I’m not convinced imputation can ever work reliably for genetic genealogy, because we need our own DNA and mutations. Regardless, imputation is in its infancy today and due to the Illumina GSA chip replacing the OmniExpress chip, imputation will be widely used within the industry shortly for backwards compatibility.

To date, two vendors are utilizing imputation. LivingDNA is using imputation with the GSA chip for ethnicity, and MyHeritage for DNA matching.

Summary

Your best results are going to be to test on the platform that the vendor offers, because the vendor’s match and ethnicity algorithms are optimized for their own file formats and DNA locations tested.

That means that if you are transferring an Ancestry V1 file, a 23andMe V3 file or a MyHeritage file, for example, to Family Tree DNA, your matches at Family Tree DNA will be the same as if you tested on the FTDNA platform.  You do not need to retest at Family Tree DNA.

However, if you are transferring an Ancestry V2 file or 23andMe V4 file, you will receive some matches, someplace between one quarter and half as compared to a test run on the vendor’s own chip. For people who can’t be tested again, that’s certainly better than nothing, and cross-chip matching generally picks up the strongest matches because they tend to match in multiple locations. For people who can retest, testing at Family Tree DNA would garner more matches and better ethnicity results for those with 23andMe V2 and V4 tests as well as Ancestry V2 tests.

For absolutely best results, swim in all of the major DNA testing pools, test as many relatives as possible, and test on the vendor’s native chip to obtain the most matches.  After all, without sharing and matching, there is no genetic genealogy!


MyHeritage – Broken Promises and Matching Issues

For additional information and updates to parts of this article, written three months later, please see MyHeritage Ethnicity Results. My concerns about imputed matching, discussed in this original article, remain unchanged, but MyHeritage has honored their original ethnicity report promises for uploaders.

Original Article below:

MyHeritage, now nine months into their DNA foray, so far has proven to be a disappointment. The problems are twofold.

  • MyHeritage has matching issues, combined with absolutely no tools to be able to work with results. Their product certainly doesn’t seem to be ready for prime time.
  • Worse yet, MyHeritage has reneged on a promise made to early uploaders that Ethnicity Reports would be free. MyHeritage used the DNA of the early uploaders to build their matching data base, then changed their mind about providing the promised free ethnicity reports.

In May 2016, MyHeritage began encouraging people to upload their DNA kits from other vendors, specifically those who tested at 23andMe, Ancestry and Family Tree DNA and announced that they would provide a free matching service.

Here is what MyHeritage said about ethnicity reports in that announcement:

myheritage-may-2016

Initially, I saw no matching benefit to uploading, since I’ve already tested at all 3 vendors and there were no additional possible matches, because everyone that uploaded to MyHeritage would also be in the vendor’s data bases where they had tested, not to mention avid genetic genealogists also upload to GedMatch.

Three months later, in September 2016, when MyHeritage actually began DNA matching, they said this about ethnicity testing:

myheritage-sept-2016

An “amazing ethnicity report” for free. Ok, I’m sold. I’ll upload so I’m in line for the “amazing ethnicity report.”

Matching Utilizing Imputation

MyHeritage started DNA matching in September, 2016 and frankly, they had a mess, some of which was sorted out by November when they started selling their own DNA tests, but much of which remains today.

MyHeritage facilitates matching between vendors who test on only a small number of overlapping autosomal locations by utilizing a process called imputation. In a nutshell, imputation is the process of making an “educated guess” as to what your DNA would look like at locations where you haven’t tested. So, yes, MyHeritage fills in your blanks by estimating what your DNA would look like based on population models.

Here’s what MyHeritage says about imputation.

MyHeritage has created and refined the capability to read the DNA data files that you can export from all main vendors and bring them to the same common ground, a process that is called imputation. Thanks to this capability — which is accomplished with very high accuracy —MyHeritage can, for example, successfully match the DNA of an Ancestry customer (utilizing the recent version 2 chip) with the DNA of a 23andMe customer utilizing 23andMe’s current chip, which is their version 4. We can also match either one of them to any Family Tree DNA customer, or match any customers who have used earlier versions of those chips.

Needless to say, when you’re doing matching to other people – you’re looking for mutations that have occurred in the past few generations, which is after all, what defines genetic cousins. Adding in segments of generic DNA results found in populations is not only incorrect, because it’s not your DNA, it also produces erroneous matches, because it’s not your DNA. Additionally, it can’t report real genealogical mutations in those regions that do match, because it’s not your DNA.

Let’s look at a quick example. Let’s say you and another person are both from a common population, say, Caucasian European. Your values at locations 1-100 are imputed to be all As because you’re a member of the Caucasian European population. The next person, to whom you are NOT related, is also a Caucasian European. Because imputation is being used, their values in locations 1-100 are also imputed to be all As. Voila! A match. Except, it’s not real because it’s based on imputed data.
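
Here is that same example as a small Python sketch. It is a deliberately simplified toy: the tested values are made up, the `longest_shared_run` function is hypothetical, and real matching routines compare genotypes and measure segments in centimorgans, not by counting identical letters. It only shows how a block of identical imputed values can manufacture a long “shared” stretch between two unrelated people.

```python
def longest_shared_run(a, b):
    """Length of the longest stretch of consecutive locations where a and b agree."""
    best = current = 0
    for x, y in zip(a, b):
        current = current + 1 if x == y else 0
        best = max(best, current)
    return best


# Made-up tested values for two unrelated people at 10 locations.
person_1_tested = "ACGTACGTAC"
person_2_tested = "AGGAACTTAG"

# Locations 1-100 from the example: imputed as all As for both people
# because both belong to the same population.
imputed_block = "A" * 100

person_1_full = person_1_tested + imputed_block
person_2_full = person_2_tested + imputed_block

print(longest_shared_run(person_1_tested, person_2_tested))  # tested data only
print(longest_shared_run(person_1_full, person_2_full))      # tested plus imputed
```

On the tested locations alone, the longest shared stretch is 2 locations. Once the imputed block is added, it jumps to 100, and every bit of that increase comes from values neither person actually tested.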

Selling Their Own DNA Tests

In November, MyHeritage announced that they are selling their own DNA tests and that they were “now out of beta” for DNA matching. The processing lab is Family Tree DNA, so they are testing the same markers, but MyHeritage is providing the analysis and matching. This means that the results you see, as a customer, have nothing in common with the results at Family Tree DNA. The only common factor is the processing lab for the raw DNA data.

Because MyHeritage is a subscription genealogy company that is not America-centric, they have the potential to appeal to testers in Europe that don’t subscribe to Ancestry and perhaps wouldn’t consider DNA testing at all if it wasn’t tied to the company they research through.

Clearly, without the autosomal DNA files of people who uploaded from May to November 2016, MyHeritage would have had no data base to compare their own tests to. Without a matching data base, DNA testing is pointless and useless.

In essence, those of us who uploaded our data files allowed MyHeritage to use our files to build their data base, so they could profitably sell kits with something to compare results to – in exchange for that promised “amazing ethnicity report.” At that time, there was no other draw for uploaders.

We didn’t know, before November, when MyHeritage began selling their own tests, that there would ever be any possibility of matching someone who had not tested at the Big 3. So for early uploaders, the draw wasn’t matching, because that could clearly be done elsewhere, without imputation. The draw was that “amazing ethnicity report” for free.

No Free Ethnicity Reports

In November, when MyHeritage announced that they were selling their own kits, they appeared to be backpedaling on the free ethnicity report for early uploaders and said the following:

myheritage-nov-2016

Sure enough, today, even for early uploaders who were promised the ethnicity report for free, in order to receive ethnicity estimates, you must purchase a new test. And by the way, I’m a MyHeritage subscriber to the tune of $99.94 in 2016 for a Premium Plus Membership, so it’s not like they aren’t getting anything from me. Irrespective of that, a promise is a promise.

Bait and Renege

When MyHeritage needed our kits to build their data base, they were very accommodating and promised an “amazing ethnicity report” for free. When they actually produced the ethnicity report as part of their product offering, they required those same people whose kits they used to build their data base to purchase a brand new test, from them, for $79.

Frankly, this is unconscionable. It’s not only unethical, their change of direction takes advantage of the good will of the genetic genealogy community. Given that MyHeritage committed to ethnicity reports for transfers, they need to live up to that promise. I guarantee you, had I known the truth, I would never have uploaded my DNA results to allow them to build their data base only to have them rescind that promise after they built that data base. I feel like I’ve been fleeced.

As a basis of comparison, Family Tree DNA, who does NOT make anything off of subscriptions, only charges $19 to unlock ethnicity results for transfers, along with all of their other tools like a chromosome browser which MyHeritage also doesn’t currently have.

Ok, so let’s try to find the silk purse in this sow’s ear.

So, How’s the Imputed Matching?

I uploaded my Family Tree DNA autosomal file with about 700,000 SNP locations to MyHeritage.

Today, I have a total of 34 matches at MyHeritage, compared to around 2,200 at Family Tree DNA, 1,700 at 23andMe (not all of which share), and thousands at Ancestry. And no, 34 is not a typo. I had 28 matches in December, so matches are being gained at the rate of 3 per month. The MyHeritage data base size is still clearly very small.

MyHeritage has no tree matching and no tools like a chromosome browser today, so I can’t compare actual DNA segments at MyHeritage. There are promises that these types of tools are coming, but based on their track record of promises so far, I wouldn’t hold my breath.

However, I did recognize that my second closest match at MyHeritage is also a match at Ancestry.

My match tested at Ancestry, with about 382,000 common SNPs with a Family Tree DNA test, so MyHeritage would be imputing at least 300,000 SNPs for me – the SNPs that Ancestry tests and Family Tree DNA doesn’t, almost half of the SNPs needed to match to Ancestry files. MyHeritage has to be imputing about that many for my match’s file too, so that we have an equal number of SNPs for comparison. Combined, this would mean that my match and I are comparing 382,000 actual common SNPs that we both tested, and roughly 600,000 SNPs that we did not test and were imputed.

Here’s a rough diagram of how imputation between a Family Tree DNA file and an Ancestry V2 file would work to compare all of the locations in both files to each other.

myheritage-imputation

Please note that for purposes of concept illustration, I have shown all of the common locations, in blue, as contiguous. The common locations are not contiguous, but are scattered across the entire range that each vendor tests.

You can see that the number of imputed locations for matching between two people, shown in tan, is larger than the number of actual matching locations shown in blue. The amount of actual common data being compared is roughly 382,000 of 1,100,000 total locations, or 35%.

Let’s see how the actual matches compare.

2016-myheritage-second-match

Here’s the match at MyHeritage, above, and the same match at Ancestry, below.

2016-myheritage-at-ancestry

In the chart below, you can see the same information at both companies.

myheritage-ancestry

Clearly, there’s a significant difference in these results between the same two people at Ancestry and at MyHeritage. Ancestry shows only 13% of the total shared DNA that MyHeritage shows, and only 1 segment as compared to 7.

While I think Ancestry’s Timber strips out too much DNA, there is clearly a HUGE difference in the reported results. I suspect the majority of this issue lies with MyHeritage’s imputed DNA data and matching routines.

Regardless of why, and the “why” could be a combination of factors, the matching is not consistent and quite “off.”

Actual match names are used at MyHeritage (unless the user chooses a different display name), and with the exception of MyHeritage’s maddening usage of female married names, it’s easy to search at Family Tree DNA for the same person in your match list. I found three, who, as luck would have it, had also uploaded to GedMatch. I also found two at Ancestry. Unfortunately, MyHeritage does not have any download capability, so this is an entirely manual process. Since I only have 34 matches, it’s not overwhelming today.

myheritage-multiple-vendors

*We don’t know the matching thresholds at MyHeritage. My smallest match at MyHeritage is 12.4 cM. At the other vendors, I have matches right at the actual matching threshold, so I’m guessing that the MyHeritage threshold is somewhere near 12.4 cM. Smaller matches are more plentiful, so I would not expect the threshold to be under 12 cM. Unfortunately, MyHeritage has not provided us with this information. Nor do we know how MyHeritage is counting their total cM, but I suspect it’s total cM over their matching threshold.

For comparison, I used the chromosome browser default of 5 cM at Family Tree DNA and a 5 cM threshold at GedMatch. This means that if we could truly equalize the matching at 5 cM, the MyHeritage totals and number of matching segments might well be higher. Using a 10 cM threshold, Family Tree DNA loses Match 3 altogether and GedMatch loses one of the two Match 2 segments.
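To illustrate why the threshold matters so much when comparing vendors, here’s a small sketch in Python. The segment sizes are made up purely for illustration and are not my actual match data, and the function name is hypothetical. It simply drops segments below a chosen cM threshold and totals what remains; it is not how any vendor’s matching routine works internally.

```python
def summarize_match(segments_cm, threshold_cm):
    """Return (segment count, total cM) for segments at or above the threshold."""
    kept = [cm for cm in segments_cm if cm >= threshold_cm]
    return len(kept), round(sum(kept), 1)


# Hypothetical segment sizes in cM for one match (NOT real data).
example_segments = [14.2, 7.6]

at_5_cm = summarize_match(example_segments, 5.0)    # both segments survive
at_10_cm = summarize_match(example_segments, 10.0)  # the 7.6 cM segment is dropped

print("5 cM threshold: ", at_5_cm)
print("10 cM threshold:", at_10_cm)
```

The same two people can show a different segment count and total depending solely on the threshold, which is why totals from vendors with different (or undisclosed) thresholds can’t be compared directly.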

**I could not find a match for Match 1 at Ancestry, even though, based on the kit type they uploaded to GedMatch, it’s clear that they tested at Ancestry. Ancestry users often don’t use their name, just their user ID, which may not be readily discernible as their name. It’s also possible that Match 1 is not a match to me at Ancestry.

Summary

Any new vendor is going to have birthing pains. Genetic genealogists who have been around the block a couple of times will give the vendors a lot of space to self-correct, fix bugs, etc.

In the case of MyHeritage, I think their choice to use imputation is hindering accurate matching. Social media is reporting additional matching issues that I have not covered here.

I do understand why MyHeritage chose to utilize imputation as opposed to just matching the subset of common DNA for any two matches from disparate vendors. MyHeritage wanted to be able to provide more matches than just that overlapping subset of data would provide. When matching only half of the DNA, because the vendors don’t test the same locations, you’ll likely only have half the matches. Family Tree DNA now imports both the 23andMe V4 file and the Ancestry V2 file, which test just over half of the same locations as Family Tree DNA, and Family Tree DNA provides transfer customers with their closest matches. For more distant or speculative matches, you need to test on the same platform.

However, if MyHeritage provides inaccurate matches due to imputation, that’s the worst possible scenario for everyone and could prove especially detrimental to the adoptee/parent search community.

Companies bear the responsibility to do beta testing in house before releasing a product. Once MyHeritage announced they were out of beta testing, the matching results should be reliable.  The genetic genealogy community should not be debugging MyHeritage matching on Facebook.  Minimally, testers should be informed that their results and matches should still be considered beta and they are part of an experiment. This isn’t a new feature to an existing product, it’s THE product.

I hope MyHeritage rethinks their approach. In the case of matching actual DNA to determine genealogical genetic relationships, quality is far, far more important than quantity. We absolutely must have accuracy. Triangulation and identifying common ancestors based on common matching segments requires that those matching segments be OUR OWN DNA, and the matches be accurate.

I view the matching issues as technical issues that (still) need to be resolved and have been complicated by the introduction of imputation.  However, the broken promise relative to ethnicity reports falls into another category entirely – that of willful deception – a choice, not a mistake or birthing pains. While I’m relatively tolerant of what I perceive to be (hopefully) transient matching issues, I’m not at all tolerant of being lied to, especially not with the intention of exploiting my DNA.

Relative to the “amazing ethnicity reports”, breaking promises, meaning bait and switch or simply bait and renege in this case, is completely unacceptable. This lapse of moral judgement will color the community’s perception of MyHeritage. Taking unfair advantage of people is never a good idea. Under these circumstances, I would never recommend MyHeritage.

I would hope that this is not the way MyHeritage plans to do business in the genetic genealogy arena and that they will see fit to reconsider and do right by the people whose uploaded tests they used as a foundation for their DNA business with a promise of a future “amazing ethnicity report.”

I don’t know if the ethnicity report is actually amazing, because I guarantee you, I won’t be paying $79, or any price, for something that was promised for free. It’s a matter of principle.

If MyHeritage does decide to reconsider, honor their promise and provide ethnicity reports to uploaders, I’ll be glad to share its relative amazingness with you.