In response to my article about haplogroup C3*, a regular contributor, Armando, left the following comment:
“Roberta, there was a problem with the way Felix was processing files and he had to change the Clovis Anzick file three times at Gedmatch. The last one is kit F999919 uploaded October 8, 2014. You can see his post on that at http://www.fc.id.au/2014/10/new-clovis-anzick-1-kit-in-gedmatch.html
If you do one-to-many matching on Clovis Anzick F999919 at Gedmatch there is not a single person that reports to have mtDNA M. Your extracts for Clovis Anzick are from September 24, 2014 and therefore are based on a bad file which was kit F999912. The older bad kits F999912 and F999913 have been deleted from Gedmatch. Felix mentions the updates at http://www.fc.id.au/2014/09/clovis-anzick-1-dna-match-living-people.html“
This comment came in on Christmas Eve, and I replied that I would look into this after the holidays.
Given that it was Christmas Eve, I certainly wasn’t going to bother anyone over the holidays with questions, so I quickly ran a one-to-many comparison for the current Anzick kit, F999919, and found that at 5cM and below there were 4 haplogroup M matches.
As I did before, I sent emails to those who provided e-mail addresses asking about their matrilineal heritage.
The first thing I wanted to do, of course, was to check with Felix. I knew that Felix had updated the kits, but my understanding was that he added SNPs from the various companies to create a single file with all the SNPs from all three testing companies, not that any file was bad, so to speak.
I asked Felix if the original files had problems or were bad, and here is his response.
“I can assure you none of the earlier/older versions uploaded to GEDmatch (kit# F999912 and F999913) of Clovis Anzick was bad.
- F999912 – Contains only FTDNA SNPs extracted from VCF file provided by authors.
- F999913 – Contains all SNPs used by DNA testing companies extracted from VCF file provided by authors.
- F999919 – Contains all SNPs used by DNA testing companies processed from BAM file provided by authors.
- VCF Source: http://www.cbs.dtu.dk/suppl/clovis/data/Anzick-1/genotypes/
- BAM Source: http://www.cbs.dtu.dk/suppl/clovis/data/Anzick-1/bams/
I removed the earlier versions not because they are bad but only to avoid redundancy for the same sample kit, and processed BAM file (which is a 41 GB file) contains significantly more SNPs compared to VCF source. Because the latest file has more SNPs, it is possible that some missing SNPs in earlier uploads (which was assumed as matching in GEDmatch) may actually have mismatches in new file and thus, could fall below the thresholds or could break the previously matching segment.
The difference in matches between F999912 and F999919 kit for Clovis Anzick is similar to difference in matches between a 23andMe V4 kit and V3 kit for the same person.”
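To make Felix's point concrete, here is a minimal Python sketch. It is not GEDmatch's actual algorithm, which isn't described here. The positions, genotypes and the `longest_matching_run` helper are all made up for illustration. It assumes that a SNP missing from either kit is simply skipped, so it can never break a segment, and that two genotypes match when they share at least one allele.

```python
def shares_allele(g1, g2):
    """Half-identical test: the two genotypes share at least one allele."""
    return bool(set(g1) & set(g2))


def longest_matching_run(kit_a, kit_b):
    """Number of SNPs in the longest unbroken matching run.

    kit_a and kit_b map SNP position -> genotype string such as "AG".
    Positions missing from either kit are skipped, so they never break a run.
    """
    best = current = 0
    for pos in sorted(set(kit_a) & set(kit_b)):
        if shares_allele(kit_a[pos], kit_b[pos]):
            current += 1
            best = max(best, current)
        else:
            current = 0
    return best


# Hypothetical participant kit with six SNPs.
participant = {100: "AA", 200: "AG", 300: "CC", 400: "GG", 500: "TT", 600: "AC"}

# Sparser version of the ancient kit: position 300 is absent.
sparse_kit = {100: "AA", 200: "GG", 400: "GG", 500: "CT", 600: "CC"}

# Denser version: position 300 is now present and mismatches (TT vs CC).
dense_kit = dict(sparse_kit)
dense_kit[300] = "TT"

print(longest_matching_run(participant, sparse_kit))  # 5
print(longest_matching_run(participant, dense_kit))   # 3
```

In this toy case, the extra SNP in the denser kit splits one run of five into runs of two and three, which is the kind of change that could push a real segment below a threshold.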
After thinking about this some, it occurred to me that perhaps GedMatch was treating different files from different vendors differently in their matching and sorting routines. That might account for a difference in matching. So, I asked John Olson at GedMatch.
John’s reply is as follows:
“At one time, I did use different thresholds depending on which vendor was being compared to which other vendor. That was a holdover from when FTDNA had Affymetrix kits that were producing somewhat different results than Illumina kits. I have since changed the one-to-many thresholds to 5cM/500 SNPs for all comparisons. The one-to-one thresholds default to 7cm/700 SNPs. I believe I made that change about a year ago, but it may have been longer. At any rate, they are all the same now, and I’m pretty sure they are all the same since Felix has introduced the F9999xx kits. Another change made within the past year is to treat A=T and C=G for all comparisons. This was done to get rid of single SNP errors in the few cases where one vendor was reporting a different strand than another vendor. In a few cases, I have observed that this “heals” some single-SNP breaks in otherwise continuous matching segments.
It is possible that older one-to-many comparisons may have been made under slightly different conditions than newer ones. Older comparisons made with a 3cm/300 SNP threshold may show larger total segment match if they contained many very small matching segments. This usually happens with endogamous populations. Comparisons affected by the change to A=T, C=G may show a larger matching segment where 2 smaller matching segments existed previously.
Another issue to be aware of when comparing artificial kits is that there may be large gaps between the defined SNPs. So, even if there is a gap of a million SNPs, the GEDmatch comparison algorithm will treat them as contiguous. This works OK when everybody is using the same SNPs, but when the list of SNPs is significantly different, it may produce matches that are bogus. This is particularly obvious when generating artificial kits that are missing large segments of data. I have had to deal with this issue with phased kits and Lazarus kits by introducing the concept of a “hard break” that forces a break between smaller matching segments.”
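Two of the ideas John describes, treating A=T and C=G and forcing a "hard break" across large gaps, can be sketched in a few lines of Python. This is only my illustration of the concepts, not GEDmatch's code. The `segments` function, the sample SNPs and the `max_gap` value are all assumptions invented for the example.

```python
# Collapse complementary bases into one class each: A with T, C with G.
STRAND_CLASS = {"A": "A", "T": "A", "C": "C", "G": "C"}


def normalize(genotype):
    """Return the set of strand-independent allele classes in a genotype."""
    return {STRAND_CLASS[allele] for allele in genotype}


def matches(g1, g2, strand_agnostic=True):
    """True when the two genotypes share at least one allele."""
    if strand_agnostic:
        return bool(normalize(g1) & normalize(g2))
    return bool(set(g1) & set(g2))


def segments(snps, max_gap=None, strand_agnostic=True):
    """Split (position, genotype_a, genotype_b) rows into matching segments.

    A mismatch always ends the current segment. If max_gap is given, a jump of
    more than max_gap between consecutive matching SNPs also ends it, which
    plays the role of the "hard break". Returns (start, end, snp_count) tuples.
    """
    result = []
    start = prev = None
    count = 0

    def close():
        nonlocal count
        if count:
            result.append((start, prev, count))
        count = 0

    for pos, ga, gb in snps:
        if count and max_gap is not None and pos - prev > max_gap:
            close()
        if matches(ga, gb, strand_agnostic):
            if count == 0:
                start = pos
            prev = pos
            count += 1
        else:
            close()
    close()
    return result


# Made-up data: a strand flip at 2,000 and a large gap before 900,000.
snps = [
    (1_000, "AA", "AG"),
    (2_000, "AA", "TT"),
    (3_000, "CC", "CT"),
    (900_000, "GG", "GG"),
    (901_000, "AC", "CC"),
]

print(segments(snps, strand_agnostic=False))
# [(1000, 1000, 1), (3000, 901000, 3)]
print(segments(snps))
# [(1000, 901000, 5)]
print(segments(snps, max_gap=500_000))
# [(1000, 3000, 3), (900000, 901000, 2)]
```

The second call shows the "healing" effect John mentions, and the third shows how a hard break keeps a data gap from being counted as one continuous segment.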
I wanted to know how the three files that Felix prepared compared relative to the matches they produced. I originally ran several comparisons with each of the first two versions, kits F999912 and F999913, and I didn’t save all of the original files, but I do have at least one file saved from each version. Therefore, I dropped all three sets of results (F999912, F999913 and F999919) into a spreadsheet to see how matching compared between the three Anzick file versions.
Keep in mind that the first file (F999912) contained just the FTDNA SNPs, while the second (F999913) and third (F999919) files contain the SNPs from all of the testing companies. This could potentially make the participant files appear to have missing segments when the matching routine at GedMatch sees SNPs in the Anzick file not in the participant files. However, this shouldn’t be much different than comparing a file from two different vendors except that the Anzick file has the SNPs from all three vendors combined.
The first file from 9-23 at the default threshold had 491 matches, but I subsequently lowered the threshold so I could see as many matches as possible.
GedMatch only shows you your closest 1500 matches, although I now know that as of 12-31-2014, there were a total of 3442 Anzick matches at the 5cM threshold.
The second file from 9-29, run at 6cM, had more than 1500 matches. I ran the third kit at default settings on December 27th and it has 720 matches.
One would expect that the second and third files would have the effect of including more matches from both 23andMe and Ancestry since all of the SNPs utilized by those companies are included (if they are available in the Anzick sample.) We also have to remember that there are new files being uploaded from all three vendor sites on a daily basis, so the total available to match is also increasing. Of the 721 kit matches to F999919, 31 were shades of green which indicate that they have been uploaded during the last 30 days, so we could probably presume that about double that number were uploaded (and match) in two months or triple in three months, so probably about 100 new kits. Those kits would show in the match extraction for this month but not for the first month and possibly not for the second. However, all the kits that matched the first month at the highest threshold should still be showing in the second and third month. Let’s see if that holds true.
I dropped all three sets of data into a spreadsheet and colorized the rows.
- Blue = F999912, first extraction, 9-23-2014
- Yellow = F999913, second extraction, 9-29-2014
- Pink = F999919, third extraction, 12-27-2014
Then, for each blue row (a match from the first extraction), I checked whether the same kit also appeared in both the yellow (second) and pink (third, current) extractions, in the yellow only, in the pink only, or in neither, and counted the rows in each group.
You can see that the green grouping shows that all three match each other. The match between A003479 in both the second and third extraction could be because the kit was not present when the first extraction was done.
| | All 3 match | 1st to 2nd Only | 1st to 3rd Only | No Match |
|---|---|---|---|---|
| Percent First Extraction Matches to Other Extractions | 54% | 36% | 5% | 5% |

By percent, this is how the matching between kits worked. About half of the kits in the first extraction continued to match in both subsequent extractions. Of the remaining half, about three quarters match the second extraction only, and a few match just the third extraction or neither. For the most part, there is no evident reason upon inspection why the kits would not match the second or third extraction, so the cause has to be the additional SNPs, the matching routine, or both. This is not to imply that the results are problematic, just that they are different than I would have expected.
A very low percentage of kits matched only between the first and third extracts and the same percentage had no matches in either the second or third extraction.
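For anyone who would rather script this comparison than color rows in a spreadsheet, the counting can be sketched in Python with sets. This is a rough sketch, not the method I actually used. The kit numbers are invented, and it assumes each extraction has been reduced to a set of kit numbers.

```python
from collections import Counter


def classify_first_extraction(first, second, third):
    """Count how kits from the first extraction fare in the later two."""
    counts = Counter()
    for kit in first:
        in_second = kit in second
        in_third = kit in third
        if in_second and in_third:
            counts["All 3 match"] += 1
        elif in_second:
            counts["1st to 2nd only"] += 1
        elif in_third:
            counts["1st to 3rd only"] += 1
        else:
            counts["No match"] += 1
    return counts


def as_percentages(counts):
    """Convert raw counts to whole-number percentages of the total."""
    total = sum(counts.values())
    return {label: round(100 * n / total) for label, n in counts.items()}


# Invented kit numbers, one per category.
first = {"M000001", "M000002", "F000003", "A000004"}
second = {"M000001", "M000002", "A000005"}
third = {"M000001", "F000003", "A000005"}

counts = classify_first_extraction(first, second, third)
print(as_percentages(counts))
```

A kit such as the invented A000005, which appears only in the second and third sets, is never counted because the loop only walks the first extraction.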
I took a closer look at the kits with no matches at all. All of them had relatively low total cM and largest segment sizes. The smallest total cM was 7.1 and the largest was 8.2. The smallest segment was 7.1 and the largest segment was also 8.2. For every one of these entries, the total cM equaled the largest segment, meaning each match consisted of a single segment. At first glance it appears that these simply slipped below the match threshold, but that doesn’t seem to be the case: in the current (pink) extract, a total of 171 entries were at or below 8.2 total cM and 8.2 largest cM, and several kits had exactly the same cM values as the kits from the first (blue) extract that no longer show up as matches. So something truly was different in the SNPs or in how the matching was done.
Is there any correlation to the kits in the original extract that didn’t match any other extract in terms of which testing company the participants utilized?
One Ancestry kit (4%), 18 23andMe kits (64%), 7 Family Tree DNA kits (25%) and 2 FN kits (7%) from the original extract didn’t match in either later extract. But how many kits were in the original extract from the various companies?
| Kit Source | Original Kit Matches | Second Kit Matches | Current Kit Matches |
|---|---|---|---|
| Ancestry Kits (A) | 26 (5%) | 438 (29%) | 199 (28%) |
| FTDNA Kits (F) | 94 (19%) | 295 (20%) | 121 (17%) |
| Other F+ Kits* | 15 (3%) | 35 (2%) | 15 (2%) |
| 23andMe Kits (M) | 354 (72%) | 732 (49%) | 382 (53%) |

*FB, FN, FE, FV
The effect of the additional SNPs in the kits seems to have been to increase the Ancestry kit matches significantly.
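The percentages in the table are just each vendor's share of the kits in that extract. Here is a small Python sketch of that arithmetic, using the counts from the Original Kit Matches column. The `vendor_of` helper is my own illustration of sorting kits by their prefix letters as labeled in the table, not an official GedMatch rule.

```python
OTHER_F_PREFIXES = {"FB", "FN", "FE", "FV"}
VENDOR_BY_LETTER = {"A": "Ancestry", "F": "FTDNA", "M": "23andMe"}


def vendor_of(kit_number):
    """Guess the vendor group from a kit number's leading letters."""
    kit = kit_number.strip().upper()
    if kit[:2] in OTHER_F_PREFIXES:
        return "Other F+"
    return VENDOR_BY_LETTER.get(kit[:1], "Unknown")


def shares(counts):
    """Each vendor's share of the total, as a whole-number percentage."""
    total = sum(counts.values())
    return {vendor: round(100 * n / total) for vendor, n in counts.items()}


# Counts taken from the Original Kit Matches column.
original = {"Ancestry": 26, "FTDNA": 94, "Other F+": 15, "23andMe": 354}

print(shares(original))
# {'Ancestry': 5, 'FTDNA': 19, 'Other F+': 3, '23andMe': 72}
print(vendor_of("A003479"))  # Ancestry
```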
It was interesting to see how the same person’s kit from different vendors compared as well. In this random example, the Family Finder kit has a higher total cM and largest segment than the 23andMe v3 kit.
Here’s a kit from one person at all three vendors, but the 23andMe kit is version 4, in which 23andMe significantly reduced the number of SNPs tested by about one third, from about 900,000 to about 600,000.
I wondered if there is a difference in what is reported based on the threshold selected. Now at first glance, one would think, “well of course there is a difference,” but the difference should be at the bottom end of the list. In other words, the top matches should be the top matches at 7cM, 6cM, 5cM, etc. The top matches at 7cM would still be at the top at 6cM, with more, smaller matches appended to the end of the match list. At least, that is what I would expect.
Let’s see if this holds true with the current file.
I ran the “one to many” option for the current Anzick kit, F999919, at seven different levels, on the same day, one right after the other, as follows:
- 7cM, 700 SNPs
- 6cM, 600 SNPs
- 5cM, 500 SNPs
- 4cM, 400 SNPs
- 3cM, 300 SNPs
- 2cM, 200 SNPs
- 1cM, 100 SNPs
The first extract produced 719 records. The rest all exceeded the 1500-match display limit, so we only see the first 1500. Normally, for genealogy, the 1500 limit would certainly be adequate, but for research, it is frustrating.
To make this easier let me say that the extracts from 5cM down through 1cM were exactly the same, but the extracts at 7, 6 and 5cM, respectively, were not.
Discussions with John Olson at GedMatch shed some light on why the 5cM through 1cM extracts were exactly the same.
“For the past year or so, the database has only stored matches down to 5 cM.”
I sure wish I had known that BEFORE I did all of those extracts.
I combined and color coded all 7 extractions into a spreadsheet.
Most of the groupings look like this, where blue = 7cM, pink = 6cM, green = 5cM, purple = 4cM, teal = 3cM, apricot = 2cM and yellow = 1cM. Nice rainbows.
All of the matches from the 7cM extraction, with the exception of a few X matches at the end, some of which have no matches on chromosomes 1-22, are included in the 6cM and 5cM extractions, but after the first several records, they are not in the same position. In other words, they are not the top 719, in the same order, in either the 5 or 6cM extraction, but the 5cM through 1cM extractions are identical. Of course, now we know why the 5cM through 1cM matches are exact. From here on in the article, I won’t mention the 4cM-1cM extracts because they are the same as the 5cM extract.
For example, looking at the kit in position 712, the last non-X match in the 7cM extract – you find this same kit at row 1140 in the 6cM extract and row 1489 in the 5cM extract.
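The check I was doing by eye, whether the higher-threshold list is simply the top of the lower-threshold list, can be sketched in Python. This is a hedged sketch: the kit labels are placeholders, and it assumes each extract is an ordered list of kit numbers exactly as displayed.

```python
def is_prefix(shorter, longer):
    """True when `shorter` is exactly the first rows of `longer`, in order."""
    return longer[:len(shorter)] == shorter


def rank_shifts(shorter, longer):
    """Map each kit in `shorter` to (row in shorter, row in longer or None).

    Rows are 1-based to line up with spreadsheet row counting.
    """
    position = {kit: row for row, kit in enumerate(longer, start=1)}
    return {
        kit: (row, position.get(kit))
        for row, kit in enumerate(shorter, start=1)
    }


# Placeholder kit labels standing in for two extracts.
seven_cm = ["K1", "K2", "K3"]
six_cm = ["K1", "K4", "K2", "K5", "K3"]

print(is_prefix(seven_cm, six_cm))    # False
print(rank_shifts(seven_cm, six_cm))
# {'K1': (1, 1), 'K2': (2, 3), 'K3': (3, 5)}
```

A result of `None` in the second position would flag a kit that is in the shorter list but missing from the longer one altogether.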
The 6cM extract appears to have some issues. I ran this twice with the same parameters to be sure there wasn’t an error in how it was set up, and the two runs were identical.
There are about 350 individuals who show up in the 6cM extract who should show up in the 5cM extract as well, but who don’t show in the 5cM extract. They are under the threshold for the 7cM extract, so that is correct, but why are these 350 individuals not appearing as matches at the 5cM threshold?
The kits noted above are the largest non-matching total cM and largest cM that don’t show up in the 5cM extract. The smallest matches are 6.1 and 6.1, respectively.
Checking the 5cM extract, below, there are files with smaller total cMs and a smaller largest segment that are showing as matches.
However, looking at the kits with the smallest cMs at the 5cM level, the smallest total cM is 6.9, and it is paired with a largest segment of 6.9 as well, so that is above the 6.8 and 6.8 shown above. The smallest individual segment is 5.1, but the total cM for that individual is 10.1. So evidently the matching threshold at GedMatch is some combination of both the total cM and the largest segment. This is somewhat unexpected, but doesn’t seem to be a red flag, just how this system works.
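Finding the kits that appear in one extract but not another, and the range of their cM values, is easy to script. Below is a minimal Python sketch. The rows and kit labels are invented for illustration (the cM values merely echo the kind of numbers discussed above), and the field names `kit`, `total_cm` and `largest_cm` are my own assumptions about how an extract might be stored.

```python
def missing_from(reference, other):
    """Rows of `reference` whose kit number does not appear in `other`."""
    other_kits = {row["kit"] for row in other}
    return [row for row in reference if row["kit"] not in other_kits]


def value_range(rows, field):
    """Smallest and largest value of `field` across the rows."""
    values = [row[field] for row in rows]
    return min(values), max(values)


# Invented rows standing in for the 6cM and 5cM extracts.
six_cm = [
    {"kit": "X1", "total_cm": 12.0, "largest_cm": 7.5},
    {"kit": "X2", "total_cm": 6.8, "largest_cm": 6.8},
    {"kit": "X3", "total_cm": 6.1, "largest_cm": 6.1},
]
five_cm = [
    {"kit": "X1", "total_cm": 12.0, "largest_cm": 7.5},
    {"kit": "X4", "total_cm": 10.1, "largest_cm": 5.1},
]

missing = missing_from(six_cm, five_cm)
print([row["kit"] for row in missing])   # ['X2', 'X3']
print(value_range(missing, "total_cm"))  # (6.1, 6.8)
```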
So, where are we?
I am glad to have Felix confirm that the files weren’t “bad,” only truly “new and improved,” and that the matching between the various files is pretty much as expected – and from various tests run, everything pretty much looks kosher. The newer files with all of the SNPs utilized by the companies seem to level the playing field, allowing Ancestry kits a better chance of matching.
Aside from my intense interest due to the Native American connection, this is also how I’ve been extracting potential Native American mitochondrial haplogroups from the Anzick matches, including haplogroup M, for my research notes. M is potentially a Native American haplogroup, but is as yet unproven. With haplogroup M showing up in these people who are often heavily Native, and often from Mexico, Central and South America where 80% of the mitochondrial population is believed to be of Native American heritage, it seems prudent to add them to my research notes for further research and possible proof in the future. I contact individuals and ask about their matrilineal heritage. If they don’t have Asian or genealogically proven heritage elsewhere, and their families emerge from the areas with high Native frequencies, I include them on the research list.
In the three days between the two extracts this past week, three of the four haplogroup M individuals were pushed below the match threshold and are no longer visible at the default level. Yes, I have confirmed that they are still there, just not visible within the 1500-match limit.
I have contacted the individuals with e-mail addresses, asking about their matrilineal heritage. One person said the tester’s mother’s heritage was from India, so that haplogroup M is not on the research list, of course, because it is proven to be from elsewhere – a place where haplogroup M and subgroups are quite common.
In total, there were 15 new potentially Native mitochondrial DNA haplogroups listed in the 12-27 extract. I’ll be adding those to my research notes as soon as I have the opportunity to contact these folks and ask about their known matrilineal genealogy.
I didn’t really anticipate that there would be so much change, nor so quickly, so it looks like I’m going to have to check the Anzick matches for potential Native mitochondrial haplogroups much more often.
Since it looks like there may be lots of additions over time, far more than I expected, I’ll also be going back and making better notes in my research file. I will, for example, note the kit number and date for all of the extractions. For this and future extractions, I’ll also be listing the number of results per haplogroup. I think that would be valuable information as well.
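As a sketch of what those better notes might look like if kept in a script rather than a spreadsheet, here is one possible Python structure that records the kit number, the run date and a tally of results per haplogroup. The `ExtractionNote` class and the sample haplogroup list are purely hypothetical, and the tally groups by the first letter of each haplogroup as a simplifying assumption.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ExtractionNote:
    """One extraction: which kit, when it was run, and a haplogroup tally."""
    kit: str
    run_date: date
    haplogroups: Counter = field(default_factory=Counter)


def summarize(kit, run_date, mt_haplogroups):
    """Tally mtDNA haplogroups by top-level letter, skipping blank entries."""
    tally = Counter(
        h.strip()[0].upper() for h in mt_haplogroups if h and h.strip()
    )
    return ExtractionNote(kit, run_date, tally)


# Hypothetical haplogroup column from a match list; blanks are common.
sample = ["A2", "B2a", "", "M", "a2d", "C1b"]

note = summarize("F999919", date(2014, 12, 27), sample)
print(note.kit, note.run_date.isoformat())
print(dict(note.haplogroups))
# {'A': 2, 'B': 1, 'M': 1, 'C': 1}
```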
I’d like to thank Armando for raising this topic. The research into matching with a kit that has the entire spectrum of SNPs from all three of the companies has been quite interesting. In fact, unless Felix has added all of the SNPs to the other ancient kits, this is the only kit in existence that has all of the SNPs from all of the companies included.
My thanks to Felix Immanuel (formerly Felix Chandrakumar) and John Olson for assistance with research for this article.