Big Y Matching

A few days ago, Family Tree DNA announced and implemented Big Y Matching between participants who have taken the Big Y test.

This is certainly welcome news.  Let’s take a look at Big Y matching, what it means and how to utilize the features.

First, there are really two different groups of people who will benefit from the Big Y tests.

People trying to sort through lines of a common and related surname – like the McDonald or Campbell families, for example – and haplogroup researchers and project administrators.

My own family, for example, is badly brick walled with Charles Campbell first found in Hawkins County, TN in the 1780s.  We know, via STR testing that indeed, he matches the Campbell Clan from Scotland, but we have no idea who is father might have been.  STR testing hasn’t been definitive enough on Charles’ two known sons’ descendants, so I’m very hopeful that someday enough Campbell men will test that we’ll be able between STR and SNP mutations to at least narrow the possible family lines.  If I’m incredibly lucky, maybe there will be a family line SNP (Novel Variant) and it won’t just narrow the line, it will give me a long-awaited answer by genetically announcing which line was his.  Could I be that lucky???  That’s like winning the genetic genealogy lottery!

For today, the Big Y test at $695 is expensive to run on an entire project of people, not to mention that many of the original participants in projects, the long-time hard-core genealogists, have since passed away.  We are now into our 15th years of genetic genealogy.

For those studying haplogroups, the Big Y is a huge sandbox and those researchers have lost no time whatsoever comparing various individuals’ SNPS, both known and novel, and creating haplogroup trees of those SNPs.  This is done by hand today, or maybe more accurately stated, by Excel.  This is “not fun” to put it mildly.  We owe these folks a huge debt of gratitude.  Their results are curated and posted, provisionally, on the ISOGG Tree.

There is an in-between group as well, and those are people who are working to establish relationships between people of different surnames.  In my case, Native American ancestors whose descendants have different surnames today, but who do share a common ancestor in some timeframe.  That timeframe of course could be anyplace from a couple hundred to several thousand years, since their entry into the Americas across Beringia someplace in the neighborhood of 12-15 thousand years ago.

The Big Y matching is extremely helpful to projects.

Let’s take a look.

Big Y Matches

Big Y landing

On your personal page, under “Other Results,” you’ll see the Big Y results.  Click on Results” and you’ll see the following page.

big y results

The Known SNPs and Novel Variants tabs have been there since release, but the Matching tab, top left, is new.

By clicking on the Matching tab, you will then see the men you match based on your terminal SNP as determined in the Big Y Known SNPs data base.  You will be matched to men who carry up to and including 4 mutations difference in known SNPs, and unlimited novel variant differences.  If you have a zero in the “Known SNP Difference” column, that means you have no differences at all in known SNPs.

big y matches cropped2

The individual being used for an example here has paternal ancestry from Hungary.  His terminal SNP is reported as R-CTS11962.  Therefore, all of the people he matches should also carry this same SNP as their terminal SNP.

This is actually quite interesting, because of his 10 exact matches, 9 of them have surnames or genealogy that suggests eastern European/Slavic ancestry.  The 10th, however, which happens to be his closest match, carries an English surname and reports their ancestor to be from Yorkshire, England.  His one mutation differences carry the same pattern, with one being from England and two of the other three from eastern Europe.

Our participant has 155 total Novel Variants, 135 high quality and 20 medium quality.  Only high quality are listed in the comparison.  Medium quality are not.

Ancestral Location Known SNP Difference Shared Novel Variants Non Matching Known SNPs
Yorkshire, England 0 134 None
Prussia 0 127 None
Ukraine 0 121 None
Poland 0 121 None
Belarus 0 119 None
Poland 0 116 None
Poland 0 116 None
Russian e-mail 0 113 None
Bulgaria 0 113 None
Slovakia 0 111 None
English surname 1 126 PF6085
Undetermined, poss German 1 121 F1816
Poland 1 118 F552
Poland 1 116 CTS10137
Prussia 2 122 CTS11840 PF4522
Poland 2 112 L1029 PR6932
Russia 3 116 CTS3184 L1029 PF3643
Poland 3 106 CTS11962 L1029 L260
Ukraine 3 105 CTS11962 L1029 L260
Poland 3 104 CTS11962 L1029 L260
Poland 3 100 CTS11962 L1029 L260
Poland 3 99 CTS11962 L1029 L260
Eastern European surname 3 98 CTS11962 L1029 L260
Poland/Germany 3 97 CTS11962 L1029 L260
Austria/Galacia 3 93 CTS11962 L1029 L260
Poland 4 97 CTS11562 CTS11962 L1029 L260

It’s also very interesting to note that his non-matching known SNPs tend to cluster.  Non-matching known SNPs can go in either direction – meaning that they could be absent in our participant and present in the rest, or vice versa.

l1029 search

It’s easy to tell.  In the Big Y Results, under Known SNPs, there is a search feature.  This means that it’s easy to search for SNPs and to determine their status.  For example, above, our participant does carry SNP L1029 (he’s derived or positive (+) for the mutation in question).  This means that our participant has developed L1029, and, it just so happens, also CTS11962 and L260, the three clustered SNPs, since these men shared a common ancestor.

It’s difficult not to speculate a little.  If the TMCRA Big Y SNP estimates are correct, this suggests that these 3 clustered SNPS occurred someplace between 4350 and about 5000 years ago, based on the range (93-106) of the number of high quality novel variant differences.  We’ll talk more about this in a minute.

f552 search

For SNP F552, our participant is negative, meaning that that other person has developed this SNP since their shared ancestor.  In fact, he’s negative for all of the other Known SNP differences.

Novel Variants

The Novel Variants are quite interesting.  Novel Variants are mutations that if found in enough people who are not related within a family group will someday become SNPs on the tree.  Think of them as ripening SNPs.

By clicking on the “Show All” dropdown box you can see the list of the participants novel variants and how many of his matches share that Novel Variant.

novel variant list

In this example, all 26 of our participant’s novel variants share 13142597.  I’m thinking that this Novel Variant will someday become classified as a SNP and not as a Novel Variant anymore.  When that happens, and no, we don’t know how often Family Tree DNA will be reviewing the Novel Variants for SNP candidates, it will no longer be in the Novel Variant list.  The Novel Variants are meant to be family, novel or lineage SNPs, not population based SNPS that apply to a wide variety of people.  Finding these, of course, and adding them to the human haplotree is the entire purpose of full sequence Y chromosomal testing.  Just look at tall of this new information about this man’s ancestors and the DNA that they passed on to this gentleman.

By scrolling down to the bottom of that list, we find that our participant has 8 different Novel Variants where he matches only one individual.  By clicking on the Novel Variant number, you can see who he matches.  Of those 8, 7 of them match to the man who carries the English surname and one matches to a gentleman from Prussia.

This information is extremely interesting, but it gets even more interesting when compared against STR matches.  Our participant has a fairly unusual haplotype above 12 markers.  He has three 67 marker matches, two 37 marker matches and thirty-three 25 marker matches.  None of the men he matches on the SNP test match him on any of those tests.  I did not check his 12 marker matches, because I felt that anyone who would invest the money in the Big Y would certainly have tested above 12 markers plus our participants has several hundred 12 marker matches.

The numbers being bantered around by people working with SNP information suggest that one Big Y mutation equals about 150 years.  If this is true, then his closest match, the English gentleman from Yorkshire, England would share an ancestor about 2850 years ago.  That is clearly beyond the reach of STR markers in terms of generational predictions, so maybe STR matches are not expected in this situation, IF, the 150 year per novel variant estimate is close to accurate.

Another interesting piece of information that can be deduced from this information is how many SNPs were actually found.

At the bottom of our participants page, under Known SNPs, it says “Showing 24 of…571 entries (filtered from 36,274 total entries.)”  We know that the entire data base of SNPs that Family Tree is utilizing, which includes but is not limited to the 12,000+ Geno 2.0 SNPs, is 36,274.  In other words, 36,274 are the number of SNPs available to be found and counted as a SNP because they have already been defined as such.  Any other SNPs discovered are counted as Novel Variants.

Not all available SNPs are found and read in this type of next generation test.  The number of “Matching SNPs” with each individual gives us an idea of how many SNPs actually were found and read at either a medium and high confidence level.  Low confidence SNPs and no-calls are eliminated from reporting.

Our participants best match matches him on 25,397 SNPs.  This leaves a total of 10,877 SNPs that were not called.

The Future

SNP Matching is a wonderful feature and a first in this industry.  A hearty thank you to Family Tree DNA!

However, like all passionate people, we are already looking ahead to see what can be and should be done.

Here are some suggestions and questions I have about how the future will unwrap relative to Big Y SNP testing and matching.

  1. Within surname projects, matching should be relatively easy, unless hundreds of people test. I would be happy to have that problem. Today, administrators are creating spreadsheets of matches and novel SNPs and attempting to “reverse engineer” trees. In family groups, those trees would be of Novel SNPs, and in haplogroup projects, those trees would be of both Known SNPs and Novel Variants and where the Novel SNPS slip in-between the known SNPs to create new branches and sub-branches of the haplotree. We, as a community, need some tools to assist in this endeavor, for both the surname project admin and the haplogroup project admin as well.
  2. As new SNPs are discovered in the future, one will not be retested on this platform. As new SNPs are added to the tree, this could affect the matching by terminal SNP. Family Tree DNA needs to be prepared to deal with this eventuality.
  3. As a community, we desperately need a better tool to determine our actual “terminal SNP” as opposed to the Geno 2.0 terminal SNP. Yes, I know the ISOGG tree is provisional, but the contributed tools initially provided by volunteers to search the ISOGG tree utilizing the known SNPs reported in Big Y no longer work. We desperately need something similar while Family Tree DNA is revamping its own tree. I would hope that Family Tree DNA could add something like a secondary “search ISOGG tree” function as a customer courtesy, even if it needs some disclaimer verbiage as to the provisional nature of the tree.
  4. With the number of SNPs being searched for and reported, no calls begin to become an issue, especially if the no-call happens to be on the terminal SNP. We need to be able to determine whether a non-match with someone is actually a non-match or could be as a result of a no-call, and without resorting to searching raw data files. Today, participants can order a SNP test of a SNP position that has been reported as a no-call, but one needs to first figure that out that it is a no-call by looking at the BAM and BED files, something that is beyond the capability of most genetic genealogists. Furthermore, in the case of a “suspicious” no-call, where, for example, individuals in the same surname project with the same surname and other matching SNPS and STRs, some type of “smart-matching” needs to be put into place to alert the participant and project admin of this situation so that they can decide up on a proper course of action. In other words, no-calls need to be reported and accounted for in some fashion, as they are important data points for the genetic genealogist.

I am extremely grateful to Family Tree DNA for their efforts and for Big Y matching.  After all, matching is the backbone of genetic genealogy.  This list is not a complaint list, in any sense.  Family Tree DNA has a very long history of being responsive to their client base and I fully expect they will do the same with the next step in the Big Y journey.

The story of our DNA is not yet told.  Where our STR matches are found and where our SNP matches are found tells the story of the migration of our ancestors.  Today, SNPs and STRs promise to overlap, and already have in some cases.  If I could, I would order a Big Y test for every individual that I sponsor and for every person in each of my projects. I feel that these tests, combined, will help immensely to complete the puzzle to which we have disparate pieces today.  I look forward to the day when the time to the most recent common ancestor can be calculated by utilizing the Y STR markers, the known SNPs and the Novel Variants.  In a very large sense, the future has arrived today.  Now, we just have to test and figure out how all of the puzzle pieces fit together.

If you haven’t yet ordered a Big Y, you can order here.  The more people who test, the larger the comparison data base, and the sooner we will all have the answers we seek.