DNAGedcom Client

DNAGedcom provides an incredibly cool tool that has helped me immensely with my genealogy research, particularly at Ancestry and Family Tree DNA. This tool doesn’t replace what Ancestry and Family Tree DNA provide, but augments the functionality significantly.

I’ve been frustrated for months by the broken search function at Ancestry, and the DNAGedcom tool allows you to bypass the search function entirely by downloading the direct line ancestral information for all of your matches. So let’s use my Ancestry account as an example.

Utilizing DNAGedcom

After installing the DNAGedcom tool on your system, sign on to your Ancestry account through the tool. The tool downloads all of your matches, the people you match in common with them, and the ancestors in your matches’ trees.

The best part about this is that the results are then in a spreadsheet file that you can simply sort utilizing normal spreadsheet functions. I wrote about using spreadsheets for genetic genealogy in the article, Concepts – Sorting Spreadsheets for Autosomal DNA.

In my case, this means I can see everyone I match who has an Estes, or any other surname, in their tree. I don’t have to look at my matches’ trees one at a time.

You can read about this very cool tool at this link, including how to subscribe for either $5 per month or $50 per year. Many functions at DNAGedcom are free, but the Ancestry tool is available through a minimal subscription which helps to support the rest of the site.

After subscribing, the DNAGedcom client will become available to you on your subscriber page at DNAGedcom.


After you subscribe, you’ll see the link for the Ancestry download tool, along with other resources.

You will want to follow the installation directions, exactly, to download the DNAGedcom client onto your PC or Mac in preparation for downloading your Ancestry match information onto your system. This is painless and goes quickly.

Next, you will be prompted to sign in to both DNAGedcom and Ancestry, through the tool, and then you will be prompted for three separate steps at Ancestry:

  • Gather Matches – took about 10 minutes
  • Gather Trees – let’s just say you might want to run this one overnight, and on a directly connected system, not wifi. Mine was about 25% complete at the 2 hour mark
  • Gather ICW – another several hours, but you can do other things on your system at the same time

The downloaded files will be stored on your computer as .csv files. On my PC, the default location was in the Documents directory and the files are named as follows:

  • a_Roberta_Estes (the ancestors of my matches)
  • icw_Roberta_Estes (the people I match and who I match in common with them)
  • m_Roberta_Estes (information about the match, such as cMs, etc.)

It’s important to make a note of this, as I didn’t find the file names documented elsewhere.
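If you want to keep track of which file is which in a script, here is a minimal sketch that labels a downloaded file by its prefix. The prefixes come from the list above; the .csv extension handling and the function name describe_download are my own assumptions for illustration, not part of the DNAGedcom client.

```python
from pathlib import Path

# Prefixes as listed above. The labels are informal descriptions.
PREFIX_MEANING = {
    "a_": "ancestors of matches",
    "icw_": "in-common-with matches",
    "m_": "match details",
}


def describe_download(filename: str) -> str:
    """Return a human-readable label for a downloaded file, based on its prefix."""
    name = Path(filename).name
    for prefix, meaning in PREFIX_MEANING.items():
        if name.startswith(prefix):
            return meaning
    return "unknown"


for example in ["a_Roberta_Estes.csv", "icw_Roberta_Estes.csv", "m_Roberta_Estes.csv"]:
    print(example, "->", describe_download(example))
```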

The good news is that even though these steps take a long time, having all of this information in a place where you can sort it and use it effectively is extremely useful. You can run the various steps at night or when you aren’t otherwise using your system.

In addition, if someone is sharing their DNA results with you on Ancestry (which they can under the settings gear), you can download the same data for their account – and then you can look for commonalities between groups of results using the DNAGedcom Match-O-Matic tool, also described in the introductory document.

Using the Downloaded Files

Personally, what I wanted to do was to search for all occurrences of a particular surname. Fortunately, it was Claxton or Clarkson, not Smith.

Simply using Excel (after saving the results file in Excel format), I was able to quickly sort for these surnames, as in the example shown below. Hmmm, I wonder if Claxon is relevant too. I never considered that possibility – nor would I have ever seen Claxon in a surname search, because I wouldn’t have searched for Claxon.
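If you would rather script the surname search than sort in a spreadsheet, the idea can be sketched in a few lines of Python. This is only an illustration: the sample rows are made up, and the column names MatchName and Surname are assumptions, so check the header row of your own ancestors file before adapting it.

```python
import csv
import io
import re

# Hypothetical sample standing in for the downloaded ancestors file.
# The column names "MatchName" and "Surname" are assumptions.
SAMPLE = """MatchName,Surname
Match A,Claxton
Match B,Smith
Match C,Clarkson
Match D,Claxon
"""

# Matches Claxton, Claxon and Clarkson, ignoring case.
SURNAME_PATTERN = re.compile(r"^cla(x|rks)t?on$", re.IGNORECASE)


def find_surname_rows(csv_text, pattern):
    """Return the rows whose Surname column matches the pattern."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if pattern.match(row["Surname"].strip())]


hits = find_surname_rows(SAMPLE, SURNAME_PATTERN)
for row in hits:
    print(row["MatchName"], row["Surname"])
```

A pattern like this catches spelling variants, such as Claxon, that an exact surname search would miss.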

I’m brick walled on the Claxton line in Russell County, Virginia in about 1799. My ancestor, James Lee Claxton, was born someplace in Virginia about 1775. Utilizing Y DNA, we know of another man, also named James Claxton, born about 1750 first found in Granville and Bertie County, NC, who sired an entire lineage of Claxtons who migrated to Bedford County, TN.  However, that James is not the father of my ancestor, because that James had a different son named James. Other than these two distinct groups, we can’t seem to match with anyone else who has tested their Y DNA at Family Tree DNA, so my hope, for now, is an autosomal match with a known Claxton line out of Virginia.

(Shameless plug – if you are a Claxton or Clarkson male, please test your Y DNA at Family Tree DNA and join the Claxton DNA project. If you have Claxton or Clarkson ancestry from any line, and have taken the Family Finder test or transferred autosomal results from another vendor, please join the Claxton/Clarkson DNA project at Family Tree DNA. If you have Claxton or Clarkson ancestry and haven’t yet DNA tested, please do.)

Therefore, my goal is to find matches to other Claxton or Clarkson individuals who don’t share a known common ancestor with me. Because we don’t share a known common ancestor, of course, these people would never be shown as an Ancestry green leaf “DNA+tree match,” nor is there another way for me to obtain a surname list like this at Ancestry.

After finding Claxton candidates, then I can refer to the other downloaded files or sign on to my account at Ancestry to look at the match itself and other ICW matches. Hopefully, some of my matches will also match some of my Claxton cousins as well, which would suggest that the match might actually be through the Claxton line.
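The cross-check described above can also be sketched in code. The pairs below are invented and simply stand in for rows of the ICW file, whose real column layout I am not assuming here. The function name shared_with_known_cousins is hypothetical.

```python
from collections import defaultdict

# Hypothetical (match, in-common-with) pairs standing in for the ICW file.
ICW_PAIRS = [
    ("Match A", "Cousin 1"),
    ("Match A", "Match X"),
    ("Match C", "Match X"),
    ("Match D", "Cousin 2"),
]


def shared_with_known_cousins(candidates, known_cousins, icw_pairs):
    """Map each candidate to the known cousins they also match."""
    icw = defaultdict(set)
    for match, other in icw_pairs:
        icw[match].add(other)
    result = {}
    for name in candidates:
        overlap = icw[name] & set(known_cousins)
        if overlap:
            result[name] = sorted(overlap)
    return result


supported = shared_with_known_cousins(
    candidates=["Match A", "Match C", "Match D"],
    known_cousins=["Cousin 1", "Cousin 2"],
    icw_pairs=ICW_PAIRS,
)
print(supported)
```

A candidate who appears in the result is only a lead, not proof, since two people can match in common through a different line.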

The DNAGedcom client also downloads the same type of information from 23andMe, which isn’t nearly as useful without trees, as well as from Family Tree DNA.

Thanks so much to www.dnagedcom.com.

_____________________________________________________________________

Standard Disclosure

This standard disclosure appears at the bottom of every article in compliance with the FTC Guidelines.

Hot links are provided to Family Tree DNA, where appropriate. If you wish to purchase one of their products, and you click through one of the links in an article to Family Tree DNA, or on the sidebar of this blog, I receive a small contribution if you make a purchase. Clicking through the link does not affect the price you pay. This affiliate relationship helps to keep this publication, with more than 850 articles about all aspects of genetic genealogy, free for everyone.

I do not accept sponsorship for this blog, nor do I write paid articles, nor do I accept contributions of any type from any vendor in order to review any product, etc. In fact, I pay a premium price to prevent ads from appearing on this blog.

When reviewing products, in most cases, I pay the same price and order in the same way as any other consumer. If not, I state very clearly in the article any special consideration received. In other words, you are reading my opinions as a long-time consumer and consultant in the genetic genealogy field.

I will never link to a product about which I have reservations or qualms, either about the product or about the company offering the product. I only recommend products that I use myself and bring value to the genetic genealogy community. If you wonder why there aren’t more links, that’s why and that’s my commitment to you.

Thank you for your readership, your ongoing support and for purchasing through the affiliate link if you are interested in making a purchase at Family Tree DNA.

Working with the New Big Y Results (hg38)

If you are a Family Tree DNA customer, and in particular, a male or manage male kits, you’re familiar with the Big Y test.

The Big Y test scans the entire gold standard region of the Y chromosome, hunting for mutations, called SNPs, that define your haplogroup with great precision. This test also discovers SNPs never before found.  Those newly discovered SNPs may someday become new haplogroup branches as well. The Big Y test is how the Y DNA phylotree has been expanded from a few hundred locations a few years ago to more than 78,000, and along with that comes our understanding of the migration patterns of our ancestors.

We’re still learning, every single day, so testing new people continues to be important.

The Big Y is the logical extension of STR testing (panels 37, 67 and 111), which focuses on genealogical matches, closer in time, instead of haplogroup era matches. STR locations mutate more rapidly than SNPs, so the STR test is more useful for genealogists, or at least represents an entry point into Y DNA testing. SNPs generally reach further back in time, showing us where our ancestors were before STR test results kick in.  More and more, those two tests have some time overlap as more SNPs are discovered.

If you want to read more, I wrote about this topic in the article, “Why the Big Y Test?”.  Ignore the pricing information at the end of that article, as it’s out of date today.

Before we talk about the new format of the Big Y results, let’s take a step back and look at the multiple reasons why Family Tree DNA created a new Big Y experience.

The first reason is that the human reference genome changed.

What is the Human Reference Genome?

The Human Reference Genome is a genetic map against which everyone else is compared.  In essence, it’s an attempt to give every location in our genome an address, and to have them all line up on streets where they belong on a nice big chromosome by chromosome grid.

That’s easier said than done.  Let’s look at why and begin with a little history.

Hg refers to the human reference genome and 38 is the current version number, released in December of 2013.

The previous version was hg19, released in February of 2009.

This seems like a long time ago, but each version requires extensive resources to convert data from previous versions to the newer version.  Different versions are not compatible with each other.

You can read more about this here, here, here and here, if you really want to dig in.

Hg19, the version that we’ve been using until now, was based only on 13 anonymous volunteers from Buffalo, New York. Hg38 uses far more samples and resequences previously sequenced results as well. We learned a lot between 2009 when the previous version, hg19, was released and 2013 when hg38 was released.

Keeping in mind that people are genetically far more alike than different, sequencing allows most of the human genome to be mapped when the genomes of those reference individuals are compared in layers, stacked on top of each other.

The resulting composite reference map, regardless of the version, isn’t a reflection of any one person, but a combination of all of those people against which the rest of us are compared.

Areas of high diversity, in this case, Y SNPs, may differ from each other. It’s those differences that matter to us as genealogists.

In order to find those differences, we must be able to line up the genomes of the various people tested, on top of each other, so that we can measure from the locations that are the same.

Here’s an example.  All 4 people in the table above match exactly on locations 1-7, 9-10 and 13-15.

Locations 8, 11 and 12 are areas that are more unstable, meaning that not everyone carries the same value at those locations, although some of the people may still match each other, hence the different colored cells.

From this model, we know that we can align most people’s results on the green locations where everyone matches everyone else because we are all human.
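To make the idea concrete, here is a small sketch that separates the locations where everyone agrees from the ones where they differ. The four 15-letter strings are made-up values chosen so that locations 8, 11 and 12 vary, as in the example. They are not real DNA results.

```python
# Four hypothetical testers, 15 locations each (made-up bases, not real data).
SAMPLES = [
    "ACGTACGAACCATGC",
    "ACGTACGTACCGTGC",
    "ACGTACGAACTATGC",
    "ACGTACGAACCATGC",
]


def split_locations(samples):
    """Return (stable, variable) lists of 1-based locations."""
    stable, variable = [], []
    for index, column in enumerate(zip(*samples), start=1):
        if len(set(column)) == 1:
            stable.append(index)  # everyone carries the same value here
        else:
            variable.append(index)  # at least one person differs
    return stable, variable


stable, variable = split_locations(SAMPLES)
print("stable:", stable)
print("variable:", variable)
```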

The other locations may be the same or different, but they can’t be aligned reliably using the map alone. You can read more about the complexity of this topic here, and there is a good article here.

A New Model

The challenge is that between 2009 and 2013, new locations were discovered in previously unmapped areas of the genome.

Think of genome locations as kids sitting in assigned seats side by side in a row.

Where do we put the newly discovered kids?

They have to crowd in someplace onto our existing map.

We have to add chairs between locations. The white rows below represent the newly discovered locations.

When we add chairs, the “addresses” of the kids currently sitting in chairs will change.  In fact, the address of everyone on the street might change because everyone has shifted.  Many of the actual kids will be the same, but some will be new, even though all of the kids will be referenced by new addresses.
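The chair analogy can be sketched as a tiny address calculation. This is a toy model under my own assumptions: each entry in insert_points is an old address that gets one new chair placed immediately before it. Real conversions between reference versions are far more involved than this.

```python
import bisect


def shifted_address(old_address, insert_points):
    """Return the new 1-based address of a kid after chairs are added.

    insert_points lists old addresses that each get one new chair
    placed immediately before them (a simplifying assumption).
    """
    points = sorted(insert_points)
    # Every new chair at or before this seat pushes it one place along.
    return old_address + bisect.bisect_right(points, old_address)


insert_points = [3, 6]
for old in [1, 2, 3, 5, 6]:
    print(old, "->", shifted_address(old, insert_points))
```

Kids seated before the first new chair keep their address, while everyone after it shifts, and the shift grows with each additional chair.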

This is a very simplified conceptual explanation of a complex process which isn’t simple at all.  In addition to addressing, this process has to deal with DNA insertions, deletions, STR markers which are repeats of segments, palindromic mutations as well as pseudo-autosomal regions of the Y chromosome. Additionally, not all reads or calls are valid, for a number of reasons. Due to all these factors, after the realignment is complete, analysis has to follow.

Suffice it to say that converting from one version to the next requires the data to be reanalyzed with a new filter which requires a massive amount of computational power.

Then, the wheat has to be sorted from the chaff.

Discovery

The conversion to hg38 has been a boon for discovery, already.  For example, Dr. Michael Sager, “Dr. Big Y” at Family Tree DNA has been busily working through the phylotree to see what the new alignment provides.

In November, he mentioned that he had discovered the correct placement for a new haplogroup, high in the R1b tree, that joined together several subclades of U106.

In hg19, U106 had 9 subclades, all of which then branched downwards.

However, in hg38, utilizing the newly aligned genome, Michael can see that U106 has been reconfigured and looks like this instead.

Look at the difference!

  • Two new haplogroups have been placed in their proper location in the tree: Z2265 and BY30097.
  • A2150 has been repositioned.
  • Because of the placement of A2150 and Z2265, U106 now only has two direct branches.
  • S19589 has been moved beneath Z2265.
  • The remaining 7 peach colored haplogroups in the old tree are now subclades of BY30097.

You may not know or realize that this shuffle occurred, but it has and it’s an important scientific discovery that corrects earlier versions of the phylotree.

Congratulations Dr. Sager!

So, how does the conversion to hg38 affect customers directly?

The Conversion

In or about October 2017, Family Tree DNA began their conversion to hg38. Keep in mind that no other vendor has to do this, because no other vendor provides testing at this level for Y DNA, combined with matching.

Not only that, but there is no funding for their investment in resources to do the conversion.  By that I mean that once you purchase the product, there is no annual subscription or anything else to fund development of this type.

Additionally, Family Tree DNA designed a new user interface for the enhanced Big Y which includes a new Big Y browser.

The initial conversion has been complete for some time, although tweaking is still occurring and some files are being reconverted when problems are discovered.  Now, the backlog of tests that accumulated during the conversion and during the holiday sale is being processed.

So, what does this mean to the consumer?  How do we work with the new results?  What has changed and what does all of this mean?

It’s an exciting time. We’re all waiting for new matches.

I’m going to step through the features and functions one at a time, explaining the new functionality and then what is different, and why.

First Look

On your personal page, you have Big Y Results and Big Y Matches.

Either selection takes you to the same page, but with a different tab highlighted.

Named Variants

Named variants are SNPs that are already known and have been given SNP names.

At the bottom of the page, you can see that this person has 946 SNPs out of 77,722 currently on the tree.  Many SNPs on the tree are equivalent to each other.

The information about each SNP on this page shows that it’s derived, meaning it’s a mutation and not ancestral which is the original state of the DNA.

If you look closely, you’ll see that some of the Reference and Genotype values are the same.  You would logically expect them to be different.  These are genuine mutations, but they are listed as the same because in hg19, the reference model, which is a composite, is skewed towards haplogroup R.  In haplogroup R, these values are the same as the person tested (who is R-BY490), so while these are valid mutations on the tree of humanity, they are derived and found in all of haplogroup R. The same thing happens to some extent with all haplogroups because the reference sequence is a composite of all haplogroups.

The next column indicates whether the SNP has or hasn’t yet been placed on the Y tree.

The Reference column refers to the value at this address shown in the hg38 reference model, and the Genotype column shows the tester’s result at that location.

The confidence column shows the confidence level that Family Tree DNA has in this call. Let’s talk about confidence levels for a minute, and what they mean.

Confidence Levels

The Big Y test scans the Y chromosome, looking for specific blips at certain addresses.  Every location has a “normal” blip for the Y chromosome as determined by the reference model.  Any blips that vary from the reference model are flagged for further evaluation.

Blips can be caused by a mutation, a read error or a complex area of DNA, which is why there is a threshold for a minimum number of scans to find that same anomaly at any single location.

The area considered the “gold standard” portion of the Y chromosome which is useful genealogically is scanned between 55 and 80 times.  Then the scans are aligned and compared to each other, with the blips at various locations being reported.

The relevance of blips can vary by location and what is known as density in various regions.  In general, blips are not considered to be relevant unless they are recorded a minimum of 5 to 8 times, depending on the region of the Y chromosome.  At that level, Family Tree DNA reports them as a medium confidence call. High confidence calls are reported a minimum of 10 times.
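Here is a minimal sketch of those thresholds as a simple rule. It is a simplification that looks only at the number of reads showing the blip. The function name classify_call and the "not reported" label are my own, and the medium threshold is a parameter because it ranges from 5 to 8 depending on the region.

```python
def classify_call(derived_reads, medium_minimum=5, high_minimum=10):
    """Label a blip by how many reads recorded it.

    medium_minimum should be set between 5 and 8 depending on the
    region of the Y chromosome; 5 is used here only as a default.
    """
    if derived_reads >= high_minimum:
        return "high"
    if derived_reads >= medium_minimum:
        return "medium"
    return "not reported"


print(classify_call(2))                    # too few reads
print(classify_call(6))                    # medium in a lenient region
print(classify_call(6, medium_minimum=8))  # same count, stricter region
print(classify_call(12))                   # high
```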

Some individuals and third-party companies read the BAM files and offer analysis, often project administrators within haplogroup projects.  Depending on the circumstances, they may suggest that as few as 2 blips are enough to consider the blip a mutation and not a read error.  Therefore, some third-party analysis will suggest additional haplogroups not reported by Family Tree DNA. Project administrators often collaborate with Dr. Sager to coordinate the placement of SNPs on the tree.

Therefore, at Family Tree DNA:

  • You will see only medium and high confidence calls for SNPs.
  • Over time, your Unnamed Variants will disappear as they are named and become Named Variants with SNP names.
  • When Unnamed Variants become Named Variants, which are SNPs that have been named, they are eligible to be added to the Y tree.
  • If the SNP added to the Y tree is below your present terminal SNP, you may one day discover that you have a new terminal SNP, meaning new haplogroup, listed on your main page. If the new SNP is within 5 upstream of your terminal SNP, looking backward up the tree, you’ll see it appear in your mini-tree on your personal page and on your larger Haplogroup and SNP page.

Unnamed Variants

Unnamed variants are newer mutations that have not yet been named as SNPs.

In order for a mutation to be considered a SNP, in true genetics terms, it has to be found in over 1% of the population.  Otherwise, it’s considered a private, personal, family or clan mutation.

However, in reality, Family Tree DNA attempts to figure out which SNPs are being found often enough to warrant the assignment of a SNP number which means they can be placed on the haplotree of humanity, and which SNPs truly are going to be private “family mutations.”  Today, nearly all mutations found in 3 or more individuals that are considered high confidence calls are named as SNPs.
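The "3 or more individuals" rule of thumb can be sketched as a simple count across kits. The kit names and positions below are invented, and the function name naming_candidates is hypothetical. This only illustrates the counting, not how Family Tree DNA actually decides what to name.

```python
from collections import Counter

# Hypothetical high confidence unnamed variants per kit; positions are made up.
HIGH_CONFIDENCE_VARIANTS = {
    "kit1": {1001, 2002, 3003},
    "kit2": {1001, 2002},
    "kit3": {1001, 4004},
    "kit4": {2002},
}


def naming_candidates(variants_by_kit, minimum_kits=3):
    """Return positions seen in at least minimum_kits different kits."""
    counts = Counter(
        position
        for positions in variants_by_kit.values()
        for position in positions
    )
    return sorted(pos for pos, seen in counts.items() if seen >= minimum_kits)


print(naming_candidates(HIGH_CONFIDENCE_VARIANTS))
```

Positions seen in fewer kits would remain unnamed variants, the private "family mutations" described above.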

Both named and unnamed variants are a good thing.  New SNPs help expand and grow the tree.  Personal or family SNPs can be utilized in the same fashion as STR markers.  Eventually, as new SNPs are categorized and named, they will be moved from your Unnamed Variants page and added to your Named Variants page.

If you had results in the hg19 version, your unnamed variants will have changed.  Just like those kids sitting on the bleachers, your old variants are either:

  • Still here but with a new name
  • Have been given SNP names and are now on your Named Variants list

The great news is that you’ll very probably have new variants too, resulting from the new hg38 reference model and more accurate alignment.

If you’re really a die-hard and want to know which hg19 locations are now hg38 locations, you can do the address conversion here.  I am a die-hard but not this much of a die-hard, plus, I didn’t record the previous novel variant locations for my kits.  Dr. Sager who has run this program tells me that you only need to pay attention to the two drop down menus specifying the “original” and “new” assemblies when utilizing this tool.

Y Chromosome Browser Tool

You’ve probably already noticed the really cool new browser tool, positioned tantalizingly to the right of both results tabs.

Go ahead and click on either a SNP name or an unnamed variant.

Either one will cause a pop up box to open displaying the location you’ve selected in the Big Y browser.

Utilizing the new Y chromosome browser tool, you can see the number of times that a specific SNP was called as positive or negative during the scan of your Y DNA at that specific location.

To see an example, click on any SNP on the list under the SNP Name column.

The Y chromosome browser tool opens up at the location of the SNP you selected.

The SNP you selected is displayed in pink with a downward arrow pointing to the position of the SNP. The other pink locations display other nearby SNP positions.

See that one single pink blip to the far right in the example above?  That’s a good example of just one call, probably noise.  You can see the difference between that one single call and high confidence reads, illustrated by the columns of pink SNP reads lined up in a row.

You can click on any of your SNP positions, named or unnamed, to see more information for that specific SNP.

Pink indicates that a mutation, or derived value, was found at that location as compared to the ancestral value found in the reference model.

Blue rows and green rows indicate that the forward (blue) or reverse (green) strand was being read.

The intensity of the colors indicates the relative strength of the read confidence, where the most intense is the highest confidence.

The value listed at the top, T, A, C or G is the abbreviation for the ancestral reference nucleobase value found in the reference population at that genetic location, and the value highlighted in pink is the derived (mutated) value that you carry.

Confidence is a statistical value calculated based upon the number of scans, the relative quality of that part of the Y chromosome and the number of times that derived value was found during scanning.
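As a rough picture of what the browser is showing at one location, here is a sketch that tallies reads by strand and by whether they carry the reference value or something else. The reads are invented, and real confidence also takes region quality into account, which this sketch ignores.

```python
from collections import Counter


def tally_reads(reference_base, reads):
    """Count reads at one location.

    reads is a list of (strand, base) pairs, where strand is
    "forward" or "reverse" (blue and green rows in the browser).
    """
    derived = Counter()
    ancestral = Counter()
    for strand, base in reads:
        if base == reference_base:
            ancestral[strand] += 1
        else:
            derived[strand] += 1  # differs from the reference: a pink blip
    return {"derived": dict(derived), "ancestral": dict(ancestral)}


# Hypothetical location: reference is C, most reads show T.
reads = [("forward", "T")] * 6 + [("reverse", "T")] * 5 + [("forward", "C")]
print(tally_reads("C", reads))
```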

I love this new tool.

I hope that in the next version, Family Tree DNA will include the ability to look at additional locations not on the list.

For example, I was recently working on a Personalized DNA Report where the SNP below the tester’s terminal SNP was not called one way or another, positive or negative.  I would have liked to view his results for that SNP location to see if he has any blips, or if the location read at all.

Matching

The third tab displays your Big Y matches and a mini-tree of your 5 SNPs at the end of your own personal branch of the haplotree.

Your terminal SNP determines the terminal (final or lowest) subbranch (on the Y-DNA haplotree) to which you belong.

On your mini-tree, your terminal SNP (R-BY490 above) is labeled YOU.

The number of people you match on those SNPs utilizing the new matching algorithm is displayed at each branch of the tree.

The matches shown above are the matches for this person’s terminal SNP. To see the people matching on the next branch above the terminal SNP, click on R-BY482.

The number listed beside these SNPs on your 5 step mini-tree is NOT the total number of people you match on that branch, only the number you match on that branch AFTER the matching algorithm is applied.

I’m emphasizing this because, based on the previous matching algorithm that managed to include everyone on your terminal SNP, it’s easy to presume the new version shows everyone in the system who matches you on that SNP – and it doesn’t necessarily.  If you assume it does or expect that it will, you’re likely to be wrong. There is a significant amount of confusion surrounding this topic in the community.

New Matching Algorithm

The Family Tree DNA matching algorithm has changed substantially. It needed to be updated, as the old matching algorithm had been outgrown with the dramatic new number of SNPs discovered and placed on the phylotree. Family Tree DNA created the original matching software when the Big Y was new and it was time for a refresh. In essence, the Big Y testing and tree-building has been successful beyond anyone’s wildest dreams and the matching routine became a victim of its own success.

Previously, Family Tree DNA used a static list of somewhere around 6,000 SNPs as compared to over 350,000 today, of which more than 78,000 have been placed on the tree. By the way, this SNP number grows with every batch of Big Y results because new SNPs are always found.

The previous threshold for mismatches was 4 SNPs. As time went on, this combination of a growing tree and a static SNP list caused increasingly irrelevant matches.

For example, in some instances, haplogroup U106 people matched haplogroup P312 people, two main branches of the R1b haplotree, because when compared to the old SNP list, they had 4 or fewer SNP mismatches.

The new Big Y matching routine expands as the new tree grows, and isn’t limited.  This means that people who were shown as matches to haplogroups far upstream (e.g. P312/U106), whose common ancestor lived many thousands of years ago, won’t be shown as matches at that level anymore.

Many people had hundreds of matches and complained that they were being shown matches so distant in time that the information was useless to them.

The previous Big Y version match criteria were:

  • 4 or fewer differences in Known SNPs (now Named Variants.)
  • In addition, you could have unlimited differences in Unnamed Variants, then called Novel Variants.

Family Tree DNA has attempted to make the matching algorithm more genealogically relevant by applying a different type of threshold to matching.

In the current Big Y version, a person is considered a match to you if they have BOTH of the following:

  • 30 or fewer differences in total SNPs (named and unnamed variants combined.)
  • Their haplogroup is downstream from your terminal SNP haplogroup or downstream from your four closest parent haplogroups, meaning any of the 5 haplogroups shown on your 5 step mini-tree.
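The two criteria can be sketched as a small function. This is my own reading of the rule, not Family Tree DNA’s code: I treat "downstream" as including the haplogroup itself, and the little tree below is invented (the names N0 through N7 and S1 are placeholders, not real haplogroups).

```python
# Hypothetical haplotree as child -> parent links. N0 is the root,
# N1..N7 form one line of descent, and S1 is a separate branch off N0.
PARENT = {
    "N1": "N0", "N2": "N1", "N3": "N2", "N4": "N3",
    "N5": "N4", "N6": "N5", "N7": "N6", "S1": "N0",
}


def lineage(haplogroup):
    """Return the haplogroup followed by each of its ancestors."""
    line = [haplogroup]
    while line[-1] in PARENT:
        line.append(PARENT[line[-1]])
    return line


def mini_tree(terminal, steps=5):
    """Return the terminal haplogroup and up to four closest parents."""
    return lineage(terminal)[:steps]


def is_match(my_terminal, their_terminal, total_differences, max_differences=30):
    """Apply both criteria: difference count and haplogroup position."""
    if total_differences > max_differences:
        return False
    their_line = set(lineage(their_terminal))
    # They qualify if they sit at or below any haplogroup on my mini-tree.
    return any(h in their_line for h in mini_tree(my_terminal))


print(is_match("N5", "N6", 10))   # close branch, few differences
print(is_match("N5", "N6", 31))   # same branch, too many differences
print(is_match("N5", "S1", 5))    # few differences, wrong branch
```

Under this reading the rule is not symmetric: someone at N7 appears on the list of a tester at N1, but N1 is not downstream of anything on N7’s mini-tree. Whether that explains any particular real-world case is something this sketch can’t tell you.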

Here’s the logic behind the new matching algorithm threshold.

SNP mutations happen on the average of one every 100 years.  This number is still discussed and debated, but this estimate is as good as any.

If two men share a common ancestor who had two sons 1500 years ago, and each son’s line incurred 1 mutation every hundred years, each line would accumulate about 15 mutations, so at the end of 1500 years the number of mutations between the two men would be approximately 30.

Family Tree DNA felt that 1500 years was a reasonable cutoff for a genealogical timeframe, hence the new matching threshold of 30 mutations difference.
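The arithmetic behind the threshold is simple enough to write down. This sketch assumes the average of one mutation per 100 years per line quoted above, which is still debated, and the function names are my own.

```python
def expected_differences(years_since_ancestor, years_per_mutation=100, lines=2):
    """Expected mutations separating two men, counting both lines of descent."""
    return lines * years_since_ancestor / years_per_mutation


def years_covered(max_differences=30, years_per_mutation=100, lines=2):
    """Time depth at which two lines are expected to reach the threshold."""
    return max_differences * years_per_mutation / lines


print(expected_differences(1500))  # mutations expected after 1500 years
print(years_covered(30))           # years implied by a 30 mutation cutoff
```

Because this is an average, any individual pair of lines can land well above or below the expected number.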

The new match criteria are designed to reflect your matches that are most closely related to you.  In other words, the people on your match list should be related to you within approximately the last 1500 years, and people not on your match list who have taken the Big Y are separated from you by more than 30 mutations.

There may be people in the data base that match you on your terminal SNP and any or all of the SNPs shown on your mini-tree, but if you and they are separated by more than 30 differences (including both named and unnamed variants) on the Y chromosome, they will not be shown as a match.  

By clicking on the SNP name on your mini-tree, at right, you can see all of the people who match you with 30 or fewer differences total at each level, and who carry that particular Named Variant (SNP). The example shown above shows this person’s matches on their terminal SNP. If they were to click on BY482, the next step up, they would then see everyone on their match list who is positive for that SNP.

On your match page, you can search for a specific surname, nonmatching variants or match date.

The Shared Variants column is the total number of shared variants you have with the match in question.  According to the lab at Family Tree DNA, this number is very high because it is reflective of many ancient variants.

You can also download your data from this page into a spreadsheet.

The Biggest Differences

What you don’t receive today, that you did receive before, is a comprehensive list of who you match on your terminal and upstream SNPs.

For example, I was working with someone’s results this week.  They had no matches, as shown below.

However, when I went to the relevant haplogroup project page, I discovered that indeed, there are at least 4 additional individuals who do share the same terminal SNP, but the tester would never know that from their Big Y results alone, if they didn’t check the project results page.

Of course, it’s unlikely that every person who takes the Big Y test joins a Y DNA project, or the same Y DNA project.  Even though projects will show some matches, assuming that the administrator has the project grouped in this manner, there is no guarantee you are seeing all of your terminal SNP matches.

Project administrators, who have been instrumental in building the tree, can also no longer see who matches on terminal SNPs, at least not if they are separated by more than 30 mutations. This hampers their ability to build the Y tree.

This matching change makes it critical that people join projects AND make their results viewable to project members as well as publicly.  Most people don’t realize that the default when joining projects is that ONLY project members can see their results in the project. In other words, the results are not available in the public project, like the screenshot above.

You can read more about Family Tree DNA’s privacy settings here.

Another result of the matching algorithm change is that in some cases, one man may match a second man, but the second man does not show up on the first man’s match list.

I know that sounds bizarre, but in the Estes project, we have that exact scenario.

The chart above shows that none of the Estes Big Y participants match kit number 166011, also an Estes male, but kit 166011 does show matches to all of those Estes men.

Kit 166011 is the one to the far right on the pedigree chart above, and he is descended from a different son of Robert born in 1555 than the rest of the men.  Counting from kit 166011 to Robert born in 1555 is 12 generations.  Counting from kits 244708 and 199378 to Robert is 10 generations, so a total of 22 generations between those men.

Kits 366707, 9993 and 13805 are 11 generations from the common ancestor, so a total of 23 generations.  Not only are these genealogically relevant, they carry the same surname.

The average of 30 mutations reaching to 1500 years doesn’t work in this case.  The cutoff was about 1555, or 462 years, not 1500 years – so the matching algorithm failed at 30% of the estimated time it was supposed to cover.  I guess this just goes to prove that mutations really don’t happen on any type of a reliable schedule – and the average doesn’t always pertain to individual family circumstances.
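For anyone who wants to check the numbers, here is the same arithmetic as a short sketch. The year 2017 is my assumption, implied by 1555 plus 462, and the one mutation per 100 years per line figure is the average quoted earlier.

```python
TEST_YEAR = 2017            # assumption: implied by 1555 + 462
ANCESTOR_BIRTH_YEAR = 1555  # Robert Estes, the common ancestor
INTENDED_WINDOW = 1500      # years the 30 mutation cutoff is meant to cover

years = TEST_YEAR - ANCESTOR_BIRTH_YEAR
fraction_of_window = years / INTENDED_WINDOW

# Expected differences if both lines averaged 1 mutation per 100 years.
expected = 2 * years / 100

print(years)
print(round(fraction_of_window, 2))
print(round(expected, 2))
```

On the average rate, two lines separated for 462 years would be expected to differ by only about 9 mutations, far short of the 30 mutation cutoff, which is what makes this case so surprising.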

If you’re wondering if these men match on STR markers, they do.

In this case, the Big Y doesn’t show matches in a timeframe that STR markers do – the exact opposite of what we would expect.

One of the benefits of the Big Y, previously, was the ability to view people of other surnames who matched your SNP results.  This ability to peer back into time informed us of where our ancestors may have been prior to where we found them.  While this isn’t genealogy, per se, it’s certainly family history.

A good case in point is the Scottish clans and how men with different surnames may be related.

As a family historian I want to know who I match on my terminal SNP and the direct upstream SNPs so I can walk this line back in time.

What’s Coming

At the conference in Houston in November, Elliott Greenspan discussed a new direction for the Big Y in 2018.  The new feature that all Big Y testers are looking forward to is the addition of STRs beyond the 111 marker panels, extracted from the Big Y as a standard product offering. Meaning free for Big Y testers.

The 111 and lower panels will continue to be tested on their current Sanger platform.  Analysis of more than 3700 samples in the database that have both the Big Y and 111 markers indicates that only 72 of the 111 STR markers can be reliably and consistently extracted from the Big Y NGS scan data. The last thing we want is unreliable NGS data being compared to our Sanger sequenced STR values. We need to be able to depend on those results as always being reliable and comparable to each other. Therefore, only STR markers above 111 will be extracted from the Big Y and the original 111 STR markers will continue to be sold in panels, the same as today.

However, because of the nature of scanning DNA as opposed to directly testing locations, not all of the markers above 111 will be available for everyone. Some marker locations will fail to read, or fail to read reliably.  These won’t necessarily be the same markers, but read failure will apply to some markers in just about every individual’s scan.  Therefore, these additional STR markers will be supplemental to the regular 111 STR markers. You get what you get.

How many additional markers will be available through Big Y?  That hasn’t been finalized yet.

Elliott said that in order to reliably obtain 289 additional markers, they need to attempt to call 315.  To get 489, they have to attempt more than 600, and many are less useful.
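To put Elliott's numbers in perspective, here is a rough sketch in Python of the implied success rates. Because he said "more than 600" attempts, I treat 600 as a lower bound on attempts, which makes the second and third figures upper bounds. The names are mine, not anything from Family Tree DNA.

```python
def call_yield(reliable, attempted):
    """Fraction of attempted STR markers that are reliably called."""
    return reliable / attempted


# 289 reliable markers out of 315 attempted.
first_tier = call_yield(289, 315)

# 489 reliable markers out of MORE than 600 attempted,
# so using 600 gives an upper bound on the yield.
second_tier_upper = call_yield(489, 600)

# Yield on just the extra markers needed to go from 289 to 489 (upper bound).
marginal_upper = call_yield(489 - 289, 600 - 315)

print(f"{first_tier:.1%}")         # 91.7%
print(f"{second_tier_upper:.1%}")  # 81.5%
print(f"{marginal_upper:.1%}")     # 70.2%
```

In other words, each additional marker beyond the first 289 gets harder to obtain reliably.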

Therefore, speculating, I’d guess that we’ll see someplace between 289 and 489, the numbers Elliott mentioned.

Are you salivating yet?

Given that the webpage and display tools have to be redesigned for both individuals’ results, project pages and project administrators’ tools, I’d guess that we won’t see this addition until after they get the kinks worked out of the hg38 conversion and analysis.

It’s nice to know that it’s on the way though. Something to look forward to later in 2018.

In Summary

I know that the upgrade to hg38 had to be done, but I hated to see it.  These things never go smoothly, no matter who you are, and this was a massive undertaking.

I’m glad that Family Tree DNA is taking this opportunity to innovate and provide the community with the nifty new Y DNA browser.

I’m also grateful that they listen to their customers and make an effort to implement changes to help us along the genealogy path.

However, sometimes things fall into the well of unintended consequences.  I think that’s what’s happening with the new matching routine. I know that they are continuing to work to tweak the knobs and refine the results, so you’re likely to see changes over the next few months. It’s not like there was a pattern or recipe anyplace.  This has never been done before.

Here’s a list of changes and updates I’d suggest to improve the new hg38 Big Y experience:

  • In addition to threshold matching, an option for direct SNP tree matching through the 5 SNPs shown on the participant’s 5 step mini-tree, purely based on haplotree matching. This second option would replace the functionality lost with the 30-mutation threshold matching today.
  • A matches map of the most distant ancestors at each level of matching for both threshold matching and SNP tree matching.
  • An icon indicating whether a Big Y match is an STR match and which level of STR panel testing the match has completed. This means that we could tell at a glance that a Big Y match has tested to 111 markers, but is only a match at 12.
  • An icon indicating if the Big Y match has also taken the Family Finder test, and if they are a match.
  • An icon on STR matches pages indicating that a match has taken a Big Y test and if they are a match.
  • Ability to query through the Big Y browser to SNP locations not on the list of named or unnamed variants.
  • Age estimates for haplogroups.

If you are seeing Big Y results that you find unusual or confusing, please notify Family Tree DNA support. There is a contact link with a form at the bottom of your personal page.  Family Tree DNA needs to be aware of problems and also of customers’ desires.

Family Tree DNA has indicated that they are soliciting customer feedback on the new Big Y matching and tools.

Please also join a relevant haplogroup project as well as a surname project, if you haven’t already. Here’s an article, What Project Do I Join?, to help you find relevant projects.

If you think you have an unnamed variant that should be named and placed on the phylotree, your haplogroup project administrator is the person who will work with you to verify that the unnamed variant is a good candidate and submit the unnamed variant to Family Tree DNA for naming.

If you are a project administrator having issues, questions or concerns, you can contact the group projects team at groups@ftdna.com.  Be sure that this address is in the “to” field, not the “cc” field as the e-mail will bounce otherwise.

Don’t forget that you can reference the Family Tree DNA Learning Center about your Big Y results.

Thank you to Dr. Sager for his assistance with this article.

_____________________________________________________________________

Standard Disclosure

This standard disclosure appears at the bottom of every article in compliance with the FTC Guidelines.

Hot links are provided to Family Tree DNA, where appropriate.  If you wish to purchase one of their products, and you click through one of the links in an article to Family Tree DNA, or on the sidebar of this blog, I receive a small contribution if you make a purchase.  Clicking through the link does not affect the price you pay.  This affiliate relationship helps to keep this publication, with more than 900 articles about all aspects of genetic genealogy, free for everyone.

I do not accept sponsorship for this blog, nor do I write paid articles, nor do I accept contributions of any type from any vendor in order to review any product, etc.  In fact, I pay a premium price to prevent ads from appearing on this blog.

When reviewing products, in most cases, I pay the same price and order in the same way as any other consumer. If not, I state very clearly in the article any special consideration received.  In other words, you are reading my opinions as a long-time consumer and consultant in the genetic genealogy field.

I will never link to a product about which I have reservations or qualms, either about the product or about the company offering the product.  I only recommend products that I use myself and bring value to the genetic genealogy community.  If you wonder why there aren’t more links, that’s why and that’s my commitment to you.

Thank you for your readership, your ongoing support and for purchasing through the affiliate link if you are interested in making a purchase at Family Tree DNA, or one of the affiliate links below:

Affiliate links are limited to:

Concepts – DNA Recombination and Crossovers

What is a crossover anyway, and why do I, as a genetic genealogist, care?

A crossover on a chromosome is where the chromosome is cut and the DNA from two different ancestors is spliced together during meiosis as the DNA of the offspring is created when half of the DNA of the two parents combines.

Identifying crossover locations, and which ancestor the DNA we received came from, is the first step in identifying the ancestor further back in our tree who contributed that segment of DNA to us.

Crossovers are easier to see than conceptualize.

Viewing Crossovers

The crossover is the location on each chromosome where the orange and black DNA butt up against each other – like a splice or seam.

In this example, utilizing the Family Tree DNA chromosome browser, the DNA of a grandchild is compared to the DNA of a grandparent. The grandchild received exactly 50 percent of her father’s DNA, but only an average of 25% of the DNA of each of her 4 grandparents. Comparing this child’s DNA to one grandmother shows that she inherited about half of this grandmother’s DNA – the other half belonging to the spousal grandfather.

  • The orange segments above show the locations where the grandchild matches the grandmother.
  • The black sections (with the exception of the very tips of the chromosomes) show locations where the grandchild does not match the grandmother, so by definition, the grandchild must match the grandfather in those black locations (except chromosome tips).
  • The crossover location is the dividing line between the orange and black. Please note that the ends of chromosomes are notoriously difficult and inconsistent, so I tend to ignore what appear to be crossovers at the tips of chromosomes unless I can prove one way or the other. Of the 22 chromosomes, 16 have at least one black tip. In some cases, like chromosome 16, you can’t tell since the entire chromosome is black.
  • Ignore the grey areas – those regions are untested because they are SNP poor.

We know that the grandchild has her grandmother’s entire X chromosome, because the parent is a male who only inherited an X chromosome from his mother, so that’s all he had to give his daughter. The tips of the X chromosome are black, showing that the area is not matching the grandmother, so that region is unstable and not reported.

It’s also interesting to note that in 6 cases, other than the X chromosome, the entire chromosome is passed intact from grandparent to grandchild; chromosomes 4, 11, 16, 20, 21 and 22.

Twenty-six crossovers occurred between mother and son, at 5cM.  This was determined by comparing the DNA of mother to son in order to ascertain the actual beginning and end of the chromosome matching region, which tells me whether the black tips are or are not crossovers by comparing the grandchild’s DNA to the grandmother.

For more about this, you might want to read Concepts – Segment Survival – Three and Four Generation Phasing.

Before going on, let’s look at what a match between a parent and child looks like, and why.

Parent/Child Match

If you’re wondering why I showed a match between a grandchild and a grandparent, above, instead of showing a match between a child and a parent, the chromosome browser below provides the answer.

It’s a solid orange mass for each chromosome indicating that the child matches the parent at every location.

How can this be if the child only inherits half of the parent’s DNA?

Remember – the parent has two chromosomes that mix to give the child one chromosome.  When comparing the child to the parent, the child’s single chromosome inherited from the parent matches one of the parent’s two chromosomes at every address location – so it shows as a complete match to the parent even though the child is only matching one of the parent’s two chromosomes at each location.  This isn’t a bug; it’s just how chromosome browsers work. In other words, the “other” chromosome that your parent carries is the one you don’t match.

The diagram below shows the mother’s two copies of chromosome 1 she inherited from her father and mother and which section she gave to her child.

You can see that the mother’s father’s chromosome is blue in this illustration, and the mother’s mother’s chromosome is pink.  The crossover points in the child are between part B and C, and between part C and D.  You can clearly see that the child, when compared to the mother, does in fact match the mother in all locations, or parts, 3 blue and 1 pink, even though the source of the matching DNA is from two different parents.

This example shows the child compared to both parents, so you can see that the child does in fact match both parents on every single location.

This is exactly why two different matches may match us on the same location, but may not match each other because they are from different sides of our family – one from Mom’s side and one from Dad’s.

You can read more about this in the article, One Chromosome, Two Sides, No Zipper – ICW and the Matrix.

The only way to tell which “sides” or pieces of the parent’s DNA that the child inherited is to compare to other people who descend from the same line as one of the parents.  In essence, you can compare the child to the grandparents to identify the locations that the child received from each of the 4 grandparents – and by genetic subtraction, which segments were NOT inherited from each grandparent as well, if one grandparent happens to be missing.

In our Parental Chromosome pink and blue diagram illustration above, the child did NOT inherit the pink parts A, B and D, and did not inherit the blue part C – but did inherit something from the parent at every single location. They also didn’t inherit an equal amount of their grandparents pink and blue DNA. If they inherited the pink part, then they didn’t inherit the blue part, and vice versa for that particular location.
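For readers who think better in code, the pink and blue diagram can be modeled with a tiny Python sketch. The allele labels such as "bA" and "pC" are placeholders I made up so that the two copies differ at every part; they are not real genotypes.

```python
# The four parts of the mother's chromosome 1 in the illustration.
PARTS = ["A", "B", "C", "D"]

# Which of the mother's two copies the child received at each part:
# "blue" = from the mother's father, "pink" = from the mother's mother.
child_source = {"A": "blue", "B": "blue", "C": "pink", "D": "blue"}

# Placeholder alleles: the mother's two copies differ at every part.
mother = {
    "blue": {"A": "bA", "B": "bB", "C": "bC", "D": "bD"},
    "pink": {"A": "pA", "B": "pB", "C": "pC", "D": "pD"},
}

# The single chromosome the child inherited from the mother.
child = {part: mother[child_source[part]][part] for part in PARTS}


def matches_parent(part):
    """A browser shows a match if the child equals EITHER parental copy."""
    return child[part] in (mother["blue"][part], mother["pink"][part])


def crossover_points(sources, parts):
    """Adjacent parts where the contributing copy switches."""
    return [(a, b) for a, b in zip(parts, parts[1:]) if sources[a] != sources[b]]


all_match = all(matches_parent(part) for part in PARTS)
crossovers = crossover_points(child_source, PARTS)
not_inherited = {
    part: "pink" if child_source[part] == "blue" else "blue" for part in PARTS
}

print(all_match)      # True
print(crossovers)     # [('B', 'C'), ('C', 'D')]
print(not_inherited)  # {'A': 'pink', 'B': 'pink', 'C': 'blue', 'D': 'pink'}
```

The child matches the mother at every part, yet only three blue parts and one pink part were actually inherited, with a crossover at each place where the color switches.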

The parent to child chromosome browser view also shows us that the very tip ends of the chromosomes are not included in the matching reports – because we know that the child MUST match the parent on one of their two chromosomes, end to end. The download or chart view provides us with the exact locations.

This brings us to the question of whether crossovers occur equally in males and females.  We already know that the X chromosome has a distinctive inheritance pattern – meaning that males only inherit an X from their mothers.  A father and son will NEVER match on the X chromosome.  You can read more about X chromosome inheritance patterns in the article, X Marks the Spot.

Crossovers Differ Between Males and Females

In the paper Genetic Analysis of Variation in Human Meiotic Recombination by Chowdhury, et al, we learn that males and females experience a different average number of crossovers.

The authors say the following:

The number of recombination events per meiosis varies extensively among individuals. This recombination phenotype differs between female and male, and also among individuals of each gender.

Notably, we found different sequence variants associated with female and male recombination phenotypes, suggesting that they are regulated by different genes.

Meiotic recombination is essential for the formation of human gametes and is a key process that generates genetic diversity. Given its importance, we would expect the number and location of exchanges to be tightly regulated. However, studies show significant gender and inter-individual variation in genome-wide recombination rates. The genetic basis for this variation is poorly understood.

The Chowdhury paper provides the following graphs. These graphs show the average number of recombinations, or crossovers, per meiosis for each of two different studies, the AGRE and the FHS study, discussed in the paper.

The bottom line of this paper, for genetic genealogists, is that males average about 27 crossovers per child and females average about 42, with the AGRE study families reporting 41.1 and the FHS study families reporting 42.8.

I have been collaborating with statistician, Philip Gammon, and he points out the following:

Male, 22 chromosomes plus the average of 27 crossovers = an average of 49 segments of his parent’s DNA that he will pass on to his children. Roughly half will be from each of his parents. Not exactly half. If there are an odd number of crossovers on a chromosome it will contain an even number of segments and half will be from each parent. But if there are an even number of crossovers (0, 2, 4, 6 etc.) there will be an odd number of segments on the chromosome, one more from one parent than the other.

The average size of segments will be approximately:

  • Males, 22 + 27 = 49 segments at an average size of 3400 / 49 = 69 cM
  • Females, 22 + 42 = 64 segments at an average size of 3400 / 64 = 53 cM
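The segment arithmetic above can be reproduced with a few lines of Python. This is only a sketch using the averages quoted in this article (22 autosomes, roughly 3400 cM, 27 and 42 crossovers); the function names are mine.

```python
AUTOSOMES = 22
GENOME_CM = 3400  # approximate total length used in this article


def segments_passed(avg_crossovers):
    """Each chromosome starts as one segment; every crossover adds one more."""
    return AUTOSOMES + avg_crossovers


def avg_segment_cm(avg_crossovers):
    """Average segment size in cM for a given average crossover count."""
    return GENOME_CM / segments_passed(avg_crossovers)


def segments_on_chromosome(crossovers):
    """Segments on a single chromosome with the given number of crossovers."""
    return crossovers + 1


for label, crossovers in (("Males", 27), ("Females", 42)):
    print(label, segments_passed(crossovers), round(avg_segment_cm(crossovers)))
# Males 49 69
# Females 64 53

# Odd crossover counts give an even number of segments, and vice versa.
print([segments_on_chromosome(n) for n in (0, 1, 2, 3)])  # [1, 2, 3, 4]
```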

This means that cumulatively, over time, in a line of entirely females, versus a line of entirely males, you’re going to see bigger chunks of DNA preserved (and lost) in males versus females, because the DNA divides fewer times. Bigger chunks of DNA mean better matching more generations back in time. When males do have a match, it would be likely to be on a larger segment.

The article, First Cousin Match Simulations, speaks to this as well.

Practically Speaking

What does this mean, practically speaking, to genetic genealogists?

Few lines actually descend from all males or all females. Most of our connections to distant ancestors are through mixtures of male and female ancestors, so this variation in crossover rates really doesn’t affect us much – at least not on the average.

It’s difficult to discern why we match some cousins and we don’t match others. In some cases, rather than random recombination being a factor, the actual crossover rate may be at play. However, since we only know who we do match, and not who tested and we don’t match, it’s difficult to even speculate as to how recombination affected or affects our matches. And truthfully, for the application of genetic genealogy, we really don’t care – we (generally) only care who we do match – unless we don’t match anyone (or a second cousin or closer) in a particular line, especially a relatively close line – and that’s a horse of an entirely different color.

To me, the burning question to be answered, which still has not been unraveled, is why a difference in recombination rates exists between males and females. What processes are in play here that we don’t understand? What else might this not-yet-understood phenomenon affect?

Until we figure those things out, I note whether or not my match occurred through primarily men or women, and simply add that information into the other data that I use to determine match quality and possible distance.  In other words, the information that tells me how close and reasonable a match is likely to be includes the following:

  • Total amount of shared DNA
  • Largest segment size
  • Number of matching segments
  • Number of SNPs in matching segment
  • Shared matches
  • X chromosome
  • mtDNA or Y DNA match
  • Trees – presence, absence, accuracy, depth and completeness
  • Primarily male or female individuals in path to common ancestor
  • Who else they match, particularly known close relatives
  • Does triangulation occur

It would be very interesting to see how the instances of matches to a certain specific cousin level – say 3rd cousins (for example), fare differently in terms of the average amount of shared DNA, the largest segment size and the number of segments in people descended from entirely female and entirely male lines. Blaine Bettinger, are you listening? This would be a wonderful study for the Shared cM Project which measures actual data.

Isn’t the science of genetics absolutely fascinating???!!!


Imputation Analysis Utilizing Promethease

We know in the genetics industry that imputation is either coming or already here for genetic genealogy. I recently wrote two articles, here and here, explaining imputation and its (apparent) effects on matching – or at least the differences between vendors who do and don’t utilize imputation on the segments that are set forth as matches.

I will be writing shortly about my experience utilizing DNA.Land, a vendor who encourages testers to upload their files to be shared with medical researchers. In return, DNA.Land provides matching information and ethnicity – but they do impute results that you don’t have based on “typical” DNA that is generally inherited with the DNA you do have.

Aside from my own curiosity and interest in health, I have been attempting to determine the relative accuracy of imputation.

Promethease is a third party site that provides consumers who upload their autosomal DNA files with published information about their SNPs, mutations, either bad, good or neither, meaning just information. This makes Promethease the perfect avenue for checking the accuracy of the imputed data provided by DNA.Land against the Promethease results generated from files from vendors who do not impute.

Even better, I can directly compare the autosomal file from Family Tree DNA that I uploaded to DNA.Land with my resulting DNA.Land file after DNA.Land imputed another 38 million locations. I can also compare the DNA.Land results to an extensive exome test that provided results for some 50 million locations.

Uploading all of the files from various testing vendors separately to Promethease allows me to see which of the mutations imputed by DNA.Land are accurate when compared to actual DNA tests, and if the imputed mutations are accurate when the same location was tested by any vendor.

In addition to the typical genetic genealogy vendors, I’ve also had my DNA exome sequenced, which includes the 50 million locations in humans most likely to mutate.  This means those locations should be the locations most likely to be imputed by DNA.Land.

Finally, at Promethease, I can combine my results from all the vendors where I actually tested to provide the greatest coverage of actually tested locations, and then compare to DNA.Land – providing the most comprehensive comparison.

I will utilize the testing vendors’ actual results to check the DNA.Land imputed results.

Let’s see what the results produce.

The Test Process

The method I used for this comparison was to upload my Family Tree DNA autosomal raw data file to DNA.Land. DNA.Land then took the 700,000+ locations that I did test for at Family Tree DNA, and imputed more than 38 million additional locations, raising my tested and imputed number of locations to about 39 million.

Then, I downloaded my huge DNA.Land file and uploaded it to Promethease, following the Promethease instructions.

In order to do a comparison against the imputed data that DNA.Land provided, I uploaded files from the following vendors individually, one at a time, to Promethease to see which versions of the files provided which results – meaning which mutations the files produced by actual testing at vendors could confirm in the DNA.Land imputed results.

  • DNA.Land (imputed)
  • Genos – Exome testing of 50 million medically relevant locations
  • Ancestry V1 test
  • Ancestry V2 test
  • Family Tree DNA
  • 23andMe V3 test
  • 23andMe V4 test
  • Combined file of all non-imputed vendor files

Promethease provides a wonderful feature that enables users to combine multiple vendors’ files into one run. As a final test, I combined all of my non-imputed files into one run in order to compare all of my non-imputed results, together, with DNA.Land’s imputed results.

Promethease provides results that fall into 3 categories:

  • Bad – red
  • Good – green
  • Grey – “not set” – neither bad nor good, just information

Promethease does not provide diagnoses of any form, just information from the published literature about various mutations and genetic markers and what has been found in research, with links to the sources through SNPedia.

Results

I compiled the following chart with the results of each individual file, plus a combined file made up of all of the non-imputed files.

The results are quite interesting.

The combined run that included all of the vendors’ files except for DNA.Land provided more “bad” results than the imputed DNA.Land file.

I expected that the Genos exome test would have covered all of the locations tested by the three genetic genealogy vendors, but clearly not, given that the combined run provides more results than the Genos exome run by itself. In fact, the total number of locations reported is 80,607 for the combined run, while the Genos run alone reported only 45,595.

DNA.Land only imputed 34,743 locations that returned results.

Comparison for Accuracy

Now, the question is whether the DNA.Land imputed results are accurate.

Due to the sheer number of results, I focused only on the “bad” results, the ones that would be most concerning, to get an idea of how many of the DNA.Land results were tested in the original uploaded file (from FTDNA) and how many were imputed. Of the imputed locations, I determined how many are accurate by comparing the DNA.Land results to the combined testing results. My hope is, of course, that most of the locations found in the DNA.Land imputed file are also to be found in one of the files tested at the vendors, and therefore covered in the combined file run.

I combined my results from 3 runs (the FTDNA file, the DNA.Land imputed file and the combined file) into a common spreadsheet, color coding each result differently, with two questions in mind:

  • First, I wanted to see the locations reported as “bad” that were actually tested at FTDNA. By comparing the FTDNA locations with the DNA.Land imputed file, we know that DNA.Land was NOT imputing those locations, and conversely, that they WERE imputing the rest of the locations.
  • Second, I wanted to know if locations imputed by DNA.Land and reported as “bad” had been tested by any testing company, and if DNA.Land’s imputation was accurate as compared to an actual test.

You can read more about how Promethease reports results, here.

I’m showing two results in the spreadsheet example, below.

  • White row = FTDNA test result
  • Yellow row = DNA.Land result
  • Blue row = combined test result

These two examples show two mutations that are ranked as “bad” for the same condition. This result really only tells me that I metabolize some things slower than other people. Reading the fine print tells me this as well:

The proportion of slow and rapid metabolizers is known to differ between different ethnic populations. In general, the slow metabolizer phenotype is most prevalent (>80%) in Northern Africans and Scandinavians, and lowest (5%) in Canadian Eskimos and Japanese. Intermediate frequencies are seen in Chinese populations (around 20% slow metabolizers), whereas 40 – 60% of African-Americans and most non-Scandinavian Caucasians are slow metabolizers.[PMID 16416399]

Many of you are probably slow metabolizers too.

I used this example to illustrate that not everything that is “bad” is going to keep you awake at night.

The first mutation, gs140, is found in the DNA.Land file, but there is no corresponding white row, representing the original Family Tree DNA report, meaning that DNA.Land imputed the result. However, gs140 is tested by some vendor in the combined file. The results do match (verified by actually comparing the results individually) and therefore, the DNA.Land imputation was accurate as noted in the DNA.Land Analysis column at far right.

In the second example, gs154 is reported by DNA.Land, but since it’s also reported by Family Tree DNA in the white row, we know that this value was NOT imputed by DNA.Land, because this was part of the originally uploaded file. Therefore, in the Analysis column, I labeled this result as “tested at FTDNA.”

Analysis

I analyzed each of the rows of “bad” results found in the DNA.Land file by comparing them first to the FTDNA file and then the Combined file. In some cases, I needed to return to the various vendor results to see which vendor had done the testing on a specific location in order to verify the result from the individual run.
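I did this comparison by hand in a spreadsheet, but the decision process I followed for each row can be sketched in Python. This is an illustration only: the dictionaries, the "call-x" style values and the "example-*" identifiers are hypothetical placeholders, not my actual results. Only the gs140 and gs154 outcomes mirror the two examples above.

```python
from collections import Counter


def classify(snp, dnaland, ftdna, vendor_calls):
    """Label one 'bad' DNA.Land result by how it can be verified."""
    if snp in ftdna:
        # Present in the uploaded FTDNA file, so DNA.Land did not impute it.
        return "tested at FTDNA"
    calls = vendor_calls.get(snp, [])
    if not calls:
        return "imputed, but not tested elsewhere"
    if len(set(calls)) > 1:
        # The vendors that actually tested the location disagree.
        return "conflict"
    if calls[0] == dnaland[snp]:
        return "imputed correctly"
    return "imputed incorrectly"


# Hypothetical placeholder data, NOT real genotypes.
dnaland = {
    "gs140": "call-x",
    "gs154": "call-y",
    "example-1": "call-z",
    "example-2": "call-p",
    "example-3": "call-q",
}
ftdna = {"gs154": "call-y"}
vendor_calls = {
    "gs140": ["call-x", "call-x"],      # vendors agree with DNA.Land
    "example-2": ["call-p", "call-r"],  # vendors disagree with each other
    "example-3": ["call-s", "call-s"],  # vendors agree, DNA.Land differs
}

labels = {snp: classify(snp, dnaland, ftdna, vendor_calls) for snp in dnaland}
tally = Counter(labels.values())

for snp, label in labels.items():
    print(snp, "->", label)
```

Tallying the labels this way gives the same kind of counts that I report for my own results.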

So, how did DNA.Land do with imputing data as compared with actual tested results?

Result | # Results | % | Comment
Tested, not Imputed | 171 | 38.6 | This “bad” location was tested at FTDNA and uploaded, so we know it was reported accurately at DNA.Land and not imputed.
Total Imputed* | 272 | 61.4 | Meaning total of “bad” results not tested at FTDNA, so not uploaded to DNA.Land, therefore imputed.
Imputed Correctly | 259 | 95.22 | This result was verified to match a tested location in the combined run.
Imputed, but not tested elsewhere | 6 | 2.21 | Accuracy cannot be confirmed.
Conflict | 3 | 1.10 | DNA.Land results cannot be verified due to an error of some sort – two of these three are probably accurate.
Imputed Incorrectly | 4 | 1.47 | Confirmed by the combined run where the location was actually tested at multiple vendors.
Not reported, and should have been | 1 | 0.37 | 4 other vendor tests showed this mutation, including FTDNA which was uploaded to DNA.Land. Therefore this location should have been reported by the DNA.Land file.

*The total number of “bad” results was 443, 171 that were tested and 272 that were imputed. Note that the percentages of imputations shown below the “Total Imputed” number of 272 are calculated based on the number of locations imputed, not on the total number of locations reported.
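If you want to verify the percentages in the table, the following Python sketch recomputes them from the counts. The counts come straight from the table; the variable names and the rounding choices are mine.

```python
TESTED_AT_FTDNA = 171
NOT_REPORTED = 1

# Breakdown of the imputed "bad" results.
imputed_counts = {
    "Imputed correctly": 259,
    "Imputed, but not tested elsewhere": 6,
    "Conflict": 3,
    "Imputed incorrectly": 4,
}

total_imputed = sum(imputed_counts.values())
total_bad = TESTED_AT_FTDNA + total_imputed


def pct(part, whole, digits=2):
    """Percentage of whole, rounded to the given number of digits."""
    return round(100 * part / whole, digits)


# The top two rows are percentages of all "bad" results.
print("Tested, not imputed:", pct(TESTED_AT_FTDNA, total_bad, 1))  # 38.6
print("Total imputed:", pct(total_imputed, total_bad, 1))          # 61.4

# The remaining rows are percentages of the imputed locations only.
for label, count in imputed_counts.items():
    print(label + ":", pct(count, total_imputed))
print("Not reported:", pct(NOT_REPORTED, total_imputed))           # 0.37
```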

Concerns, Conflicts and Errors

It’s worth noting that my highest imputed “bad” risk from DNA.Land was not tested elsewhere, so cannot be verified, which concerns me.

On the three results where a conflict exists, all 3 locations were tested at multiple other vendors, and the results at the other vendors where the results were actually tested show different results from each other, which means that the DNA.Land result cannot be verified as accurate. Clearly, an error exists in at least one of the other tests.

In one conflict case, this error has occurred at 23andMe on either their V3 or V4 chip, where the results do not match each other.

In a second conflict case, two of the other vendors agree and the DNA.Land imputation is likely accurate, as it matches 2 of the three other vendor tests.

In the third conflict case, the Ancestry V2 test confirms one of the 23andMe results, which matches the DNA.Land results, so the DNA.Land result is likely accurate.

Of the 4 results that were confirmed to be imputed incorrectly, all locations were tested at multiple vendors. In two cases, the location was confirmed on two other tests and in the other two cases, the location was tested at three vendors. The testing vendors’ results all matched each other.

Summary

Overall, given the problems found with both DNA.Land and MyHeritage, who both impute, relative to genetic genealogy matching, I was surprised to find that the DNA.Land imputed health results were relatively accurate.

I expected the locations reported in the FTDNA file to be reported accurately by DNA.Land, because that data was provided to them. In one case, it was not.

Of the 272 “bad” results imputed, 259, or 95.22% could be verified as accurate.

Six could not be verified, and three were in conflict, but of those, it’s likely that two of the three were imputed accurately by DNA.Land. The third can’t be verified. This totals 3.31% of the imputed results that are ambiguous.

Only 1.47% were imputed incorrectly. If you add the .37% for the location that was not reported and should have been, and make the leap of assumption that the one of three in conflict is in error, DNA.Land is still just over a 2% confirmed error rate.
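For anyone who wants to recheck these percentages, here is a small Python sketch of the arithmetic. The counts come from the table above; the variable and function names are my own, and the final figure makes the same leap of assumption described above, that one of the three conflicts is an error.

```python
# Counts taken from the comparison table; names are illustrative only.
TOTAL_IMPUTED = 272

counts = {
    "imputed correctly": 259,
    "imputed, not tested elsewhere": 6,
    "conflict": 3,
    "imputed incorrectly": 4,
    "not reported, should have been": 1,
}


def pct(n, total=TOTAL_IMPUTED):
    """Percentage of the imputed locations, rounded to two decimals."""
    return round(100 * n / total, 2)


percentages = {label: pct(n) for label, n in counts.items()}

# Ambiguous: results that could not be verified (untested plus conflicts).
ambiguous = pct(counts["imputed, not tested elsewhere"] + counts["conflict"])

# Assumed worst case: incorrect + not reported + one conflict taken as wrong.
worst_case = pct(
    counts["imputed incorrectly"] + counts["not reported, should have been"] + 1
)

for label, value in percentages.items():
    print(f"{label}: {value}%")
print(f"ambiguous: {ambiguous}%")
print(f"worst-case error rate: {worst_case}%")
```

Every percentage is calculated against the 272 imputed locations, not against the 443 total “bad” results.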

I can see why Illumina would represent to the vendors that imputation technology is “very accurate.” “Very” of course is relative, pardon the pun, in genetic genealogy, to how well matching occurs, not only when the new GSA chip is compared to another GSA chip, but when the new GSA version is compared to the older OmniExpress version. For backward compatibility between the chip versions, imputation must be utilized. Thanks a lot Illumina (said in my teenage sarcastic voice).

Since DNA.Land accepts files from all the vendors on all chips, for DNA.Land to be able to compare all locations in all vendors’ files against each other, the “missing” data in each file must be imputed. MyHeritage is doing something similar (having hired one of the DNA.Land developers), and both vendors have problems with genetic genealogy matching.

This begs the question of why the matching is demonstrably so poor for genetic genealogy. I’ve written about this phenomenon here, Kitty Cooper wrote about it here and Leah Larkin here.

Based on this comparison, each individual DNA.Land imputed file would contain about a 2% error rate of incorrectly imputed data, assuming the error rate is the same across the entire file, so a combined total of 4% for two individuals, if you’re just looking at individual SNPs. Perhaps entire segments are being imputed incorrectly, given that we know that DNA is inherited in segments. If that is the case, and these individual SNPs are simply small parts of entire segments that are imputed incorrectly, they might account for an equal number of false positive matches. In other words, if 10 segments are imputed incorrectly for me, that’s 10 segments reporting false positive matches I’ll have when paired against anyone who receives the same imputed data. However, that doesn’t explain the matches that are legitimate (on tested segments) and aren’t found by the imputing vendors, and it doesn’t explain an erroneous match rate that appears to be significantly higher than the 2-4% found in this comparison.

I’ll be writing about the DNA.Land matching comparison experience shortly.

I would strongly prefer that medical research be performed on fully tested individuals. I realize that encouraging consumers to upload their data, and then imputing additional information, is much less expensive than actual testing. However, accuracy is an issue, and a 2% error rate, if someone is dealing with life-saving and life-threatening research, could be a huge margin of error from the beginning of the project, based on faulty imputation – which could be eliminated by simply testing people. This seems like an unnecessary risk and faulty research just waiting to happen. This error rate is on top of the actual sequencing error rate, but sequencing errors will be found in different locations in individuals, not on the same imputed segment assigned to multiple people in population groups. Imputation errors could be cumulative in one location, appearing as a hot spot when in reality, it’s an imputation error.

As related to genetic genealogy, I don’t think imputation and genetic genealogy are good bedfellows. DNA.Land’s matching was even worse when it was initially introduced, which is one reason I’ve waited so long to upload and write about the service.

Unfortunately, with Illumina obsoleting the OmniExpress chip, we’re not going to have a choice, sooner than later. All vendors who utilized the OmniExpress chip are being forced off, either onto the GSA chip or to an Exome or full sequence chip. The cost of sequencing for anything other than the GSA chip is simply more than the genetic genealogy market will stand, not to mention even larger compatibility issues. My Genos Exome test cost $499 just a few months ago and still sells for that price today.

The good news is that utilizing imputation, we will still receive matches, just less accurate matches when comparing the new chip to older versions, and when using imputation.

New testers will never know the difference. Testers not paying close attention won’t notice or won’t realize either. That leaves the rest of us “old timers” who want increased accuracy and specificity, not less, flapping in the wind along with the vendors who don’t sell our test results into the medical arena and have no reason to move to the new GSA platform other than Illumina obsoleting the OmniExpress chip.

Like I said, thanks Illumina.

Imputation Matching Comparison

In a future article, I’ll be writing about the process of uploading files to DNA.Land and the user experience, but in this article, I want to discuss only one topic, and that’s the results of imputation as it affects matching for genetic genealogy. DNA.Land is one of three companies known positively to be using imputation (DNA.Land, MyHeritage and LivingDNA), and one of two that allows transfers and does matching for genealogy.

This is the second in a series of three articles about imputation.

Imputation, discussed in the article, Concepts – Imputation, is the process whereby your DNA that is tested is then “expanded” by inferring results you don’t have, meaning locations that haven’t been tested, by using information from results you do have. Vendors have no choice in this matter, as Illumina, the maker of the DNA chip widely utilized in the genetic genealogy marketspace, has obsoleted the prior chip and moved to a new chip with only about 20% overlap in the locations previously tested. Imputation is the methodology utilized to attempt to bridge the gap between the two chips for genetic genealogy matching and ethnicity predictions.

Imputation is built upon two premises:

1 – that DNA locations are inherited together

2 – that people from common populations share a significant amount of the same DNA

An example of imputation that DNA.Land provides is the following sentence.

I saw a blue ca_ on your head.

There are several letters that are more likely than others to be found in the blank and some words would be more likely to be found in this sentence than others.

A less intuitive sentence might be:

I saw a blue ca_ yesterday.

DNA.Land doesn’t perform DNA testing, but instead takes a file that you upload from a testing vendor that has around 700,000 locations and imputes another 38.3 million variants, or locations, based on what other people carry in neighboring locations. These numbers are found in the SNPedia instructions for uploading DNA.Land information to their system for usage with Promethease.
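To make the two premises a bit more concrete, here is a toy Python sketch of the idea. This is NOT DNA.Land’s algorithm, which I have not seen; the reference panel, the five-letter “haplotypes” and the function name are all made up for illustration. It simply fills in a blank with the letter most often seen in reference sequences that agree with the sample everywhere else, much like guessing the missing letter in “ca_”.

```python
from collections import Counter

# Hypothetical reference panel: each string is one haplotype, one letter per
# location. Real panels hold far more haplotypes and millions of locations.
REFERENCE_PANEL = [
    "ACGTA",
    "ACGTA",
    "ACGTT",
    "GCATA",
    "GCATA",
]


def impute(sample, panel=REFERENCE_PANEL, missing="?"):
    """Fill each missing location with the allele most common among the
    panel haplotypes that agree with the sample at every tested location."""
    tested = [(i, a) for i, a in enumerate(sample) if a != missing]
    # Premise 1: locations are inherited together, so only haplotypes that
    # match the tested locations are used as evidence.
    compatible = [h for h in panel if all(h[i] == a for i, a in tested)]
    if not compatible:
        # No haplotype fits, so fall back to the whole panel.
        compatible = panel
    result = []
    for i, allele in enumerate(sample):
        if allele == missing:
            # Premise 2: people from a common population share much of
            # the same DNA, so take the most frequent allele.
            allele = Counter(h[i] for h in compatible).most_common(1)[0][0]
        result.append(allele)
    return "".join(result)


print(impute("AC?TA"))
print(impute("GC?TA"))
```

Notice that the answer is always a best guess. If your real value at the blank location is the rare one, the guess is simply wrong, which is exactly the concern with imputed data.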

I originally wrote about Promethease here, and I’ll be publishing an updated article shortly.

In this article, I want to see how imputation affects matching between people for genetic genealogy purposes.

Genetic Genealogy Matching

In order to be able to do an apples to apples comparison, I uploaded my Family Tree DNA autosomal file to DNA.Land.

DNA.Land then processed my file, imputed additional values, then showed me my matches to other people who have also uploaded and had additional locations imputed.

DNA.Land has just over 60,000 uploads in their data base today. Of those, I match 11 at a high confidence level and one at a speculative level.

My best match, meaning my closest match, Karen, just happened to have used her GedMatch kit number for her middle name. Smart lady!

Karen’s GedMatch number provided me with the opportunity to compare our actual match information at DNA.Land, then also at GedMatch, then compare the two different match results in order to see how much of our matching was “real” from portions of our tested kits that actually match, and what portion of our DNA matches as a result of the DNA.Land imputation.

At DNA.Land, your match information is presented with the following information:

  • Relationship degree – meaning estimated relationship
  • # shared segments – although many of these are extremely small
  • Total shared cM
  • Total recent shared length in cM
  • Longest recent shared segment in cM
  • Relationship likelihood graph
  • Shared segments plotted on chromosome display
  • Shared segments in a table

DNA.Land provides what they believe to be an accurate estimate of recent and anciently shared DNA segments.

The match table is a dropdown underneath the chromosome graphic at far right:

For this experiment, I copied the information from the match table and dropped it into a spreadsheet.

DNALand Match Locations

My match information is shown at DNA.Land with Karen as follows:

Matching segments are identified by DNA.Land as either recent or ancient, which I find to be over-simplified at best and misleading or inaccurate at worst. I guess it depends on how you perceive recent and ancient. I think they are trying to convey the concept that larger segments tend to be more recent, and smaller segments tend to be older, but ancient in the genetics field often refers to DNA extracted from exhumed burials from thousands of years ago. Furthermore, smaller segments can be descended from the same ancestor as larger segments.

GedMatch Match

Since Karen so kindly provided her GedMatch kit number, I signed in to GedMatch and did a one-to-one match with this same kit.

Since all of the segments are 3 cM and over at DNA.Land, I utilized a GedMatch threshold of 3 cM and dropped the SNP count to 100, since a SNP count of 300 gave me few matches. For this comparison, I wanted to see all my matches to Karen, no matter how few SNPs are involved, in an attempt to obtain results similar to DNA.Land. I normally would not drop either of these thresholds this low. My typical minimum is 5cM and 500 SNPs, and even if I drop to 3cM, I still maintain the 500 SNP threshold.
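If you keep your segments in a spreadsheet or a file, applying the two thresholds is a simple filter. The Python sketch below shows the idea; the segment values are hypothetical, not my actual matches with Karen, and the class and function names are mine, not anything GedMatch provides.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    chromosome: int
    start: int    # start position
    end: int      # end position
    cm: float     # segment length in centimorgans
    snps: int     # number of SNPs in the segment


def passes(segment, min_cm, min_snps):
    """True if the segment meets both the cM and the SNP threshold."""
    return segment.cm >= min_cm and segment.snps >= min_snps


# Hypothetical segments, for illustration only.
segments = [
    Segment(8, 10_000_000, 40_000_000, 31.5, 5200),
    Segment(2, 5_000_000, 6_500_000, 3.4, 150),
    Segment(11, 70_000_000, 72_000_000, 3.1, 320),
]

loose = [s for s in segments if passes(s, 3, 100)]   # thresholds used here
usual = [s for s in segments if passes(s, 5, 500)]   # my typical minimum

print(len(loose), "segments at 3 cM / 100 SNPs")
print(len(usual), "segments at 5 cM / 500 SNPs")
```

Lowering either threshold only ever adds segments, which is why the low settings are useful for this comparison but not something I would normally rely on.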

Let’s see how the data from GedMatch and DNA.Land compares.

In my spreadsheet, below, I pasted the segment match information from DNA.Land in the first 5 columns with a red header. Note that DNA.Land does not provide the number of shared SNPs.

At right, I pasted the match information from GedMatch, with a green header. We know that GedMatch has a history of accurately comparing segments, and we can do a cross platform comparison. I originally uploaded my FTDNA file to DNA.Land and Karen uploaded an Ancestry file. Those are the two files I compared at GedMatch, because the same actual matching locations are being compared at both vendors, DNA.Land (in addition to imputed regions) and GedMatch.

I then copied the matching segments from GedMatch (3cM, 100 SNPs threshold) and placed them in the middle columns in the same row where they matched corresponding DNA.Land segments. If any portion of the two vendors’ segments overlapped, I copied them as a match, although two are small and partial and one is almost negligible. As you can see, there are only 10 segments with any overlap at all in the center section. Please note that I am NOT suggesting these are valid or real matches. At this point, it’s only a math/match exercise, not an analysis.
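I did this pairing by hand in the spreadsheet, but the “any portion overlaps” rule is easy to express in code. Here is a minimal Python sketch of that rule, assuming each segment is recorded as a chromosome with a start and end position. The positions shown are invented for illustration and are not the actual values from my comparison.

```python
def overlaps(a, b):
    """True if two segments on the same chromosome share any positions.
    Each segment is a (chromosome, start, end) tuple."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def pair_segments(dnaland, gedmatch):
    """For every DNA.Land segment, list the GedMatch segments overlapping it."""
    return {seg: [g for g in gedmatch if overlaps(seg, g)] for seg in dnaland}


# Hypothetical positions, for illustration only.
dnaland = [(8, 100, 250), (8, 260, 400), (3, 50, 90)]
gedmatch = [(8, 90, 410), (5, 10, 60)]

pairs = pair_segments(dnaland, gedmatch)
unmatched = [seg for seg, hits in pairs.items() if not hits]

print(len(pairs) - len(unmatched), "DNA.Land segments overlap a GedMatch segment")
print("No overlap at all:", unmatched)
```

Note that one long GedMatch segment can pair with two shorter DNA.Land segments, which is the situation you would see if a single tested segment were reported in two pieces.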

The match comparison column (yellow header) is where I commented on the match itself. In some cases, the lack of the number of SNPs at DNA.Land was detrimental to understanding which vendor was a higher match. Therefore, when possible, I marked the higher vendor in the Match Comparison column with the color of their corresponding header.

Analysis

Frankly, I was shocked at the lack of matching between GedMatch and DNA.Land. Trying to understand the discrepancy, I decided to look at the matches between Karen, who has been very helpful, and me at other vendors.

I then looked at our matches at Ancestry, 23andMe, MyHeritage and at Family Tree DNA.

The best comparison would be at Family Tree DNA where Karen loaded her Ancestry file.  Therefore, I’m comparing apples to apples, meaning equivalent to the comparison at GedMatch and DNA.Land (before imputation).

It’s impossible to tell much without a chromosome browser at Ancestry, especially after Timber processing which reduces matching DNA.

DNA.Land categorized my match to Karen as “high certainty.” My match with Karen appears to be a valid match based on the longest segment(s) of approximately 30cM on chromosome 8.

  • Of the 4 segments that DNA.Land identifies as “recent” matches, 2 are not reflected at all in the GedMatch or Family Tree DNA matching, suggesting that these regions were imputed entirely, and incorrectly.
  • Of the 4 segments that DNA.Land identifies as “recent” matches, the 2 on chromosome 8 are actually one segment that imputation apparently divided. According to DNA.LAND, imputation can increase the number of matching segments. I don’t think it should break existing segments, meaning segments actually tested, into multiple pieces. In any event, the two vendors do agree on this match, even though DNA.Land breaks the matching segment into two pieces where GedMatch and Family Tree DNA do not. I’m presuming (I hate that word) that this is the one segment that Ancestry calls as a match as well, because it’s the longest, but Ancestry’s Timber algorithm downgrades the match portion of that segment by removing 11cM (according to DNA.Land) from 29cM to 18cM or removes 13cM (according to both GedMatch and Family Tree DNA) from 31cM to 18cM. Both GedMatch and Family Tree DNA agree and appear to be accurate at 31cM.
  • Of the total 39 matching segments of any size, utilizing the 3cM threshold and 100 SNPs, which I set artificially very low, GedMatch only found 10 matching segments with any portion of the segment in common, meaning that at least 29 were entirely erroneous matches.
  • Resetting the GedMatch match threshold to 3 cM and 300 SNPs, a more reasonable SNP threshold for 3cM, GedMatch only reports 3 matching segments, one of which is chromosome 8 (undivided, where DNA.Land shows two pieces), which means at this threshold, 35 of the 39 matching DNA.Land segments are entirely erroneous. Setting the threshold to a more reasonable 5cM or 7cM and 500 SNPs would result in only the one match on chromosome 8.

  • If 29 of 39 segments (at 3cM 100 SNPs) are erroneously reported, that equates to 74.36% erroneous matches due to imputation alone, without considering identical by chance (IBC) matches.
  • If 35 of 39 segments (at 3cM 300 SNPs) are erroneously reported, that equates to 89.74% erroneous matches, again without considering those that might be IBC.
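The two percentages above are simple division, shown here as a quick Python check. The function name is mine; the counts are the ones given in the bullets.

```python
def percent_erroneous(erroneous, total):
    """Share of reported segments with no confirmation, as a percentage."""
    return round(100 * erroneous / total, 2)


TOTAL_DNALAND_SEGMENTS = 39

at_100_snps = percent_erroneous(29, TOTAL_DNALAND_SEGMENTS)
at_300_snps = percent_erroneous(35, TOTAL_DNALAND_SEGMENTS)

print(f"3 cM / 100 SNPs: {at_100_snps}% erroneous")
print(f"3 cM / 300 SNPs: {at_300_snps}% erroneous")
```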

Predicted vs Actual

One additional piece of information that I gathered during this process is the predicted relationship.

Vendor | Total cM | Total Segments | Longest Segment | Predicted Relationship
DNA.Land | 162 to 3 cM | 39 to 3 cM | 17.3 & 12, split | 3C
GedMatch | 123 to 3 cM | 27 to 3 cM | 31.5 | 5.1 gen distant
Family Tree DNA | 40 to 1 cM | 12 to 1 cM | 32 | 3-5C
MyHeritage | No match | No match | No match | No match
Ancestry | 18.1 | 1 | 18.1 | 5-8C
23andMe | 26 | 1 | 26 | 3-6C

Karen utilized her Ancestry file and I used my Family Tree DNA file for all of the above matching except at 23andMe and Ancestry where we are both tested on the vendors’ platform. Neither 23andMe nor Ancestry accept uploads. I included the 23andMe and Ancestry comparisons as additional reference points.

The lack of a match at MyHeritage, another company that implements imputation, is quite interesting. Karen and I, even with a significantly sized segment, are not shown as a match at MyHeritage.

If imputation actually breaks some matching segments apart, like the chromosome 8 segment at DNA.Land, it’s possible that the resulting smaller individual segments simply didn’t exceed the MyHeritage matching threshold. It would appear that the MyHeritage matching threshold is probably 9cM, given that my smallest segment match of all my matches at MyHeritage is 9cM. Therefore, a 31 or 32 cM segment would have to be broken into 4 roughly equally sized pieces (32/4=8) for the match to Karen not to be detected because all segment pieces are under 9cM. MyHeritage has experienced unreliable matching since their rollout in mid 2016, so their issue may or may not be imputation related.
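Here is the same back-of-the-envelope calculation as a short Python sketch. It assumes, as I did above, that the MyHeritage threshold is 9cM (my guess, not a published figure), that a segment must reach the threshold to be reported, and that the segment is broken into equal pieces. The function name is my own.

```python
import math


def pieces_needed_to_hide(segment_cm, threshold_cm):
    """Smallest number of equal pieces that puts every piece under the
    threshold, so that none of them would be reported as a match."""
    return math.floor(segment_cm / threshold_cm) + 1


for length in (31, 32):
    n = pieces_needed_to_hide(length, 9)
    print(f"{length} cM -> {n} pieces of {length / n:.2f} cM each")
```

Either way, the segment would have to be broken into four pieces to disappear entirely, which is a lot of breakage for one tested segment.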

The Common Ancestor

At Family Tree DNA, Karen does not match my mother, so I can tell positively that she is related through my father’s line. She and I triangulate on our common segment with three other individuals who descend from Abraham Estes 1647-1720.

Utilizing the chromosome browser, we do indeed match on chromosome 8 on a long segment, which is also our only match over 5cM at Family Tree DNA.

Based on our trees as well as the trees of our three triangulated Estes matches, Karen and I are most probably either 8th cousins, or 8th cousins once removed, assuming that is our only common line. I am 8th cousins with the other three triangulated matches on chromosome 8. Karen’s line has yet to be proven.

Imputation Matching Summary

I like the way that DNA.Land presents some of their features, but as for matching accuracy, you can view the match quality in various ways:

  1. DNA.Land did find the large match on chromosome 8. Of course, in terms of matching, that’s pretty difficult to miss at roughly 30cM, although MyHeritage managed. Imputation did split the large match into two, somehow, even though Karen and I match on that same segment as one segment at other vendors comparing the same files.
  2. Of the 39 DNA.Land total matches, other than the chromosome 8 match, two other matches are partial matches, according to GedMatch. Both are under 7cM.
  3. Of DNA.Land’s total 39 matches, 35 are entirely wrong, in addition to the two that are split, including two inaccurate imputed matches at over 5cM.
  4. At DNA.Land, I’m not so concerned about discerning between “real” and “false” small segment matches, as compared to both FTDNA and GedMatch, as I am about incorrectly imputed segments and matches. Whether small matches in general are false positives or legitimate can be debated, each smaller segment match based on its own merits. Truthfully, with larger segments to deal with, I tend to ignore smaller segments anyway, at least initially. However, imputation adds another layer of uncertainty on top of actual matching, especially, it appears, with smaller matches. Imputing entire segments of incorrect DNA concerns me.
  5. Having said that, I find it very concerning that MyHeritage who also utilizes imputation missed a significant match of over 30cM. I don’t know of a match of this size that has ever been proven to be a false match (through parental phasing), and in this case, we know which ancestor this segment descends from through independent verification utilizing multiple other matches. MyHeritage should have found that match, regardless of imputation, because that match is from portions of the two files that were both tested, not imputed.

Summary

To date, I’m not impressed with imputation matching relative to genetic genealogy at either DNA.Land or MyHeritage.

In one case, that of DNA.Land, imputation shows matches for segments that are not shown as matches at either Family Tree DNA or GedMatch who are comparing the same two testers’ files, but without imputation. Since DNA.Land did find the larger segment, and many of their smaller segments are simply wrong, I would suggest that perhaps they should only show larger segments. Of course, anyone who finds DNA.Land is probably an experienced genetic genealogist and probably already has files at both GedMatch and Family Tree DNA, so hopefully savvy enough to realize there are issues with DNA.Land’s matching.

In the second imputation case, that of MyHeritage, the match with Karen is missed entirely, although that may not be a function of imputation. It’s hard to determine. MyHeritage is also comparing the same two files uploaded by Karen and me to the other vendors who found that match, both vendors who do and don’t utilize imputation.

Regardless of imputing additional locations, MyHeritage should have found the matching segment on chromosome 8 because that region does NOT need to be imputed. Their failure to do so may be a function of their matching routine and not of imputation itself. At this point, it’s impossible to discern the cause. We only know, based on matching at other vendors, that the non-match at MyHeritage is inaccurate.

Here’s what DNA.Land has to say about the imputed VCF file, which holds all of your imputed values, when you download the file. They pull no punches about imputation.

“Noisey and probabilistic.” Yes, I’d say they are right, and problematic as well, at least for genetic genealogists.

Extrapolating this even further, I find it more than a little frightening that my imputed data at DNA.Land will be utilized for medical research.

Quoting now from Promethease, a medical reference site that allows the consumer to upload their raw data files, providing consumers with a list of SNPs having either positive or negative research in academic literature:

DNA.land will take a person’s data as produced by such companies and impute additional variants based on population frequency statistics. To put this in concrete terms, a person uploading a typical 23andMe file of ~700,000 variants to DNA.land will get back an (imputed) file of ~39 million variants, all predicted to be present in the person. Promethease reports from such imputed files typically contain about 50% more information (i.e. 50% more genotypes) than the corresponding reports from raw (non-imputed) data.

Translated, this means that your imputed data provides about one and a half times as much “genetic information” as your actual tested data. The question remains, of course, how much of this imputed data is accurate.

That will be the topic of the third imputation article. Stay tuned.

_____________________________________________________________________

Standard Disclosure

This standard disclosure appears at the bottom of every article in compliance with the FTC Guidelines.

Hot links are provided to Family Tree DNA, where appropriate. If you wish to purchase one of their products, and you click through one of the links in an article to Family Tree DNA, or on the sidebar of this blog, I receive a small contribution if you make a purchase. Clicking through the link does not affect the price you pay. This affiliate relationship helps to keep this publication, with more than 850 articles about all aspects of genetic genealogy, free for everyone.

I do not accept sponsorship for this blog, nor do I write paid articles, nor do I accept contributions of any type from any vendor in order to review any product, etc. In fact, I pay a premium price to prevent ads from appearing on this blog.

When reviewing products, in most cases, I pay the same price and order in the same way as any other consumer. If not, I state very clearly in the article any special consideration received. In other words, you are reading my opinions as a long-time consumer and consultant in the genetic genealogy field.

I will never link to a product about which I have reservations or qualms, either about the product or about the company offering the product. I only recommend products that I use myself and bring value to the genetic genealogy community. If you wonder why there aren’t more links, that’s why and that’s my commitment to you.

Thank you for your readership, your ongoing support and for purchasing through the affiliate link if you are interested in making a purchase at Family Tree DNA.

Using Spousal Surnames and DNA to Unravel Male Lines

When Y DNA matching at Family Tree DNA, it’s not uncommon for men to match other males of the same surname who share the same ancestor. In fact, that’s what we hope for, fervently!

However, if you’re stuck downstream, you may need to figure out which of several male children you descend from.

If you’re staring at a brick wall working yourself back in time, you may need to try working forward, utilizing various types of information, including wives’ surnames.

For all intents and purposes, this is my Vannoy line, in Wilkes County, NC, so let’s use it as an example, because it embodies both the promise and the peril of this approach.

So, there you sit, disconnected from the Vannoy line. That little yellow box is just so depressing. So close, but yet so far. And yes, we’ve already exhausted the available paper trail records, years ago.

We know the lineage back through Elijah Vannoy, who was born between 1784-1786 in Wilkes County, or vicinity. We know my Vannoy cousin Y DNA matches with other men from the Vannoy line upstream of John Francis Vannoy, the known father of four sons in Wilkes County, NC and the first (and only) Vannoy to move from New Jersey to that part of North Carolina.

Therefore, we know who the candidates are to be Elijah’s father, but the connection in the yellow box is missing. Many Wilkes County records have gone missing over the years and births were not recorded in that timeframe.  The records from neighboring Ashe County where Daniel Vannoy lived burned during the Civil War, although some records did survive. In other words, the records are rather like Swiss cheese. Welcome to genealogy in the south.

Which of John Francis Vannoy’s four sons does Elijah descend from?

Let’s see what we can discover.

Contact Matches and Ask for Help

The first thing I would do is to ask for assistance from your surname matches.

Let’s say that you match a known descendant of each of these four men, meaning each of John Francis Vannoy’s sons. Ask each person if they know where the male Vannoy descendants of each son went along with any documentation they might have. If your ancestor, Elijah in this case, is not found in the same location as the sons, geography may be your friend.

In our case, we know that Francis Vannoy migrated to Knox County, Kentucky, but that was after he signed for his daughter’s marriage in Wilkes Co., NC in 1812. It was also about this time that Elijah Vannoy migrated to Claiborne County, TN, in the same direction, but not the same location. The two locations are an hour away by car today, separated by mountains and the Cumberland Gap, a nontrivial barrier.

We also know that Nathaniel Vannoy left a Bible that did not list Elijah as one of his children, but with a gap large enough to possibly encompass another child. If you’re thinking to yourself, “Who would leave a child’s birth out of the Bible?” I thought the same thing until I encountered it personally in another line. However, the Bible record does make Nathaniel a less likely father candidate, despite a persistent rumor that Nathaniel was Elijah’s father.

Our only other clues are some tax records recording the number of children in the household of various ages, but none are conclusive. None of these men had wills.

Y DNA Genetic Distance

Your Y DNA matches will show how many mutations you are from them at a particular marker level.

The number of mutations between two men is called the genetic distance.

The rule of thumb is that the more mutations, the further back in time the common ancestor. The problem is, the rule of thumb doesn’t always work. DNA mutates when it darned well pleases, not on any clock that we can measure with that degree of accuracy – at least not accurately enough to tell which of 4 sons a man descends from – unless that line has incurred a defining mutation between the ancestor and the current generation. We call those line marker mutations. To determine the mutation history, you need multiple men from each line to have tested.
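To show what “counting mutations” means, here is a simplified Python sketch that adds up the differences in repeat values between two men, marker by marker. This is only an illustration: the marker values are made up, the function name is mine, and the vendor’s actual calculation may treat some markers differently.

```python
def genetic_distance(a, b):
    """Sum of the step differences across markers tested by both men.
    Simplified: every one-step difference counts as one mutation."""
    shared = a.keys() & b.keys()
    return sum(abs(a[marker] - b[marker]) for marker in shared)


# Hypothetical repeat values for four markers.
man_a = {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS391": 11}
man_b = {"DYS393": 13, "DYS390": 25, "DYS19": 14, "DYS391": 10}

print("Genetic distance:", genetic_distance(man_a, man_b))
```

Two men with the same genetic distance to you are not necessarily equally related to you, because the count says nothing about which markers changed or when.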

You can read more about Y DNA matching in the article, Concepts – Y DNA Matching and Connecting with your Paternal Ancestor.

Check Autosomal DNA Tests

Next, check to see if your Y DNA matches from all Vannoy lines have also taken the autosomal Family Finder test, noted as FF, which shows matches from all ancestral lines, not just the paternal line.

You can see in the match list above that not many have taken the Family Finder test. Ask if they would be willing to upgrade. Be prepared to pay if need be – because you are, after all, the one with the “problem” to solve.

Generally, I simply offer to pay. It’s well worth it to me, and given that paper records don’t exist to answer the question – a DNA test under $100 is cheap. Right now, Family Finder tests are on sale for $69 until the end of the month.

Check for Intermarriage

While you’re waiting for autosomal DNA results, check the pedigrees for all four lines involved to see if you are otherwise related to these men or their wives.

For example, in Andrew Vannoy’s wife’s line and Elijah Vannoy’s wife’s line, we have a common ancestor. George Shepherd and Elizabeth Mary Angelique Daye are common to both lines, and John Shepherd’s wife is unknown, so we have one known problem and one unknown surname.

You can tell already that this could be messy, because we can’t really use Andrew Vannoy’s wife’s line to search for matches because Elijah’s line is likely to match through Andrew’s wife since Susannah Shepherd and Lois McNiel share a common lineage. Rats!

We’ll mark these in red to remind ourselves.

Check Advanced Matching

Family Tree DNA provides a wonderful tool that allows you to compare matches of different kinds of DNA. The Advanced Matching tab is found under “Tools and Apps” under the myFTDNA tab at the upper left.

In this case, I’m going to use the Advanced Match feature to see which of my Vannoy cousin’s Y matches at 37 markers, within the Vannoy DNA project, also match him autosomally.

This report is particularly nice, because it shows number of Y mutations, often indicating distance to a common ancestor, as well as the estimated autosomal relationship range.

You can see in this case that the first Vannoy male, “A,” is a close match both on Y DNA and autosomally, with 1 mutation difference and falling in the 2nd to 4th cousin range, as compared to the second Vannoy male, “D,” who is 3 mutations different and falls into the 4th to remote cousin range.

Not every Vannoy male may have joined the Vannoy project, so you’ll want to run this report a second time, replacing the Vannoy project search criteria with “The Entire Database.”

Unfortunately, not everyone that I need has taken the Family Finder test, so I’ll be contacting a few men, asking if I can sponsor their upgrades.

Let’s move on to our next tactic, using the wives’ surnames.

Search Utilizing the Wife’s Surname

We already know that we can’t rely on the Shepherd surname, so we’ll have to utilize the surnames of the other three wives:

  • Millicent Henderson – parents Thomas Henderson born circa 1730 Virginia, died 1806 Laurens, SC, wife Frances, surname unknown
  • Elizabeth Ray (Raye) – parents William Ray born circa 1725/1730 Herdford, England, died 1783 Wilkes Co., NC (the portion now Ashe Co.), wife Elizabeth Gordon born circa 1783 Amherst Co., VA and died 1804 Surry Co., NC
  • Sarah Hickerson – parents Charles Hickerson born circa 1725 Stafford Co., VA, died before 1793 Wilkes Co., NC, wife Mary Lytle

Utilizing the Family Finder match search function, I’m going to search for matches that include the wives’ surnames, but are NOT descended from the Vannoy line.

Hickerson produced no non-Vannoy matches utilizing the matches of my first Vannoy cousin, but Henderson is another matter entirely.

Since the Henderson line would be on my cousin’s father’s side, the matches that are most relevant are the ones phased to his paternal line, those showing the blue person icon.

The surname that you have entered as the search criteria will show as blue in the Ancestral Surname list, at far right, and other matching surnames will show as black. Please note that this includes surnames from ANY person in the match’s tree if they have uploaded a Gedcom file, not just surnames of direct ancestral lines. Therefore, if the match has a tree, it’s important to click on the pedigree icon and search for the surname in question. Don’t assume.

Altogether, there are 76 Henderson matches, of which 17 are phased to his paternal line. You’ll need to review at least each of those 17. Personally, I would painstakingly review each one of the 76. You never know where a shred of information will be found.

Please note, finding a match with a common surname DOES NOT MEAN THAT YOU MATCH THIS PERSON THROUGH THAT SURNAME. Even finding a person with a common ancestor doesn’t mean that your shared DNA descends from that ancestor. You may have a second common ancestor. It means that you have more work to do to prove the connection, but it’s the beginning you need.

Of course, the first thing we need to do is eliminate any matches who also descend from a Vannoy, because there is no way to know if the matching DNA is through the Vannoy or Henderson lines. However, first, take note of how that person descends from the Vannoy line.
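If you keep your match list in a spreadsheet or export, the elimination step can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the match names, the surname lists and the function name `split_by_excluded_surname` are all invented, and a surname list alone never proves how someone descends, so anything set aside still deserves a look at the tree.

```python
# Hypothetical match records; names and surname lists are invented for illustration.
matches = [
    {"name": "Match 1", "surnames": ["Henderson", "Vannoy", "Smith"]},
    {"name": "Match 2", "surnames": ["Henderson", "Jones"]},
    {"name": "Match 3", "surnames": ["Hickerson", "Brown"]},
]

def split_by_excluded_surname(matches, wanted, excluded):
    """Return (keep, set_aside) for matches that list the wanted surname.

    keep      -> list the wanted surname but not the excluded one
    set_aside -> list both, so the shared DNA could come from either line
    """
    wanted, excluded = wanted.lower(), excluded.lower()
    keep, set_aside = [], []
    for match in matches:
        surnames = {s.lower() for s in match["surnames"]}
        if wanted not in surnames:
            continue  # not a match for the surname being searched
        (set_aside if excluded in surnames else keep).append(match["name"])
    return keep, set_aside

keep, set_aside = split_by_excluded_surname(matches, "Henderson", "Vannoy")
print(keep)       # ['Match 2']
print(set_aside)  # ['Match 1']
```

The set-aside list is kept rather than discarded so that you can still note how each of those people descends from the Vannoy line.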

You can see your match’s entire surname list by clicking on their profile picture.

The surname, Ray, is more difficult, because the search for Ray also returns names like Bray and Wray, as well as Ray.
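If you have copied surname lists into your own file, one way around the Bray and Wray problem is to test for the surname as a whole word instead of as a fragment. The sketch below uses Python’s standard `re` module; the function name `lists_surname` and the sample lists are my own assumptions, not anything the vendor provides.

```python
import re

def lists_surname(surname_list, target):
    """True if any entry in surname_list contains target as a whole word."""
    # \b marks a word boundary, so "Ray" will not match inside "Bray" or "Wray".
    pattern = re.compile(r"\b" + re.escape(target) + r"\b", re.IGNORECASE)
    return any(pattern.search(entry) for entry in surname_list)

print(lists_surname(["Bray", "Wray"], "Ray"))   # False
print(lists_surname(["Ray", "Gordon"], "Ray"))  # True
print(lists_surname(["Raye"], "Ray"))           # False
```

Note that a whole-word test also skips spelling variants such as Raye, so each variant needs its own search.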

But Wait – There’s a Happy Ending!

If you’re thinking, “this is a lot of work,” yes, it is.

Yes, you are absolutely going to do the genealogy of the wives’ lines so you can recognize if and how your matches might connect.

I enter the wives’ lines into my genealogy software and then I search for the ancestors found in my matches’ trees to see if they descend from that line.

One tip to make this easier is to test multiple people in the same line – regardless of whether they are males or carry the desired surname. They simply need to be descendants – that’s the beauty of autosomal DNA and why I carry kits with me wherever I go.  And yes, I’m really serious about that!

When you have multiple testers from the same line, you can utilize each test independently, searching for each surname in the Family Finder results. Then, from the surname match list, select a sibling or other close relative with that same surname in their list, then choose the ICW feature. This allows you to see the people both testers match who also carry the Henderson surname in their surname lists.
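Conceptually, this combination of a surname search and the ICW feature is a set intersection. Here is a minimal Python sketch of the idea, assuming each tester’s matches are held as a dictionary of match name to surname list; the data and the function name `in_common_with` are invented for illustration and this is not how the vendor implements the feature.

```python
def in_common_with(matches_a, matches_b, surname):
    """Names matched by both testers whose surname list includes `surname`."""
    surname = surname.lower()

    def carriers(matches):
        # Keep only the matches that list the surname being searched.
        return {name for name, surnames in matches.items()
                if surname in (s.lower() for s in surnames)}

    return sorted(carriers(matches_a) & carriers(matches_b))

# Invented match lists for two testers from the same line.
cousin = {
    "Match 1": ["Henderson", "Smith"],
    "Match 2": ["Henderson"],
    "Match 3": ["Jones"],
}
sibling = {
    "Match 2": ["Henderson", "Brown"],
    "Match 3": ["Jones"],
    "Match 4": ["Henderson"],
}

shared = in_common_with(cousin, sibling, "Henderson")
print(shared)  # ['Match 2']
```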

Not successful with that initial cousin’s match results – like I wasn’t with Hickerson?

Rinse and repeat, with every single person who you can find who has descended from the line in question. I started the process over again with a second cousin and a Hickerson search.

About the time you’re getting really, really tired of looking at all of those trees, extending the branches of other people’s lines, and are about to give up and go to bed because it’s 3 AM and you’re discouraged, you see something like this:

Yep, it’s good old Charles Hickerson and Mary Lytle.  I could hardly believe my eyes!!! This Hickerson match to a cousin in my Vannoy line descends from Charles Hickerson’s son, Joshua.

All of a sudden…it’s all worthwhile! Your fatigue is gone, replaced by adrenalin and you couldn’t sleep now if your life depended on it!

Using the ICW (in common with) feature to find additional known cousins who match the person with Charles Hickerson and Mary Lytle in their tree, I found a total of three Vannoy cousins with significant matches.

Using the chromosome browser to compare, I’ve confirmed that one segment is a triangulated match of 12.69 cM (blue) on chromosome 2.
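For anyone working from downloaded segment data, the overlap test behind triangulation can be sketched as follows. This is a simplified illustration under stated assumptions: the start and end positions are invented (they are not the actual chromosome 2 segment), the function name `common_overlap` is hypothetical, and a real check would also apply a minimum cM threshold, which can’t be computed from positions alone.

```python
def common_overlap(segments):
    """Return (chromosome, start, end) shared by every segment, or None.

    Each segment is (chromosome, start, end) for one pairwise comparison.
    """
    chromosomes = {seg[0] for seg in segments}
    if len(chromosomes) != 1:
        return None  # no segments, or segments on different chromosomes
    start = max(seg[1] for seg in segments)  # latest starting point
    end = min(seg[2] for seg in segments)    # earliest ending point
    if start >= end:
        return None  # the segments do not all overlap
    return (chromosomes.pop(), start, end)

# Invented positions for the three pairwise comparisons.
pairwise = [
    (2, 10_000_000, 25_000_000),  # cousin 1 vs. Hickerson descendant
    (2, 12_000_000, 24_000_000),  # cousin 2 vs. Hickerson descendant
    (2, 11_000_000, 26_000_000),  # cousin 1 vs. cousin 2
]
print(common_overlap(pairwise))  # (2, 12000000, 24000000)
```

The third comparison matters: the two cousins must also match each other on the same stretch, otherwise the segment is not triangulated.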

You can read more about triangulation in the article, Concepts – Why Genetic Genealogy and Triangulation? as well as the article, Concepts – Match Groups and Triangulation.

Do I wish I had more than three people in my triangulation group? Yes, of course, but with a match of this size triangulated between cousins and a Hickerson descendant who is a 30-year genealogist, sporting a relatively complete tree and no other common lines, it’s a great place to begin digging deeper! This isn’t the end, but a new beginning!

After obsessively digging through the matches of every Elijah Vannoy descended cousin I can find (sleep is overrated anyway) and whose account I have access to, I have now discovered matches with four additional people who have no other common lines with the Vannoy cousins and who descend from Charles Hickerson and Mary Lytle through sons David and Joseph Hickerson. I can’t tell if they triangulate, because I don’t have access to their accounts, so I’ve sent e-mails requesting additional information.

WooHoo Happy Day!!! There’s a really big crack in the brick wall and I’ve just witnessed the sunrise of a beautiful, amazing day.

I think Elijah’s parents are…drum roll…Daniel Vannoy and Sarah Hickerson!

Which walls do you need to fall and how can you use this technique?

______________________________________________________________________

Standard Disclosure

This standard disclosure will now appear at the bottom of every article in compliance with the FTC Guidelines.

Hot links are provided to Family Tree DNA, where appropriate. If you wish to purchase one of their products, and you click through one of the links in an article to Family Tree DNA, or on the sidebar of this blog, I receive a small contribution if you make a purchase. Clicking through the link does not affect the price you pay. This affiliate relationship helps to keep this publication, with more than 850 articles about all aspects of genetic genealogy, free for everyone.

I do not accept sponsorship for this blog, nor do I write paid articles, nor do I accept contributions of any type from any vendor in order to review any product, etc. In fact, I pay a premium price to prevent ads from appearing on this blog.

When reviewing products, in most cases, I pay the same price and order in the same way as any other consumer. If not, I state very clearly in the article any special consideration received. In other words, you are reading my opinions as a long-time consumer and consultant in the genetic genealogy field.

I will never link to a product about which I have reservations or qualms, either about the product or about the company offering the product. I only recommend products that I use myself and bring value to the genetic genealogy community. If you wonder why there aren’t more links, that’s why and that’s my commitment to you.

Thank you for your readership, your ongoing support and for purchasing through the affiliate link if you are interested in making a purchase at Family Tree DNA.

Concepts – Mirror Trees

What are mirror trees, and why would I ever want to use one?

Great question.

You’ll hear genealogists, especially adoptees or persons trying to find a missing parent, mention using mirror trees.

Mirror trees are a technique that genealogists use to help identify a missing common ancestor by recreating the tree of a match and strategically attaching your DNA to their tree to see who you match that descends from which line in their tree.

I have used mirror trees to attempt to determine the common line of a close cousin whose common ancestor (with me) I simply CANNOT discover. Notice the words “attempt to.”  Mirror trees are not a sure-fire answer, and they can sometimes lead you astray.

Foundation Concept

The foundation concept of a mirror tree is very straightforward.

Let’s say you match Susie as a second cousin. This means that you should share a great-grandparent with Susie. A relationship this close OUGHT to be relatively simple to figure out – except sometimes it isn’t.

Note that vendor relationship estimates are just that, estimates of relatedness based on total and longest cM, and they can be off in either direction.

In the case of third cousins or closer, vendor estimates are generally pretty accurate.

You can view the ranges of cMs and relationships in this chart.

Of course, when you match someone, you don’t know who the common ancestor is, nor do you necessarily have access to their pedigree chart or tree. If you do, and you can easily see the identity of the common ancestral couple, that’s great – but life isn’t always that simple.

In Practice

In my case, I match Susie, and no place in our trees, at ALL, is a common ancestor, let alone three generations back in time. Furthermore, her entire line and my father’s line were all from Appalachia, so common geography doesn’t help.

We matched at Ancestry, so we both uploaded to GedMatch, where we match almost exactly the same, and the relationship prediction is the same as well. Someplace, in one of our trees, is an NPE, a misattributed parentage – because both of our trees are complete back beyond those generations.

Uh oh.

So, I created a private tree in my Ancestry account, duplicating Susie’s tree and extending it at least one generation beyond great-grandparents, just in case the estimate is wrong. Then, I connected my DNA to her tree, as her.

In my case, I have two DNA tests at Ancestry, my V1 results and my V2 results. I never really thought about this as a way to keep one set of results working for me, connected to my own tree, and to have a second set of results to connect to mirror trees – but that’s exactly what I’ve done. I utilize the second set of results as my “working on a problem” results while the first set of results just stays connected to my own tree.

After connecting my DNA results to the mirror tree and giving Ancestry a couple of days to cycle through, creating connections and green leaf “shared ancestor” hints, I checked to see who my DNA attached to her tree says I match, and which line in her tree “lights up” with match hints. If I can’t tell by connecting my DNA as her, I can also connect my DNA to her parents and grandparents, one at a time – again – looking for green leaf shared ancestor hints in those lines. No hints = wrong line.

This process shows me in which of her lines our common lineage is found – even if I can’t exactly pinpoint the common ancestors just yet.

Instructions

I had planned to provide step by step directions for how to create a mirror tree and then how to utilize the results, but then I discovered that someone else has done an absolutely wonderful job of writing mirror tree instructions. There is absolutely no reason to recreate the wheel, so I’m linking to two articles from the blog, Resurrecting Roots, as follows:

After building a mirror tree, their next article explains what to do next.

Now, if I could just figure out that common ancestor with my second cousin match. You may encounter the same type of challenge.

If the right people haven’t tested yet, you may not be able to achieve your goal on the first try. Or, in my case, it appears that we may have more than one common ancestor – complicating matters a bit. If this happens to you, wait a few weeks or months and connect the tree again, or build it out another generation to increase your chances of a green leaf hint.

The great thing about genetic genealogy is that more people are testing every single day. Give mirror trees a try if you’re an adoptee, trying to find an unidentified family member in a relatively close generation, or are being driven absolutely batty with a relatively close match that you can’t solve!

If you need help solving these types of problems, I suggest contacting dnaadoption and taking one of their classes.  They aren’t just for adoptees.
