MyHeritage LIVE Conference Day 2 – The Science Behind DNA Matching    

The MyHeritage LIVE Oslo conference is but a fond memory now, and I would count it as a resounding success.

Perhaps one of the reasons I enjoyed it so much is the scientific aspect and because the content is very focused on a topic I enjoy without being the size and complexity of RootsTech. The smaller, more intimate venue also provides access to the “right” people as well as the ability to meet other attendees and not be overwhelmed by the sheer size.

Here are some stats:

  • 401 registered guests
  • 28 countries represented including distant places like Australia and South America
  • More than 20 speakers plus the hands-on workshops where specialist teams worked with students
  • 38 sessions and workshops, plus the party
  • 60,000 livestream participants, in spite of the time differences around the world

I was blown away by the number of livestream attendees.

I don’t know what criteria Gilad Japhet will be using to determine “success” but I can’t imagine this conference being judged as anything but.

Let’s take a look at the second day. I spent part of the time talking to people and drifting in and out of the rear of several sessions for a few minutes. I meant to visit some of the workshops, but there was just too much good, distracting content elsewhere.

I began Sunday in Mike Mansfield’s presentation about SuperSearch. Yes, I really did attend a few sessions not about DNA, but my favorite was the session on Improved DNA Matching.

Improved DNA Matching

I’m sure it won’t surprise any of my readers that my favorite presentations were about the actual science of genetic genealogy.

Consumers don’t really need to understand the science behind autosomal results to reap the benefits, but the underlying science is part of what I love – and it’s important for me to understand the underpinnings to be able to unravel the fine points of what the resulting matches are and are not revealing. Misinterpretation of DNA results leading to faulty conclusions is a real issue in genetic genealogy today. Consequently, I feel that anyone working with other people’s results and providing advice really needs to understand how the science and technology work together.

Dr. Daphna Weissglas-Volkov, a population geneticist by training, although she clearly functions far beyond that scope today, gave a very interesting presentation about how MyHeritage handles (their greatly improved) DNA Matching. I’m hitting the high points here, but I would strongly encourage you to watch the video of this session when it is made available online.

In addition to Dr. Weissglas-Volkov’s slides, I’ve added some additional explanations and examples in various places. You can easily tell that the slides are hers and the graphics that aren’t MyHeritage slides are mine.

Dr. Weissglas-Volkov began the session by introducing the MyHeritage science team and then explaining terminology to set the stage.

A match is when two people match each other on a fairly long piece of DNA. Of course, “fairly long” is defined differently by each vendor.

Your genetic map (of your chromosomes) is composed of the DNA you inherit from different ancestors through the process of recombination, which occurs when DNA is passed from the parents to the child. A centiMorgan is a measure of the relative likelihood that a recombination will occur within a stretch of DNA in a single generation. On average, 36 recombinations occur in each generation, meaning that the DNA on the chromosomes is divided into pieces at those points. However, women, for reasons unknown, have about 1.5 times as many recombinations as men.

You can’t see that when looking at an example of a person compared to their parents, of course, because each individual is a full match to each parent, but you can see this visually when comparing a grandchild to their maternal grandmother and their paternal grandmother on a chromosome browser.

The above illustration is the same female grandchild compared to her maternal grandmother, at left, and her paternal grandmother at right. Therefore the number of crossovers at left is through a female child (her mother), and the number at right is through a male child (her father.)

# of Crossovers
Through female child – left 57
Through male child – right 22

There are more segments at left, through the mother, and the segments are generally shorter, because they have been divided into more pieces.

At right, fewer and larger segments through the father.

Keep in mind that because you have a strand of DNA from each parent, with exactly the same “street addresses,” that what is produced by DNA sequencing are two columns of data – but your Mom’s and Dad’s DNA is intermixed.

The information in the two columns can’t be identified as Mom’s or Dad’s DNA or strand at this point.

That interspersed raw data is called a genotype. A haplotype is when Mom’s and Dad’s DNA can be reassembled into “sides” so you can attribute the two letters at each address to either Mom or Dad.

Here’s a quick example.

The goal, of course, is to figure out how to reassemble your DNA into Mom’s side and Dad’s side so that we know that someone matching you is actually matching on all As (Mom) or all Gs (Dad,) in this example, and not a false match that zigzags back and forth between Mom and Dad.

The best way to accomplish that goal of course is trio phasing, when the child and both parents are available, so by comparing the child’s DNA with the parents you can assign the two strands of the child’s DNA.
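For readers who like to see the logic spelled out, here is a minimal Python sketch of the idea behind trio phasing. It is only an illustration under simple assumptions, not MyHeritage's method: each site is an unordered pair of letters, and the function names `phase_site` and `phase_trio` and the sample data are hypothetical.

```python
def phase_site(child, mother, father):
    """Phase one site. Each argument is an unordered pair of alleles, e.g. ("A", "G").

    Returns (allele_from_mother, allele_from_father), or None when the
    parents' genotypes cannot tell the child's two alleles apart.
    """
    first, second = child
    options = set()
    # Possibility 1: first allele came from Mom, second from Dad.
    if first in mother and second in father:
        options.add((first, second))
    # Possibility 2: the reverse assignment.
    if second in mother and first in father:
        options.add((second, first))
    # Exactly one consistent assignment means the site is phased.
    return options.pop() if len(options) == 1 else None


def phase_trio(child_sites, mother_sites, father_sites):
    """Phase every site in order; ambiguous sites come back as None."""
    return [phase_site(c, m, f)
            for c, m, f in zip(child_sites, mother_sites, father_sites)]


# Hypothetical three-site example (made-up data).
child = [("A", "G"), ("G", "A"), ("A", "G")]
mother = [("A", "A"), ("A", "C"), ("A", "G")]
father = [("G", "G"), ("G", "G"), ("A", "G")]
phased = phase_trio(child, mother, father)
```

In this made-up example the first two sites phase cleanly to Mom's A and Dad's G, while the third stays unresolved because all three people carry both letters at that address.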

Unfortunately, few people have both or even one parent available in order to actually divide their DNA into “sides,” so the next best avenue is statistical phasing. I’ve called this academic phasing in the past, as compared to parental phasing which MyHeritage refers to as trio phasing.

There’s a huge amount of confusion about phasing, with few people understanding there are two distinct types.

Statistical phasing is a type of machine learning where a large number of reference populations are studied. Since we know that DNA travels together in blocks when inherited, statistical phasing learns which DNA travels with which buddy DNA – and creates probabilities. Your DNA is then compared to these models and your DNA is reshuffled in order to assemble your DNA into two groups – one representing your Mom’s DNA and one representing your Dad’s DNA, according to statistical probability.

Looking at your genotype, if we know that As group together at those 6 addresses in my example 95% of the time, then we know that the most likely scenario to create a haplotype is that all of the As came from one parent and all of the Gs from the other parent – although without additional information, there is no way to yet assign the maternal and paternal identifier. At this point, we only know parent 1 and parent 2.
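To make the "most likely scenario" idea concrete, here is a toy Python sketch. It assumes we already have a small table of how often each haplotype appears in a reference panel, then picks the pair of haplotypes that explains the genotype with the highest combined frequency. The function name `best_phase` and the frequencies are hypothetical, and real phasing software uses far more sophisticated models than this brute-force search.

```python
from itertools import product


def best_phase(genotype, hap_freq):
    """Pick the most probable pair of haplotypes for an unphased genotype.

    genotype: list of unordered allele pairs, one per address.
    hap_freq: dict mapping a haplotype string to its (assumed) frequency
              in a reference panel.
    Returns a sorted tuple (parent 1, parent 2), or None if no pair of
    known haplotypes can explain the genotype.
    """
    best, best_p = None, 0.0
    # Try every way of splitting the two letters at each address.
    for choice in product([0, 1], repeat=len(genotype)):
        h1 = "".join(site[c] for site, c in zip(genotype, choice))
        h2 = "".join(site[1 - c] for site, c in zip(genotype, choice))
        p = hap_freq.get(h1, 0.0) * hap_freq.get(h2, 0.0)
        if p > best_p:
            best, best_p = tuple(sorted((h1, h2))), p
    return best


# Six addresses, each showing one A and one G (made-up data).
genotype = [("A", "G")] * 6
hap_freq = {"AAAAAA": 0.5, "GGGGGG": 0.4, "AGAGAG": 0.05, "GAGAGA": 0.05}
result = best_phase(genotype, hap_freq)
```

With these invented frequencies, the all-A and all-G haplotypes win easily over the zigzag alternatives. As in the text, the result only tells us parent 1 and parent 2, not which one is Mom.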

In order to train the computers (machine learning) to properly statistically phase testers’ results, MyHeritage uses known relationships of people to teach the machines. In other words, their reference panels of proven haplotypes grow all of the time as parent/child trios test.

Dr. Weissglas-Volkov then moved on to imputation.

When sequencing DNA, not every location reads accurately, so the missing values can be imputed, or “put back” using imputation.

Initially imputation was a hot mess. Not just for MyHeritage, but for all vendors, imputation having been forced upon them (and therefore us) by Illumina’s change to the GSA chip.

However, machine learning means that imputation models improve constantly, and matching using imputation is greatly improved at MyHeritage today.

Imputation can do more than just fill in blanks left by sequencing read errors.

The benefit of imputation to the genetic genealogy community is that it allows results from disparate chips to be compared. Vendors that want to allow uploads use imputation to create a global template that incorporates all of the locations from each vendor, then impute the values a given chip doesn’t actually test to complete the full template for each person.

In the example below, you can see that no vendor tests all available locations, but when imputation extends the sequences of all testers to the full 1-500 locations, the results can easily be compared to every other tester because every tester now has values in locations 1-500, regardless of which vendor/chip was utilized in their actual testing.

Therefore, using imputation, MyHeritage is able to match between quite disparate chips, such as the traditional Illumina chips (OmniExpress), the custom Ancestry chip and the new GSA chip utilized by 23andMe and LivingDNA.
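The template idea can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, assuming three hypothetical chips that each test a different slice of locations 1-500. It shows which locations would need imputed values, not how the values themselves are inferred, and the names `build_template` and `positions_to_impute` are mine.

```python
def build_template(chip_positions):
    """Return the sorted union of every location tested by any chip."""
    template = set()
    for positions in chip_positions.values():
        template |= set(positions)
    return sorted(template)


def positions_to_impute(template, tested):
    """Return the template locations that a given chip did not test."""
    tested = set(tested)
    return [p for p in template if p not in tested]


# Hypothetical chips covering different parts of locations 1-500.
chips = {
    "chip_a": range(1, 301),     # tests locations 1-300
    "chip_b": range(150, 451),   # tests locations 150-450
    "chip_c": range(250, 501),   # tests locations 250-500
}
template = build_template(chips)
missing_for_a = positions_to_impute(template, chips["chip_a"])
```

Once every tester has a value, tested or imputed, at each location in the template, any two testers can be compared regardless of which chip was used.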

So, how are matches determined?

Matching

First your DNA and that of another person are scanned for nearly identical seed sequences.

A minimum segment length of 6cM must be identified for further match processing to occur. Anything below 6cM is discarded at this point.

The match is then further evaluated to see if the seed match is of a high enough quality that it should be perfected and should count as a match. Other segments continue to be evaluated as well. If the total matching segment(s) is 8 total cM or greater, it’s considered a valid match. MyHeritage has taken the position that they would rather give you a few accidental false matches than to miss good matches. I appreciate that position.

Window cleaning is how they refer to the process of removing pileup regions known to occur in the human genome. This is NOT the same as Ancestry’s routine that removes areas they determine to be “too matchy” for you individually.

The difference is that in humans, for example, there is a segment of chromosome 6 where, for some reason, almost all humans match. Matching across that segment is not informative for genetic genealogy, so that region along with several others similar in nature are removed. At Ancestry, those genome-wide pileup segments are removed, along with other regions where Ancestry decides that you personally have too many matches. The problem is that for me, these “too matchy” segments are many of my Acadian matches. Acadians are endogamous, so lots of them match each other because as a small intermarried population, they share a great deal of the same DNA. However, to me, because I have one great-grandfather that’s Acadian, that “too matchy” information IS valuable although I understand that it wouldn’t be for someone that is 100% Acadian or Jewish.

In situations such as Ashkenazi Jewish matching, which is highly endogamous, MyHeritage uses a higher matching threshold. Otherwise every Ashkenazi person would match every other Ashkenazi person because they all descend from a small founder population, and for genealogy, that’s not useful.

The last step in processing matches is to establish the confidence level that the match is accurately predicted at the correct level – meaning the relationship range based on the amount of matching DNA and other criteria.

For example, does this match cluster with other proven matches of the same known relationship level?

From several confidence ascertainment steps, a confidence score is assigned to the predicted relationship.

Of course, you as a customer see none of this background processing, just the fact that you do match, the size of the match and the confidence score. That’s what genealogists need!

Matching Versus Triangulation Thresholds

Confusion exists about matching thresholds versus triangulation thresholds.

While any single segment must be at least 6 cM in length for the matching process to begin, the actual match threshold at MyHeritage is a total of 8 cM.

I took a look at my lowest match at MyHeritage.

I have two segments, one 6.1 cM segment, and one 6 cM segment that match. It would appear that if I only had one 6 cM segment, it would not show as a match because I didn’t have the minimum 8 cM total.
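Here is a small Python sketch of the two thresholds as I understand them from the presentation: segments shorter than 6 cM are dropped, and the remaining segments must total at least 8 cM. The function name and constants are my own, and the real pipeline includes quality checks that are not modeled here.

```python
MIN_SEGMENT_CM = 6.0   # individual segment threshold
MIN_TOTAL_CM = 8.0     # total shared cM required to count as a match


def is_match(segment_lengths_cm):
    """Return True if the segments pass both thresholds described above."""
    # Discard anything below the individual segment threshold.
    kept = [cm for cm in segment_lengths_cm if cm >= MIN_SEGMENT_CM]
    # The surviving segments must add up to the total threshold.
    return sum(kept) >= MIN_TOTAL_CM


lowest_match = is_match([6.1, 6.0])   # my lowest match: two segments
single_segment = is_match([6.0])      # a lone 6 cM segment
```

Under these assumptions my lowest match passes with 12.1 cM total, while a single 6 cM segment would not.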

Triangulation Threshold

However, after you pass those matching criteria and move on to triangulation with a matching individual, you have the option of selecting the triangulation threshold, which is not the same thing as the match threshold. The match threshold does not change, but you can change the triangulation threshold from 2 cM to 8 cM and selections in between.

In the example below, I’m comparing myself against two known relatives.

You won’t be shown any matches below the 6 cM individual segment threshold, BUT you can view triangulated segments of different sizes. This is because matching segments often don’t line up exactly and the triangulated overlap between several individuals may be very small, but may still be useful information.

Flying your mouse over the location in the bubble, which is the triangulated segment, tells you the size of the triangulated portion. If you selected the 2 cM triangulation, you would see smaller triangulated portions of matches.
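The reason a triangulated portion can be much smaller than the matches themselves is easy to see in a sketch. Assuming, for simplicity, that each segment is given as a start and end position on a cM scale (real tools work from base-pair positions and a genetic map), the triangulated portion is just the stretch that all of the segments share. The function names and numbers below are hypothetical.

```python
def triangulated_overlap(segments):
    """Return the (start, end) stretch shared by every segment, or None."""
    start = max(s for s, _ in segments)
    end = min(e for _, e in segments)
    return (start, end) if end > start else None


def passes_threshold(segments, threshold_cm):
    """True if the shared stretch is at least threshold_cm long."""
    overlap = triangulated_overlap(segments)
    return overlap is not None and (overlap[1] - overlap[0]) >= threshold_cm


# Made-up segments: me vs. relative A, me vs. relative B, A vs. B.
segments = [(10.0, 30.0), (25.0, 45.0), (20.0, 28.5)]
overlap = triangulated_overlap(segments)
```

In this example each match is well over 6 cM, yet the common overlap is only 3.5 cM, so it would appear at the 2 cM triangulation setting but not at 8 cM.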

Closing Session

The conference was closed by Aaron Godfrey, a super-nice MyHeritage employee from the UK. The closing session is worth watching on the recorded livestream when it becomes available, in part because there are feel-good moments.

However, the piece of information I was looking for was whether there will be a MyHeritage LIVE conference in 2019, and if so, where.

I asked Gilad afterwards and he said that they will be evaluating the feedback from attendees and others when making that decision.

So, if you attended or joined the livestream sessions and found value, please let MyHeritage know so that they can factor your feedback into their decision. If there are topics you’d like to see as sessions, I’m sure they’d love to hear about that too. Me, I’m always voting for more DNA😊

I hope to hear about MyHeritage LIVE 2019, and I’m voting for any of the following locations:

  • Australia
  • New Zealand
  • Israel
  • Germany
  • Switzerland

What do you think?

DNA Painter – Touring the Chromosome Garden

This is the third article in a series about DNA Painter. To know DNA Painter is to love DNA Painter! Trust me!

The first two articles are:

The Chromosome Sudoku article introduces you to DNA Painter, its purpose and how to use the tool. The Mining Vendor Data article illustrates exactly how to find the segments you can paint from each of the main autosomal testing vendors and GedMatch.

This article is a leisurely tour through my colorful chromosome garden so that, together, we can see examples of how to utilize the information that chromosome painting unveils.

Chromosome painting can do amazing things: walk you back generations, show visual phasing…and reveal that there’s a mistake someplace, too.

If you’re not willing to be wrong and reconsider, this might not be the field for you😊

Automatic Triangulation

Chromosome painting automatically, mathematically triangulates your DNA, and in a much easier way than the old spreadsheet method. In fact, triangulation just happens, effortlessly, IF you can determine which side is maternal and which side is paternal. Of course, you’ll always want to check to be sure that your matches also match each other. If not, then that’s an indication that maybe one or both are identical by chance.

The definition of triangulation in this context means:

  • To find a common segment
  • Of reasonable size (generally 7cM or over)
  • That is confirmed to a common ancestor with at least two other individuals
  • Who are not close family

Close family generally means parents, siblings, sometimes grandparents, although parents and grandparents can certainly be used to verify that the match is valid. The best triangulation situation is when you match those two other people through a second child, meaning siblings of your ancestor.

Different matches, depending on the circumstances, have a different level of value to you as a genealogist. In other words, some are more solid than others.

The X chromosome has special matching and triangulation rules, so we’ll talk about that when we get to that section.

Don’t think of chromosome painting as “doing” triangulation, because triangulation is a bonus of chromosome painting, and it just happens, automatically, so long as you can confirm that the segment is from either your maternal or paternal line.

What does triangulation look like in DNA Painter?

Here’s what my painted chromosome 15 looks like.

Here, I’ve drawn boxes around the areas that are triangulated. Actually, I made a small mistake and omitted one grey bar that’s also part of a second triangulation group. Can you spot it? Hint – look at the grey bars at far right in the overlapping triangulation group boxes where the red arrow is pointing. The box below should extend upwards to incorporate part of that top grey bar too.

Triangulation is what you see when several segments are piled up on top of each other. It means they match you at the same address on either the maternal or paternal chromosome. That’s good, but it’s not the same as an official “pileup area.”

Ok, so what’s a pileup area?

Pileup Areas

Certain locations in the human genome have been designated as pileup regions based on the fact that many people will match on these segments, not necessarily because they share a common relatively recent ancestor, but instead because a particular segment has a very high frequency in the general human population, or in the population of a specific region. Translated, this means that the segment might not be relevant to genealogy.

But before going too far with this discussion, it doesn’t mean that matches in pileup regions aren’t relevant to genealogy – just consider it a caution sign.

Aside from chromosome 6, which includes the HLA region, I’ve always been rather suspicious of pileup regions, because they don’t seem to hold true for me. You can view a chart that I assembled of the known pileup regions here.

DNA Painter generously includes pileup region warnings, in essence, along a chromosome bar at the top indicating “shared” or “both.”

Please note that you can click to enlarge any image.

Pileup regions are indicated by the grey hashed region at right. In my case, on chromosome 1, the pileup region isn’t piled up at all, on either the paternal (blue) chromosome or the maternal (pink) chromosome.

As you can see, I have exactly one match on the maternal side (green) and one (gold) on the paternal side (with a smidgen of a second grey match) as well, with both extending significantly beyond the pileup region. There is no reason to suspect that these gold and green matches aren’t valid.

If I saw many more matches in a pileup region than elsewhere, or many small matches, or DNA that was supposed to be from multiple ancestors not in the same line, then I’d have to question whether a pileup region was responsible.
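One way to put a number on "extending significantly beyond the pileup region" is to ask what fraction of a match falls inside it. The Python sketch below is just my own rule of thumb expressed as code, with hypothetical positions. It is not something DNA Painter calculates for you.

```python
def fraction_inside(segment, region):
    """Return the fraction of a segment's length that lies inside a region.

    Both arguments are (start, end) positions on the same chromosome,
    in the same units.
    """
    seg_start, seg_end = segment
    reg_start, reg_end = region
    overlap = max(0, min(seg_end, reg_end) - max(seg_start, reg_start))
    return overlap / (seg_end - seg_start)


# Hypothetical pileup region and two made-up matches.
pileup = (180, 220)
long_match = fraction_inside((100, 200), pileup)    # mostly outside
small_match = fraction_inside((185, 195), pileup)   # entirely inside
```

A match that sits mostly outside the region, like the first one, gives me little reason for suspicion. A small match that lies entirely inside deserves a closer look.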

Stacked Segments

DNA Painter provides you with the opportunity to see which of your ancestors’ segments stack. Stacking is a very important concept of DNA painting.

Before we talk about stacking, notice that the legend for which segments are color coded to specific ancestors is located at right. You can also click on the little grey box beside “Shared or Both,” at left, to show the match names beside the segments.  This is very useful when trying to analyze the accuracy of the match.

I wish DNA Painter offered an option to paint the ancestor’s names beside the segments. Maybe in V2. It’s really difficult to complain about anything because this tool is both free and awesome.

I’m using PowerPoint to label this group of stacked matches for this example.

This is a situation where I know my pedigree chart really well, so I know immediately upon looking at this stacked segment group who this piece of DNA descends from.

Here’s my pedigree chart that corresponds to the stacked segment.

We attribute each DNA segment to a couple initially based on who we match. In this case, that’s William George Estes and Ollie Bolton, my grandparents. The DNA remains attributed to them until we have evidence of which individual person in the couple received that DNA from their ancestors and passed it on to their descendant.

Therefore, the pink people are the half of the couple who we now know (thanks to DNA Painter) did NOT contribute that DNA segment, because we can track the DNA directly through the yellow line until we’re once again to another genetic brick wall couple.

My father is listed at left, and the DNA path runs back to William Crumley the second and his unknown wife who is haplogroup H2a1, the yellow couple at far right. How cool is this? One of those ancestors (or a combined segment from both) has been passed intact to me today. This is not a trivial segment either at 23.3 cM. I would not expect a segment passed to 5th cousins to be that large, but it is!

Also, note that the grey segment of DNA from Lazarus Estes (1848-1918) and Elizabeth Vannoy (1847-1918) is sitting slightly to the left of the dark blue segment from William Crumley III, so part or all of the grey or blue segment may originate with a different ancestor. Perhaps we’ll know more when additional people test and match on this same segment.

Double Related

I have one person who is related to me through two different lines. I need a way to determine which line (or both) our common DNA segment descends from.

I painted the segment for both of our common ancestor couples. The pink is George Dodson (1702-1770) & Margaret Dagord. The bright blue segment is William Crumley III (1788-1859) & Lydia Brown.

Those two lines don’t converge, at least not that we know of.

Now, as I map additional people, I’ll watch this segment for a tie breaker match between the two ancestors. The gold is not a tie breaker because that’s my grandparents who are downstream of both the pink and blue ancestors.

Painted Ethnicity

23andMe does us the favor of painting our ethnicity segments and allowing us to download a file with those segments. Conversely, DNA Painter does us the favor of allowing us to paint that entire file at once.

I already know my two Native segments on chromosome 1 and 2 descend through my mother, because her DNA is Native in exactly the same location. In other words, in this case, my ethnicity segment does in fact phase to my mother, although that’s not always the case with ethnicity.

Multiple Acadian ancestors are also proven to be Native by both genealogical records and maternal and/or paternal haplogroups.

Therefore, I’ve painted my Native segments on my mother’s side in order to determine exactly from which ancestor(s) those Native segments descend.

Confirming Questionable Ancestors

One very long-standing mystery that seemed almost unsolvable was the identity of the parents of Elijah Vannoy (1784->1850). We know he was the son of one of 4 Vannoy brothers living in Wilkes County, NC. Two were eliminated by existing Bibles and other records, but the other two remained candidates in spite of sifting through every available record and resource. We were out of luck unless DNA came to the rescue. Y DNA confirmed that Elijah was descended from one of the Vannoy males, but didn’t shed light on which one.

I decided that the wives would be the key, since we knew the identity of all four wives, thankfully. Of course, that means we’d be using autosomal DNA to attempt to gather more information.

I entered one candidate couple at Ancestry as Elijah’s parents – the one I felt most likely based on tax records and other criteria – Daniel Vannoy and Sarah Hickerson.  I also entered Sarah’s parents, Charles Hickerson (c 1725-<1793) and Mary Lytle.

I began getting matches to people who descend from Charles Hickerson and Mary Lytle through children other than Sarah.

The grey segment is from a descendant of Lazarus Estes & Elizabeth Vannoy. The salmon segments are from descendants of Charles Hickerson and Mary Lytle.

These segments aren’t small, 12.8 and 16.1 cM, so I’m fairly confident that these multiple segments in combination with the Elizabeth Vannoy segment do indeed descend from Charles Hickerson and Mary Lytle.

At Ancestry, I have 5 matches to Charles Hickerson and Mary Lytle through three of their children. However, only two of the individuals have transferred their results to either Family Tree DNA, MyHeritage or GedMatch where segment information is available to customers.

Finally, the thirty-year-old mystery is solved!

Shifting, Sliding, Offset or Staggered Segment Groups

Occasionally, you can prove an entire large segment by groups of shifting or sliding segments, sometimes referred to as offset or staggered segments.

The entire bright pink region is inherited from Jacob Lentz (1783-1870) and Fredericka Reuhl (1788-1863.) However, it’s not proven by one individual but by a combination of 6 people whose segments don’t all overlap with each other.  The top two do match very closely with me and each other, then the third spans the two groups. The bottom 3 and part of the middle segment match very closely as well.

I can conclude that the entire dark pink region from left to right descends from Jacob and Fredericka.
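The reasoning here is the classic "merge overlapping intervals" problem: if each segment overlaps the next, the whole chain covers one continuous region. Here is a minimal Python sketch with six made-up segments arranged like mine (two on the left, one spanning the middle, three on the right). The positions are hypothetical, and the sketch assumes every segment is already confirmed to the same ancestral couple.

```python
def merge_segments(segments):
    """Merge overlapping (start, end) segments into continuous regions."""
    merged = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            # Overlaps the current region, so extend it if needed.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            # No overlap with the previous region, so start a new one.
            merged.append((start, end))
    return merged


# Six hypothetical staggered segments from six different matches.
staggered = [(10, 40), (12, 42), (35, 70), (60, 95), (62, 98), (65, 100)]
regions = merge_segments(staggered)
```

If the result is a single region, as it is here, the staggered matches together account for the whole stretch. If the third, spanning segment were missing, the result would be two separate regions with a gap between them.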

Two Matches – 7 Generations

Two matches is all it took to identify this segment back to George Dodson and Margaret Dagord.

The mustard match is to my grandparents (22cM), and the pink match is to George Dodson (1702-1770) and his wife (22cM) – 7 generations. These people also match each other.

Additional matches would make this evidence stronger, although a 22cM triangulated match is very significant alone. Future matches might also suggest ancestors further back in time.

First Chromosome Fully Mapped

I actually have chromosome 5 entirely mapped to confirmed ancestors. I’m so excited.

Uh Oh – Something’s Wrong

I found a stack that clearly indicates something is wrong.  The question is, what?

The mustard represents my paternal grandparents, so these segments could have come through either of them, although on the pedigree chart below, we can see that this came through my grandfather’s line.

There is only a small overlap with the magenta (Nicholas Speak 1782-1852 and Sarah Faires 1786-1865) and green (James Crumley 1711-1764 and Catherine c1712-c1790,) which could be by chance given that the Nicholas segment is 7.5 cM, so I’m leaving the magenta out of the analysis.

However, the rest of these segments overlap each other significantly, even though they are stepped or staggered.

As you can see from the colors on the pedigree chart, it’s impossible for the green segment to descend from the same ancestor as the purple segment. The purple and orange confirm that branch of the tree, but the red cannot be from the same ancestor or the same line as the green ancestor.

I suspect that the purple and orange line is correct, because there are 4 segments from different people with the same ancestral line.

This means that we have one of the following situations with the red and green segments:

  • The smaller segments are incorrect, false positives, meaning matching by chance. The green segment is 14 cM, so quite large to match by chance. The red segment is 10 cM. Possible, but not probable.
  • The segments are population-based matches, so appear in all 3 lines. Possible, technically, but also not probable due to the segment size.
  • The segments are genuine matches, and one of the lines is also found in one of the other lines, upstream. This is possible, but this would have to be the case with both the red and green lines. To continue to weigh this possibility, I’ll be watching for similar situations with these same ancestors.
  • Some combination of the above.

I need more matches on this segment for further clarity.

Visual Phasing – Crossovers

A crossover point is where the DNA on one side of a demarcation line is descended from one ancestor and the DNA on the other side is descended from another ancestor, represented by the pink and blue halves of the segment, below.

Crossovers occur when the DNA is combined from two different ancestors when it is passed to the child. In other words, a chunk of mom’s ancestors’ DNA is contributed by mom and a chunk of dad’s ancestors’ DNA is contributed as well. The seam between different ancestors’ DNA pieces is called a crossover.

In this example, the brown lines, confirmed by several testers to be from Henry Bolton (c1759-1846) and Nancy Mann (c1780-1841), are shown with a very specific left starting point, all in a vertical line. It looks for all the world like this is a crossover point. The DNA to the left would have been contributed by another, as yet unidentified, ancestor.

The gold lines above are matches from more recent generations.

Naming Those Unnamed Acadians

My Acadian ancestry is hopelessly intertwined, but chromosome painting may in fact provide me with some prayer of unraveling this ball of twine. Eventually.

When I know that someone is Acadian, but I can’t tell which of many lines I connect through, I add them as “Acadian Undetermined.”

There’s a lot of Acadian DNA, because it’s an endogamous population and they just keep passing the same segments around and around in a very limited population.

On my maternal chromosome, all of the olive green is “Acadian Undetermined.”  However, that blue segment in the stack is Rene de Forest (1670-1751) and Francoise Dugas (1678->1751).

In essence, this one match identified all of the DNA of the other people who are now simply a row in the Acadian Undetermined stack. Now I need to go back and peruse the trees of these individuals to determine if they descend from this line, or a common ancestor of this line, or if (some of) these matches are a matter of endogamy.

Endogamous matches can be population based, meaning that you do match each other, but it’s because you share so much of the same DNA because you have small pieces of many common ancestors – not because a particular segment comes from one specific ancestor. You can also share part of your DNA from Mom’s side and part from Dad’s side, because both of your parents descend from a common population and not because the entire segment comes from any particular ancestor.

On some long cold winter weekend, I’ll go through and map all of the trees of my Acadian matches to see what I can unravel. I just love matches with trees. You just can’t do something like this otherwise.

Of course, those Acadians (and other endogamous populations) can be tricky, no matter what, one click up from a needle in a haystack.

Acadian Endogamy Haystack on Steroids

At first, our haystack looks like we’ve solved the mystery of the identity of the stack.  However, we soon discover that maybe things aren’t as neat and tidy as we think.

Of course, the olive green is Acadian Undetermined, but the three other colored segments are:

  • Pink – Guillaume Blanchard (1650-1715/17) & Huguette Goujon (c1647-1717)
  • Brown/Pink – Francois Broussard (c1653-1716) & Catherine Richard (c1663-1748)
  • Coffee – Daniel Garceau (1707-1772) & Anne Doucet (1713-1791)

Looking at the pedigree chart, we find two of these couples in the same lineage, so all is good, until we find the third, pink, couple, at the bottom.

Clearly, this segment can’t be in two different lines at once, so we have a problem.  Or do we?

Working the pink troublesome lines on back, we make a discovery.

We find a Blanchard line consisting of Guillaume Blanchard born circa 1590 and Huguette Poirier also born circa 1590.

Interesting. Let’s compare the Guillaume Blanchard and Huguette Goujon line. Is this the same couple, but with a different surname for her?

No, as it turns out, the Guillaume Blanchard who married Huguette Goujon was the grandson of Guillaume Blanchard and Huguette Poirier. That haystack segment of DNA was passed down through two different lines, it appears, to converge in three descendants – me, the descendant of the pink segment couple and the descendant of the brown/burgundy segment couple. This segment reaches back in time to the birth of either Guillaume Blanchard or Huguette Poirier in 1590, someplace in France, rode over on the ship to Port Royal in the very early 1600s, probably before Jamestown was settled, and has been kicking around in my ancestors and their descendants ever since.

This 18 or so cM ancestral segment is buried someplace at Port Royal, Nova Scotia, but lives on in me and several other people through at least two divergent lines.

The X Chromosome

Several vendors don’t report X chromosome segments. I do use X segments from those who do, but I utilize a different threshold because the SNP density is about half of that on the other chromosomes. In essence, you need a match twice as large to be equivalent to a match on another chromosome.

Generally, I don’t rely on segments below 10 cM for anyone, and I generally only use segments over 14 cM and no less than 500 SNPs.

Having just said that, I have painted a few smaller segments, because I know that if they are inaccurate, they are very easy to delete. They can remain in speculative mode, which is the default for DNAPainter and is what I use.

The great thing about the X chromosome is that because of its special inheritance path, you can sometimes push these segments another 2 generations back in time.

Let’s use an X chromosome match in conjunction with my X fan chart printed through Charting Companion.

On the paternal X, I inherited the gold segment from the couple, William George Estes (1873-1971) & Ollie Bolton (1874-1955.) However, since my father didn’t inherit an X from William George Estes (because my father inherited the Y from his father,) that X segment has to be from Ollie Bolton, and therefore from her parents Joseph Bolton (1853-1920) and Margaret Claxton (1851-1920.)

The segment from Lazarus Estes (1848-1918) and Elizabeth Vannoy (1847-1918) that’s 14 cM is false. It can’t descend from that couple. Same for the 7.5 cM from Jotham Brown (c1740-c1799) & Phoebe unk (c1747-c1803.) That segment’s false too. The green 48 cM segment from Samuel Claxton (1827-1876) and Elizabeth Speak (1832-1907)?  That segment’s good to go!

On my mother’s side, there’s a 7.8 cM Acadian Undetermined, which must be false, because Curtis Benjamin Lore (1856-1909) did not inherit an X chromosome from his Acadian father, Antoine Lore (1805-1862/67.)  Therefore, my X chromosome has no Acadian at all. I never realized that before, and it makes my X chromosome MUCH easier.

How about that light green 33cM segment from Antoine Lore (1805-1862/67) & Rachel Hill (1814/15-1870/80)? That segment must come from Rachel Hill, so it’s pushed back another generation to Joseph Hill (1790-1871) and Nabby Hall (1792-1874.)

I love the X chromosome because when you find a male in the line, you automatically get bumped two more generations back to his mother’s parents. It’s like the X prize for genetic genealogy, pardon the pun!
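For readers who like to see the rule spelled out, here is a minimal sketch of the X inheritance path in Python. It assumes standard ahnentafel numbering (person n has father 2n and mother 2n+1) and only the rule used above: a male receives his X from his mother, while a female receives an X from both parents. The function name `x_ancestors` is my own illustration, not something from Charting Companion or DNAPainter.

```python
def x_ancestors(root_is_female, generations):
    """List, per generation, the ahnentafel numbers of ancestors who
    could have contributed X DNA to the root person (number 1)."""
    current = [(1, root_is_female)]
    result = []
    for _ in range(generations):
        next_generation = []
        for number, is_female in current:
            if is_female:
                # A female inherits an X from her father.
                next_generation.append((2 * number, False))
            # Everyone inherits an X from their mother.
            next_generation.append((2 * number + 1, True))
        result.append([number for number, _ in next_generation])
        current = next_generation
    return result


# For a female tester: parents, grandparents, great-grandparents.
print(x_ancestors(True, 3))
```

For a female tester the grandparent generation comes back as 5, 6 and 7. Number 4, the paternal grandfather, is missing, which is exactly why a segment attributed to a couple on the father's side has to belong to the wife.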

Adoptees

Some adoptees are lucky and receive close matches immediately. Others, not so much and the search is a long process.

If you’re an adoptee trying to figure out how your matches connect together, use in-common-match groupings to cluster matches together, then paint them in groups.  Utilize the overlapping segments in order to view their trees, looking for common surnames. Always start with the groups with the longest segments and the most matches. The larger the match, the more likely you are to be able to find a connection in a more recent generation. The more matches, the more likely you are to be able to spot a common surname (or two.)

Painting can speed this process significantly.

Much More Than Painting

I hope this tour through my colorful chromosomes has illustrated how much fun analysis can be. You’ll have so much fun that you won’t even realize you’re triangulating, phasing and all of those other difficult words.

If you have something you absolutely have to do, set an alarm – or you’ll forget all about it. Voice of experience here!

So, go and find some segments to paint so all of these exciting things can happen to you too!

How far back will you be able to identify a segment to a specific ancestor?  How about a triangulated segment? An X segment?

Have fun!!! Don’t forget to eat!

PS – If you’d like to learn more about Phasing, Triangulation or hear my keynote speech, consider signing up for the Virtual DNA Conference June 21-24. I’ll be presenting on both of those topics. You can sign in anytime for the next year to listen to the sessions, not just during the conference days. The keynote will be recorded and available afterwards as well.

_____________________________________________________________________

Standard Disclosure

This standard disclosure appears at the bottom of every article in compliance with the FTC Guidelines.

Hot links are provided to Family Tree DNA, where appropriate.  If you wish to purchase one of their products, and you click through one of the links in an article to Family Tree DNA, or on the sidebar of this blog, I receive a small contribution if you make a purchase.  Clicking through the link does not affect the price you pay.  This affiliate relationship helps to keep this publication, with more than 900 articles about all aspects of genetic genealogy, free for everyone.

I do not accept sponsorship for this blog, nor do I write paid articles, nor do I accept contributions of any type from any vendor in order to review any product, etc.  In fact, I pay a premium price to prevent ads from appearing on this blog.

When reviewing products, in most cases, I pay the same price and order in the same way as any other consumer. If not, I state very clearly in the article any special consideration received.  In other words, you are reading my opinions as a long-time consumer and consultant in the genetic genealogy field.

I will never link to a product about which I have reservations or qualms, either about the product or about the company offering the product.  I only recommend products that I use myself and bring value to the genetic genealogy community.  If you wonder why there aren’t more links, that’s why and that’s my commitment to you.

Thank you for your readership, your ongoing support and for purchasing through the affiliate link if you are interested in making a purchase at Family Tree DNA, or one of the affiliate links below:

Affiliate links are limited to:

Family Tree DNA Names 100,000 New Y DNA SNPs

Recently, Family Tree DNA named 100,000 new SNPs on the Y DNA haplotree, bringing their total to over 153,000. Given that Family Tree DNA does the majority of the Y DNA NGS “full sequence” testing in the industry with their Big Y product, it’s not at all surprising that they have discovered these new SNPs, currently labeled as “Unnamed Variants” on customers’ Big Y Results pages.

The surprising part was twofold:

Family Tree DNA single-handedly propelled science forward with the introduction of the Big Y test. They likely have performed more NGS Y chromosome tests than the entire rest of the world combined. Assuredly, they have commercially.

Originally, in the early 2000s, a new SNP wasn’t named until there were three independent instances of discovery. That pre-NGS “rule” didn’t take into account three men from the same family line because very few men had been tested at that point in time, let alone multiple men from the same family. This type of testing was originally only done in an academic environment. A caveat was put into place by Family Tree DNA when they started discovering SNPs that the 3 individuals had to be from separate family lines and the SNP in question had to be verified by Sanger sequencing before being considered for name assignment and tree placement. At that time, they were pushing the scientific envelope.

In recent years, that criterion changed to two individuals. With this new development, the SNP is being named with one reliable occurrence, BUT, the SNP still is not being placed on the tree without two high quality occurrences.

Naming the SNPs early while awaiting that second occurrence allows discussion about the validity of that particular finding. Family Tree DNA was not the first to move to this practice.

Some time ago, two other firms began analyzing the BAM files produced by Family Tree DNA for an additional analysis fee. Those firms began naming SNPs before three occurrences had been documented, a practice which has been well-accepted by the genetic genealogy community. Everyone seems to be anxious to see their SNP(s) named and placed on the tree, although there is little consensus or standardization about the criteria to place a SNP on the tree or the line between high, medium and low quality SNP read results.

The definition of a new haplogroup, meaning a high quality named SNP, is a new branch in the Y tree. Every new SNP mutation has the potential to be carried for many generations – or to go extinct in one or two.

As the industry has matured, SNP naming procedures have evolved too.

How SNP Names Are Assigned

The lab or entity that discovers a SNP gets to name the SNP. That means that their abbreviation is prepended to the SNP number, thereby in essence crediting that entity for the discovery. Clearly, more conservative namers can’t attach their initials to nearly as many SNPs as aggressive namers.

Here’s a list of the naming entities, maintained by ISOGG.

In 2006, the first year that ISOGG compiled a SNP tree, the number of Y DNA haplogroups was 460, including singletons, not tens of thousands. No one would ever have believed this SNP tsunami would happen, let alone in such a short time.

Naming SNPs

Family Tree DNA waiting to name SNPs until 3 were discovered in unrelated family lines, and requiring confirmation by Sanger sequencing allowed the analysis entities to “discover” and name the SNP with their own preceding prefix by implementing less stringent naming criteria. It also increased the possibility of dual naming, a phenomenon that occurs when multiple entities name the same SNP about the same time.

Some people who maintain trees list all of these equivalent SNPs that were named for the exact same mutation, at the same time. Family Tree DNA does not. If the same SNP is named more than once, Family Tree DNA selects one to name the tree branch – in the example below, ZP58. Checking YBrowse, this SNP was also named FGC11161 and ZP56.2.

However, you can see, that SNP ZP58 has several other SNPs keeping it company on the same branch, at least for now.

The FGC SNPs above are only assigned as branch equivalents of ZP58 until a discovery is made that will further divide this branch into two or more branches. That’s how the tree is built.

Sometimes defining a unique SNP is not as straightforward as one would think, especially when utilizing scan technology.

While YFull doesn’t do testing, Full Genomes Corporation does. All of the YFull named SNPs are a result of interpreting BAM files of individuals who have tested elsewhere and naming SNPs that the testing labs didn’t name.

Today, YBrowse, also maintained by ISOGG in conjunction with Thomas Krahn, shows the following three organizations with the highest named SNP totals:

  • Family Tree DNA – BY and L prefixes, (L from before the Big Y test) – 153,902
  • YFull – Y prefix – 133,571 (plus 6447 YP SNPs submitted by citizen scientists for verification)
  • Full Genomes Corporation – FGC prefix – 81,363
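Since the prefix tells you who named a SNP, the list above can be turned into a small lookup. This is only a sketch: the table holds just the prefixes mentioned in this article, the function name `naming_entity` is hypothetical, and the full list of naming entities maintained by ISOGG is much longer.

```python
import re

# Prefixes taken from the list above; this is not the complete ISOGG list.
NAMING_ENTITIES = {
    "BY": "Family Tree DNA",
    "L": "Family Tree DNA",
    "Y": "YFull",
    "FGC": "Full Genomes Corporation",
}


def naming_entity(snp_name):
    """Return the naming entity for a SNP name, or None if the
    prefix is not in the small table above."""
    match = re.match(r"[A-Za-z]+", snp_name)
    if match is None:
        return None
    return NAMING_ENTITIES.get(match.group(0).upper())


print(naming_entity("BY490"))
print(naming_entity("FGC11161"))
```

A name such as ZP58 returns None here simply because its prefix isn't in the table.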

Just because a SNP is named doesn’t mean that it has been placed on the haplotree. Today, Family Tree DNA has just over 14,100 branches on their tree, with a total of 102,104 SNPs (from all naming sources) placed on their tree. That number increases daily as the following placement criteria are met:

  • Read quality confirmed by the lab
  • Two or more instances of the SNP

SNPs Applied to Family History

All SNPs discovered through the Big Y process and named by Family Tree DNA begin with BY, so my Estes lineage is BY490. This mutation (SNP) occurred after the birth of Robert Eastye in 1555, because a descendant of one of his sons carries only BY482 and the descendants of another son carry BY490.

In the pedigree above, kit 166011, to the far right is BY482 and the rest are all BY490, which is one mutation below BY482 on the haplotree.

This means, of course, that the mutation BY490 occurred someplace between the common ancestor of all of these men, Robert Eastye born in 1555, and Abraham Estes born in 1647. All of Abraham’s descendants carry BY490 along with BY482, but kit 166011 does not. Therefore, we know within two generations of when BY490 occurred. Furthermore, if someone descended from one of Abraham’s brothers (Robert, Silvester, Thomas, Richard, Nicholas or John,) represented on this chart by Richard, we could tell from that result if the mutation occurred between Robert and Silvester, or between Silvester and Abraham.

Unnamed Variants Versus Named SNPs

As it turns out, reserving a location for the Unnamed Variants in the SNP tree is much like making a dinner reservation. It’s yours to claim, assuming everyone shows up.

In the case of Unnamed Variants, Family Tree DNA reserved the SNP name and the SNP will be placed on the tree as soon as a second occurrence is discovered and the SNP is entirely vetted for quality and accuracy. Palindromic and high repeat regions were excluded unless manually verified.

While this article isn’t going to delve into how to determine read quality, every SNP placed on the tree at Family Tree DNA is individually evaluated to ensure that it is not being placed erroneously and that a “mutation” isn’t really a misalignment or read issue.

Currently, Family Tree DNA is working their way through the entire haplotree, placing SNPs in the correct location. As you can see, they have more than 100,000 to go and more SNPs are discovered every day.

In the case of the Estes men, you can see their branch placement in the much larger tree.

As we learn more, sometimes branch placements move.

Is Your Unnamed Variant on the List?

ISOGG maintains an index of BY SNPs. BY of course equates to Big Y.

Before using the index, you first need to sign on to your Family Tree DNA account and look at your Unnamed Variants on your Big Y personal page.

If you don’t have any Unnamed Variants, that means all of your Unnamed Variants have already been named. Congratulations!

If you do have Unnamed Variants, click on the position number to take a look on the browser.

This unnamed variant result is clearly a valid read, with almost every forward and reverse read showing the same mutation, all high-quality reads and no “messy” areas nearby that might suggest an alignment issue. You can read more about how to work with your Big Y results in the article, Working With the New Big Y Results (hg38).

Next, go to the ISOGG BY Index page and enter the position number of the variant in the search box – in this case, 13311600.

In this case, 13311600 is not included in the BY Index because YFull already beat Family Tree DNA to the punch and named this SNP.

How do I know that? Because after seeing that there was no result for 13311600 on the ISOGG page, I checked YBrowse.

You can utilize YBrowse to see if an Unnamed Variant has previously been named. You can see the SNP name, Y93760, directly above the left side of the red bar below. The “Y” of course tells you that YFull was the naming entity. (Note that you can click on any image to enlarge.)

YBrowse is more fussy and complex to use than doing the simple ISOGG search. You only need to utilize YBrowse if your Unnamed Variant isn’t listed in the BY ISOGG search tool.

To use YBrowse successfully, you must enter the search in the format of “chrY:13311600..13311600” without the quotation marks and where the number is the variant location, and then click search.
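If you have several Unnamed Variants to check, a tiny helper can build the search string for you. This sketch just repeats the variant position on both sides of the two dots, as described above. The function name `ybrowse_search` is my own, and it only formats the text that you paste into the search box.

```python
def ybrowse_search(position):
    """Build the YBrowse search text for a single Y chromosome position."""
    position = int(position)
    return f"chrY:{position}..{position}"


print(ybrowse_search(13311600))  # chrY:13311600..13311600
```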

The next Unnamed Variant, 14070341, is included in the ISOGG search list, so no need to utilize YBrowse for this one.

To see the new name that this SNP will be awarded when/if it’s placed on the tree, click on the link “BY SNPs 100K.” You’ll see the page, below.

Then, scroll down or use your browser search to find the variant location.

There we go – this variant will be named BY105782 as soon as Family Tree DNA places it on the tree! I’ll be watching!

Where will it be located on the tree, and will it be the new Estes terminal SNP, meaning the SNP that defines our haplogroup? I can’t wait to find out! It’s so much fun to be a part of scientific discovery.

If you’re a male and haven’t taken the Big Y test, it’s on sale now for Father’s Day. You can play a role in scientific discovery too. Does your Y DNA carry undiscovered SNPs?

A big thank you to Family Tree DNA for making resources available to answer questions about their new SNPs and naming processes.

DNAPainter – Mining Vendor Matches to Paint Your Chromosomes

This isn’t quite the same as when my mother used to talk about painting the town, but in genetic genealogy terms, it’s better.

This is the second of 4 articles that will describe how to use DNA Painter.

Today, I’d like to talk about how I utilize the various vendor testing tools combined with DNAPainter to “mine my DNA,” or better put, to mine my ancestors’ DNA which is now mine, pun intended.

To review instructions for how to set up and use the DNA Painter tool, please read DNA Painter – Chromosome Sudoku for Genetic Genealogy Addicts and then come back here to proceed.

I’m going to discuss each vendor’s tools and how I’ve used them, sometimes in combination.

57% Painted

Please note that you can click on any image to enlarge

Is this not a beautiful thing to behold? That’s my ancestors, in loving color, looking back at me, on MY chromosomes.

I’m completely thrilled that I have managed to paint 57% of my chromosomes. I’m a visual person, and while I’ve worked with spreadsheets now for years, I’ve officially abandoned them. Ok, mostly.

Yes, you heard me right – I’ve abandoned the spreadsheets in favor of DNA Painter, at least for segments where I can positively identify an ancestral couple. In other words, those segments that can be reliably mapped.

That 57% is made up of 445 segments in total, split between my maternal and paternal sides. That’s without counting my mother’s DNA. While I do utilize matching to my mother in order to be sure that a match is really a valid match, I didn’t paint her DNA. Obviously, I’m going to match her 100%, and DNA Painter already breaks chromosomes into my pink maternal and blue paternal sides.

Key Elements

  1. The single best thing you can do in order to paint your chromosomes is to have known family members and cousins test. You can then paint their DNA that matches yours, attributing it to their identified family line.
  2. The second best thing you can do is to work with your matches using their trees to identify your common ancestor.

Now, you’re ready to begin painting.

I’m going to step through the process I used at each vendor to identify paintable segments.

I did not paint segments that I could not identify to an ancestral line, except for my endogamous Acadian line which I labeled simply as Acadian to mark those segments that I can identify as Acadian, but I can’t identify a specific ancestor, or ancestors. When I can identify the Acadian ancestor, I paint that segment using the ancestors’ names.

Family Tree DNA

At Family Tree DNA, I begin with my closest matches that are not immediate family – meaning not my parents, children or grandchildren. I’m looking for aunts, uncles, cousins, etc. I don’t paint siblings, but often half siblings are extremely useful because they can help you identify which parental side other matches are related to.

In the first DNA Painter article, I explained how to utilize the Family Tree DNA chromosome browser to select an individual whose matching DNA can be displayed so that you can copy and paste that segment into the painting feature of DNA Painter.

On your results page, your “bucketed individuals” who have been assigned as maternal (pink icon above) or paternal (blue icon not shown) can be a huge clue when used in conjunction with the in-common-with (ICW) tool and the matrix.

You can also search by ancestral surname and then evaluate each match through common surnames, trees and other resources. If you’re not familiar with how to use the tools at Family Tree DNA, here’s a quick run-through.

Select the individual whose DNA you wish to paint, view in the chromosome browser, then copy and paste from the grid below to the DNAPainter tool.

I painted the matching DNA of all the people whose common ancestor with me I could positively identify before moving on to the next vendor.

Who Have I Painted?

As you begin to paint segments from multiple vendors, you may wonder if you’re finding duplicates. It’s easy to tell. At DNA Painter, click on “All segment data,” below the legend in the bottom right corner.

This displays the entire list of matches whose DNA you have painted, in spreadsheet format. You can sort by match name or simply do a browser search. (CTRL+F)

You can also download this data into a csv (Excel compatible) file at the top left of this page.

Avoiding Duplicates

As you view and paint your matches at the various vendors, you may discover that you have already found a match with that person at another vendor, either because they tested there or uploaded their autosomal file. When possible, avoid duplicate painting. It won’t help anything and will just clutter your chromosomes. You may not always be able to identify a match as a duplicate, especially if the tester utilizes a pseudonym at various locations. Don’t worry though, because you can always easily delete it later and a duplicate person/segment certainly won’t hurt anything.

Ok, now to our next vendor! Let’s find more segments to paint.

MyHeritage

At MyHeritage, click on DNA matches.

At the right of the search box, fly over the little pink key (or funnel) looking thing and you’ll see the option for “Has Smart Matches.” That’s what you’re looking for.

Click on the key icon.

Smart Matches mean that your DNA matches and you have a common ancestor in your trees. Click on the purple button to review this DNA match.

For each match, scroll all the way down to the bottom where your matching chromosome segments will be colored.

At the right, above the chromosome browser, click on “advanced options” which will allow you to select “download shared DNA info.” You need to download to your system so that you can copy and paste the matching segment information to DNA Painter.

MyHeritage has a few more columns than necessary, and DNA Painter can’t utilize them. Delete the columns for Name, Match Name, RSID beginning and end, and also eliminate SNPs due to an overestimation issue. In many cases, the SNP count reported at MyHeritage is twice or more the number of SNPs reported for the same segment at other vendors.

Now that your segment is cleaned up, copy the entire group shown above, minus the yellow columns which you’ve deleted, and paste into the DNA Painter spreadsheet.
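If you'd rather not delete columns by hand each time, the same cleanup can be sketched in a few lines of Python. Please treat the column headers below as assumptions. I've written them the way they might appear in a download, so check them against your own file before relying on this. The sample row is made up purely for illustration.

```python
import csv
import io

# Assumed header names for the columns to drop; adjust to match your file.
COLUMNS_TO_DROP = {"Name", "Match name", "Start RSID", "End RSID", "SNPs"}


def strip_columns(csv_text, drop=COLUMNS_TO_DROP):
    """Return the CSV text with the unwanted columns removed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    keep = [name for name in reader.fieldnames if name not in drop]
    output = io.StringIO()
    writer = csv.DictWriter(
        output, fieldnames=keep, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return output.getvalue()


# Made-up sample data, not a real match.
sample = (
    "Name,Match name,Chromosome,Start Location,End Location,"
    "Start RSID,End RSID,Centimorgans,SNPs\n"
    "Me,Cousin,5,1000000,9000000,rs1,rs2,12.5,9000\n"
)
print(strip_columns(sample))
```

What remains is the chromosome, start, end and centimorgan columns, ready to copy and paste.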

MyHeritage has recently added a triangulation feature, shown at the far right, below, indicating that these two people individually triangulate with me and Alberta. The icon at far right of “5th cousin” indicates triangulation.

By clicking on the triangulation icon, you then see how that person triangulates with both your match and you – in this case, me, Alberta, and Chandler.

You may choose to paint triangulated segments, BUT, the size of the triangulated segment is often going to be smaller than the amount of DNA that you match individually with either one or both people.

In the example above, you can see that you match the pink person on a significantly longer segment than you match the tan person. The amount of DNA where you match both the pink and tan person is smaller yet, because the area where you match the tan person extends beyond where you match the pink person and vice versa. If you were going to paint ONLY the triangulated segments, you would paint only the portion that is both pink and tan, “boxed” above.
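The "boxed" portion is just the overlap of the two matching ranges on the same chromosome. Here's a minimal sketch of that arithmetic, with made-up start and end positions and a hypothetical function name. Keep in mind that overlap alone is not triangulation; the two matches also have to match each other on that stretch, which is what the vendor's triangulation icon is telling you.

```python
def shared_portion(segment_a, segment_b):
    """Return the (start, end) range where two segments on the same
    chromosome overlap, or None if they don't overlap at all."""
    start = max(segment_a[0], segment_b[0])
    end = min(segment_a[1], segment_b[1])
    return (start, end) if start < end else None


pink = (10_000_000, 80_000_000)   # made-up positions
tan = (50_000_000, 120_000_000)   # made-up positions
print(shared_portion(pink, tan))  # (50000000, 80000000)
```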

I don’t recommend painting ONLY triangulated segments, because you’ll be depriving yourself of the ability for each person to match others on the portions of the segments on which they match you, but not the other person in question.

In this example, utilizing DNA Painter, you’ll see that people in fact match you AND the pink person on several segments. The segment shown in pink, at MyHeritage, above, is shown on chromosome 5 in DNA Painter as the long mustard colored segment. Look at how many people match you on that segment. This is why we don’t paint only the triangulated portions of the chromosome. That long mustard segment match will triangulate with many people on smaller portions of that mustard segment, as evidenced by the yellow, grey, blue, cinnamon, purple and red segment matches.

DNA Painter helps you triangulate, so there is no reason to restrict your painting to triangulated segments.

Triangulation is a great tool, but don’t mix triangulated segments with matching segments in the same profile, at least not until you get the hang of the tool and using the multiple vendors’ results.

23andMe

Unfortunately, 23andMe doesn’t have tools like tree matching (MyHeritage) or maternal/paternal phasing (Family Tree DNA,) but they do allow testers to enter common surnames.

Looking at closer matches, meaning first, second or third cousins, if they list even a few surnames, you may well be able to identify the common genealogical line, especially in conjunction with ancestral locations and the other people you match in common.

Sometimes you can glean enough information to identify your common ancestor. In this case, even if I didn’t know Cheryl, the surname would have identified the ancestor. If that didn’t do it, the “in common” list below would!

Once you’ve identified the common ancestor and decide you’re ready to paint, click on the Tools tab at the top of your page and select DNA Relatives.

On the DNA Relatives tab, click on the relative whose DNA you wish to paint. I’m selecting my cousin, Cheryl.

Click on the blue DNA Comparison, in the upper right hand corner.

On the comparison screen, you will select yourself as one person and Cheryl as the other.

At the top you’ll see the two individuals and their overlapping segments painted onto chromosomes. Scroll down and you’ll see the segment detail, below.

Highlight the rows (they’ll turn blue, like above) and right click to copy the segment information.

The next step is to drop the results into a spreadsheet, just long enough to delete the first and last columns, shown in red below, then copy the remaining rows and paste into the DNA Painter tool.

Mining Ancestry Data at GedMatch

GedMatch is somewhat of a special case, because GedMatch doesn’t do DNA testing, but provides an open sharing platform by facilitating uploads of raw autosomal files from multiple other vendors. Therefore, anyone with results at GedMatch tested elsewhere. If you tested at all of the other vendors, it’s probable that you’ll find people at GedMatch who also match you at those vendors.

Because 23andMe does not support the uploading of Gedcom files, if your match has uploaded a Gedcom file to GedMatch, or connected to Geni or WikiTree, then you may be able to identify your common ancestor at GedMatch that you were not able to identify at 23andMe.

Conversely, if you match at Ancestry, you won’t be able to paint from Ancestry, because Ancestry does not provide segment information. We will talk about Ancestry as a special case next, but for now, let’s focus on how to utilize GedMatch.

At GedMatch, you’ll work in steps after setting your account up and uploading your raw data file from either:

If you tested elsewhere, or after August of 2017 at 23andMe, you will have to upload to a special section called GedMatch Genesis. GedMatch Genesis provides a sandbox area for files other than the ones listed above that are generally incompatible with those files and with each other. Genesis files often have few SNP locations in common and not enough to match reliably.

I do not recommend DNA painting utilizing segments from GedMatch Genesis.

GedMatch is currently merging their regular GedMatch service with the Genesis service, so I’m not entirely clear how you will tell the difference between the kits known to match reliably, mentioned above, and others after the merge.

Currently, kits with T prefix (Family Tree DNA), A (Ancestry) and M (23andMe) show version levels in the type field when you match in regular GedMatch. MyHeritage kits are processed by the Family Tree DNA lab. G kits used a generic upload, so you can’t tell where they originated.

Kits uploaded in the Genesis sandbox seem to be assigned double alpha letter kit prefixes at random. Genesis includes a “Testing Company” field which does not include a version number. Today, just stay with the regular GedMatch one-to-many and one-to-one matching for DNA Painter.

First, you’ll want to perform a one-to-many match.

This page shows your closest 2000 results, which in my case truncates my matches at 12.7 cM. This means if I want to see my results below 12.7 cM, I must subscribe to the Tier 1 Utilities in order to be able to display over 2000 matches.

We’ll discuss how to utilize Tier 1 matching in the Ancestry portion, next, but for now, we’ll just be working with the regular one-to-many matches report.

Of course, trusty cousin Cheryl has results here as well.

In order to compare Cheryl’s results to my own, I need to do two separate things:

  • Click on the A link under the Autosomal Details column (above) and/or
  • Click on the X link under the X DNA column

These two results, both of which are paintable, do not display together so must be selected separately.

By clicking on the A or X, GedMatch will display a one-to-one comparison. I leave this page (below) at the default values and simply click submit.

Your next screen will be a match grid.

Once again, select and copy the results, then paste into DNA Painter. If you also have an X match with this individual, return to the one-to-many match page and then click on the X link to repeat the same process for the X chromosome.

Ancestry Through GedMatch

As far as I’m concerned, the best thing about Ancestry matches is DNA shared ancestor hints (SAH) – meaning those green leaves visible near the green “view match” button which indicate that you share both DNA and a common ancestor(s) in your trees.

Followed immediately by the worst thing which is that Ancestry provides no segment data. However, pairing Ancestry with GedMatch can provide you with some segment information, although you do have to dig. That digging was certainly worthwhile for me, as I found several readily identifiable matches.

When I find a green leaf shared ancestor hint at Ancestry, I record as much information about that match as I can in a spreadsheet. I do this for three reasons.

  • Ancestry hints tend to come and go, rather inexplicably, and I want to have that information someplace besides at Ancestry.
  • I want to be able to view how many matches I have through specific ancestors, which I can do in a spreadsheet by sorting.
  • I want to be able to mine GedMatch for segment information for people at Ancestry who have uploaded to GedMatch.

Note the RJE V2 results, a 6th cousin who I match at 6.6 cM, as we’ll be using that at GedMatch.

I maintain several columns in my Ancestry Match spreadsheet, as shown above. I track people who might be good Y or mitochondrial DNA candidates, as well as GedMatch numbers or other useful information.

I don’t utilize segments smaller than 7 cM for DNA Painter, BUT, Ancestry almost always under-reports the matching segment size due to their internal process which removes some segments that do match. Therefore, I search for all Ancestry matches in GedMatch and paint them if they are 7cM or over at GedMatch. You will match at Ancestry down to 6 cM. Since 7cM is the default GedMatch threshold, that works out well. I don’t find them if they are under 7cM at GedMatch, and I don’t care.
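The 7 cM rule is easy to express as a small filter. This is only a sketch in Python. The segment rows and their layout are invented for illustration and are not the actual GedMatch one-to-one output format.

```python
# Hypothetical one-to-one segment rows: (chromosome, start, end, cM).
# All values are invented for illustration only.
segments = [
    ("1", 1_200_000, 9_800_000, 12.4),
    ("7", 55_000_000, 58_100_000, 6.6),
    ("X", 10_000_000, 21_000_000, 7.0),
]

MIN_CM = 7.0  # the painting threshold used in this article


def paintable(rows, min_cm=MIN_CM):
    """Keep only the segments at or above the minimum size in cM."""
    return [row for row in rows if row[3] >= min_cm]


kept = paintable(segments)
for chrom, start, end, cm in kept:
    print(f"chr{chrom}\t{start}\t{end}\t{cm} cM")
```

The 6.6 cM segment is dropped, while the 12.4 cM and 7.0 cM segments are kept for painting.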

In my case, the free one-to-many GedMatch tool reaches its 2000 match limit at 12.7 cM, so to obtain segments smaller than that I need to utilize the Tier 1 subscription utilities, which are well worth every dollar.

The one-to-many match looks quite different for the Tier 1 tool.

You’ll need to play with this a bit to determine how high you need to set the limit to see all of your 7cM matches. In my case, I had to set it to 20,000.

I utilize two monitors, so I display my Ancestry spreadsheet on the first monitor and the GedMatch one-to-many match table on the second monitor.

Then, utilizing the browser’s search function, I search for any identifiable portion of the information for the Ancestry match at GedMatch.

In the first example, the user’s name is RJE V2. I search at GedMatch for “RJE” using “ctrl+F” which is the browser’s find function.

You can see that the search found a total of 3 different “RJE” entries. Looking at the first 2, you can see that one is labeled V4 and one is labeled V2. Typically, I would look at this and decide that the RJE V2 is the right match based on the user name at Ancestry.

However, look closer.

The RJE V2 at GedMatch has a much higher amount of shared DNA at 3587.1 cM total than the RJE V2 at Ancestry with a total of 6.6 cM. Clearly, this is not the same person, even though the user name is the same.

For all we know, a different person may have used the same user name, which is clearly an alias, noted by the “*”. Or the same person may have multiple kits at GedMatch.

However, in this case, the RJE V2 is not the same match.
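The sanity check I just performed by eye can be sketched in a few lines of Python. The function name and the tolerance value are my own assumptions for illustration. Neither Ancestry nor GedMatch publishes a rule like this, and because Ancestry tends to under-report, some difference between the two totals is expected.

```python
def plausible_same_person(ancestry_total_cm, gedmatch_total_cm, max_ratio=3.0):
    """Rough check that two shared-cM totals could describe the same match.

    max_ratio is an arbitrary illustrative tolerance, not a published rule.
    """
    low, high = sorted([ancestry_total_cm, gedmatch_total_cm])
    if low <= 0:
        return False
    return high / low <= max_ratio


# Totals quoted in this article for the user name RJE V2.
ancestry_total = 6.6
gedmatch_total = 3587.1

print(plausible_same_person(ancestry_total, gedmatch_total))  # False
```

A same-name entry whose total is hundreds of times larger fails the check, which is the cue to keep looking or to move on.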

Now, let’s suppose instead that it is the same person and we’ve been able to reasonably identify the match. In order to compare one-to-one, click on the highlighted blue “largest segment” in the autosomal category, shown below.

If you want to compare the X one-to-one, click on the blue largest segment in that column.

From this point, the matching will look the same as the one-to-one GedMatch matching shown in the previous section – so copy and paste as normal.

While this certainly isn’t the most effective way of working with Ancestry matches, it’s really the only hope we have, unless your match has also uploaded to either Family Tree DNA or MyHeritage.

However, in my experience, I generally stand a better chance of identifying Ancestry matches at GedMatch because their user name or the user name of the person managing their account can be found much more readily. People sometimes tend to utilize the same abbreviations, names or nicknames in multiple locations.

Summary

While each vendor has unique strengths and weaknesses today, and GedMatch provides a platform used by some but not all, the best way to effectively paint your chromosomes is to utilize all of the tools available, and sometimes together. I strongly suggest that you test at or upload to each vendor, because you will find matches at each vendor that aren’t elsewhere.

How many segments can you paint on your chromosomes, and what will those segments tell you?

In the next article, I’ll be walking through my chromosome painting gallery to take a look at the hidden messages there! I hope you’ll come along so you can find some hidden messages of your own.

Enjoy!

_____________________________________________________________________

Standard Disclosure

This standard disclosure appears at the bottom of every article in compliance with the FTC Guidelines.

Hot links are provided to Family Tree DNA, where appropriate. If you wish to purchase one of their products, and you click through one of the links in an article to Family Tree DNA, or on the sidebar of this blog, I receive a small contribution if you make a purchase. Clicking through the link does not affect the price you pay. This affiliate relationship helps to keep this publication, with more than 900 articles about all aspects of genetic genealogy, free for everyone.

I do not accept sponsorship for this blog, nor do I write paid articles, nor do I accept contributions of any type from any vendor in order to review any product, etc. In fact, I pay a premium price to prevent ads from appearing on this blog.

When reviewing products, in most cases, I pay the same price and order in the same way as any other consumer. If not, I state very clearly in the article any special consideration received. In other words, you are reading my opinions as a long-time consumer and consultant in the genetic genealogy field.

I will never link to a product about which I have reservations or qualms, either about the product or about the company offering the product. I only recommend products that I use myself and bring value to the genetic genealogy community. If you wonder why there aren’t more links, that’s why and that’s my commitment to you.

Thank you for your readership, your ongoing support and for purchasing through the affiliate link if you are interested in making a purchase at Family Tree DNA, or one of the affiliate links below:

Affiliate links are limited to:

DNAGedcom Client

DNAGedcom provides an incredibly cool tool that has helped me immensely with my genealogy research, particularly at Ancestry and Family Tree DNA. This tool doesn’t replace what Ancestry and Family Tree DNA provide, but augments the functionality significantly.

I’ve been frustrated for months by the broken search function at Ancestry, and the DNAGedcom tool allows you to bypass the search function entirely by downloading the direct line ancestral information for all of your matches. So let’s use my Ancestry account as an example.

Utilizing DNAGedcom

After installing the DNAGedcom tool on your system, sign on to your Ancestry account through the tool. The tool downloads all of your matches, the people you match in common with them, and the ancestors in your matches’ trees.

The best part about this is that the results are then in a spreadsheet file that you can simply sort utilizing normal spreadsheet functions. I wrote about using spreadsheets for genetic genealogy in the article, Concepts – Sorting Spreadsheets for Autosomal DNA.

In my case, this means I can see everyone who I match that has an Estes, or any other surname, in their tree. I don’t have to look at my matches’ trees one at a time.

You can read about this very cool tool at this link, including how to subscribe for either $5 per month or $50 per year. Many functions at DNAGedcom are free, but the Ancestry tool is available through a minimal subscription which helps to support the rest of the site.

After subscribing, the DNAGedcom client will become available to you on your subscriber page at DNAGedcom.

Please note that you can click to enlarge any image.

After you subscribe, you’ll see the link for the Ancestry download tool, along with other resources.

You will want to follow the installation directions, exactly, to download the DNAGedcom client onto your PC or Mac in preparation for downloading your Ancestry match information onto your system. This is painless and goes quickly.

Next, you will be prompted to sign in to both DNAGedcom and Ancestry, through the tool, and then you will be prompted for three separate steps at Ancestry:

  • Gather Matches – took about 10 minutes
  • Gather Trees – let’s just say you might want to run this one overnight, and on a directly connected system, not wifi. Mine was about 25% complete at the 2 hour mark
  • Gather ICW – another several hours, but you can do other things on your system at the same time

The downloaded files will be stored on your computer as .csv files. On my PC, the default location was in the Documents directory and the files are named as follows:

  • a_Roberta_Estes (the ancestors of my matches)
  • icw_Roberta_Estes (the people I match and who I match in common with them)
  • m_Roberta_Estes (information about the match, such as cMs, etc.)

It’s important to make a note of this, as I didn’t find the file names documented elsewhere.

The good news is that even though these steps take a long time, having all of this information in a place where you can sort it and use it effectively is extremely useful. You can run the various steps at night or when you aren’t otherwise using your system.

In addition, if someone is sharing their DNA results with you on Ancestry (which they can under the settings gear), you can download the same data for their account – and then you can look for commonalities between groups of results using the DNAGedcom Match-O-Matic tool, also described in the introductory document.

Using the Downloaded Files

Personally, what I wanted to do was to search for all occurrences of a particular surname. Fortunately, it was Claxton or Clarkson, not Smith.

Simply using Excel (after saving the results file in Excel format), I was able to quickly sort for these surnames, an example shown below. Hmmm, I wonder if Claxon is relevant too. I never considered that possibility – nor would I have ever seen Claxon in a surname search, because I wouldn’t have searched for Claxon.
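If you prefer a script to a spreadsheet sort, the same surname hunt can be sketched in Python. The sample rows and the column names (match_name, surname, given_name) are invented. The real columns in the DNAGedcom ancestors file may well differ, so check your own file first.

```python
import csv
import io
import re

# Invented sample standing in for the ancestors file; the real column
# names in the DNAGedcom download may differ.
sample = io.StringIO(
    "match_name,surname,given_name\n"
    "Match A,Claxton,James\n"
    "Match B,Smith,John\n"
    "Match C,Clarkson,Mary\n"
    "Match D,Claxon,William\n"
)

# Matches Claxton, Clarkson and Claxon, ignoring case.
pattern = re.compile(r"^cla(x|rks)t?on$", re.IGNORECASE)

hits = [
    row
    for row in csv.DictReader(sample)
    if pattern.match(row["surname"].strip())
]

for row in hits:
    print(row["match_name"], row["surname"])
```

A pattern like this picks up spelling variants, such as Claxon, that an exact surname search would miss.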

I’m brick walled on the Claxton line in Russell County, Virginia in about 1799. My ancestor, James Lee Claxton, was born someplace in Virginia about 1775. Utilizing Y DNA, we know of another man, also named James Claxton, born about 1750 first found in Granville and Bertie County, NC, who sired an entire lineage of Claxtons who migrated to Bedford County, TN.  However, that James is not the father of my ancestor, because that James had a different son named James. Other than these two distinct groups, we can’t seem to match with anyone else who has tested their Y DNA at Family Tree DNA, so my hope, for now, is an autosomal match with a known Claxton line out of Virginia.

(Shameless plug – if you are a Claxton or Clarkson male, please test your Y DNA at Family Tree DNA and join the Claxton DNA project. If you have Claxton or Clarkson ancestry from any line, and have taken the Family Finder test or transferred autosomal results from another vendor, please join the Claxton/Clarkson DNA project at Family Tree DNA. If you have Claxton or Clarkson ancestry and haven’t yet DNA tested, please do.)

Therefore, my goal is to find matches to other Claxton or Clarkson individuals who don’t share a known common ancestor with me. Because we don’t share a known common ancestor, of course, these people would never be shown as an Ancestry green leaf “DNA+tree match,” nor is there another way for me to obtain a surname list like this at Ancestry.

After finding Claxton candidates, then I can refer to the other downloaded files or sign on to my account at Ancestry to look at the match itself and other ICW matches. Hopefully, some of my matches will also match some of my Claxton cousins as well, which would suggest that the match might actually be through the Claxton line.
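The cross-check against known cousins amounts to a set intersection. Here is a minimal Python sketch, assuming the ICW file has already been loaded into a dictionary. All of the names and the data layout are hypothetical.

```python
# Hypothetical data: match name -> set of people also matched in common (ICW).
icw = {
    "Match A": {"Cousin 1", "Match X"},
    "Match C": {"Match Y"},
    "Match D": {"Cousin 2", "Cousin 1"},
}

known_claxton_cousins = {"Cousin 1", "Cousin 2"}
surname_candidates = ["Match A", "Match C", "Match D"]


def shared_with_known_cousins(candidates, icw_map, cousins):
    """Map each candidate to the known cousins they also match, skipping empties."""
    result = {}
    for name in candidates:
        overlap = icw_map.get(name, set()) & cousins
        if overlap:
            result[name] = sorted(overlap)
    return result


supported = shared_with_known_cousins(surname_candidates, icw, known_claxton_cousins)
for name, cousins in supported.items():
    print(name, "also matches", ", ".join(cousins))
```

An overlap only suggests that the match might be through the Claxton line. It doesn’t prove it, since two people can share more than one ancestral line.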

The DNAGedcom client also downloads the same type of information from 23andMe, which isn’t nearly as useful without trees, as well as from Family Tree DNA.

Thanks so much to www.dnagedcom.com.

_____________________________________________________________________

Working with the New Big Y Results (hg38)

If you are a Family Tree DNA customer, and in particular, a male or manage male kits, you’re familiar with the Big Y test.

The Big Y test scans the entire gold standard region of the Y chromosome, hunting for mutations, called SNPs, that define your haplogroup with great precision. This test also discovers SNPs never before found.  Those newly discovered SNPs may someday become new haplogroup branches as well. The Big Y test is how the Y DNA phylotree has been expanded from a few hundred locations a few years ago to more than 78,000, and along with that comes our understanding of the migration patterns of our ancestors.

We’re still learning, every single day, so testing new people continues to be important.

The Big Y is the logical extension of STR testing (panels 37, 67 and 111), which focuses on genealogical matches, closer in time, instead of haplogroup era matches. STR locations mutate more rapidly than SNPs, so the STR test is more useful for genealogists, or at least represents an entry point into Y DNA testing. SNPs generally reach further back in time, showing us where our ancestors were before STR test results kick in.  More and more, those two tests have some time overlap as more SNPs are discovered.

If you want to read more, I wrote about this topic in the article, “Why the Big Y Test?”.  Ignore the pricing information at the end of that article, as it’s out of date today.

Before we talk about the new format of the Big Y results, let’s take a step back and look at the multiple reasons why Family Tree DNA created a new Big Y experience.

The first reason is that the human reference genome changed.

What is the Human Reference Genome?

The Human Reference Genome is a genetic map against which everyone else is compared.  In essence, it’s an attempt to give every location in our genome an address, and to have them all line up on streets where they belong on a nice big chromosome by chromosome grid.

That’s easier said than done.  Let’s look at why and begin with a little history.

Hg refers to the human reference genome and 38 is the current version number, released in December of 2013.

The previous version was hg19, released in February of 2009.

This seems like a long time ago, but each version requires extensive resources to convert data from previous versions to the newer version.  Different versions are not compatible with each other.

You can read more about this here, here, here and here, if you really want to dig in.

Hg19, the version that we’ve been using until now, was based only on 13 anonymous volunteers from Buffalo, New York. Hg38 uses far more samples and resequences previously sequenced results as well. We learned a lot between 2009 when the previous version, hg19, was released and 2013 when hg38 was released.

Keeping in mind that people are genetically far more alike than different, sequencing allows most of the human genome to be mapped when the genomes of those reference individuals are compared in layers, stacked on top of each other.

The resulting composite reference map, regardless of the version, isn’t a reflection of any one person, but a combination of all of those people against which the rest of us are compared.

Areas of high diversity, in this case, Y SNPs, may differ from each other. It’s those differences that matter to us as genealogists.

In order to find those differences, we must be able to line up the genomes of the various people tested, on top of each other, so that we can measure from the locations that are the same.

Here’s an example.  All 4 people in the table above match exactly on locations 1-7, 9-10 and 13-15.

Locations 8, 11 and 12 are areas that are more unstable, meaning that the people do not all share the same value at those locations, and the ones who differ do not necessarily match each other either, hence the different colored cells.

From this model, we know that we can align most people’s results on the green locations where everyone matches everyone else because we are all human.

The other locations may be the same or different, but they can’t be aligned reliably by relying on the map. You can read more about the complexity of this topic here and a good article, here.
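To make the stacking idea concrete, here is a toy Python sketch that separates the locations where everyone agrees from the ones where they differ. The four 15-letter sequences are invented so that they mirror the example, with differences only at locations 8, 11 and 12. Real alignment software is far more involved than this.

```python
# Invented 15-location sequences for four people; only locations
# 8, 11 and 12 (counting from 1) differ between them.
people = {
    "Person 1": "ACGTACGTACGTACG",
    "Person 2": "ACGTACGCACGTACG",
    "Person 3": "ACGTACGTACAGACG",
    "Person 4": "ACGTACGGACTTACG",
}


def split_locations(sequences):
    """Return (stable, variable) lists of 1-based locations."""
    stable, variable = [], []
    for location, column in enumerate(zip(*sequences), start=1):
        if len(set(column)) == 1:
            stable.append(location)
        else:
            variable.append(location)
    return stable, variable


stable, variable = split_locations(list(people.values()))
print("Everyone matches at:", stable)
print("Differences found at:", variable)  # [8, 11, 12]
```

The stable locations are the anchors used to line everyone up, and the variable ones are where the genealogically interesting differences live.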

A New Model

The challenge is that between 2009 and 2013, new locations were discovered in previously unmapped areas of the genome.

Think of genome locations as kids sitting in assigned seats side by side in a row.

Where do we put the newly discovered kids?

They have to crowd in someplace onto our existing map.

We have to add chairs between locations. The white rows below represent the newly discovered locations.

When we add chairs, the “addresses” of the kids currently sitting in chairs will change.  In fact, the address of everyone on the street might change because everyone has shifted.  Many of the actual kids will be the same, but some will be new, even though all of the kids will be referenced by new addresses.
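The chair analogy translates directly into a few lines of Python. The kid names and seat positions are made up. The point is simply that inserting new seats changes the address of everyone sitting further down the row.

```python
# Made-up seating: the old row, and the new row after two chairs are added.
old_row = ["Ann", "Ben", "Cal", "Dee"]
new_row = ["Ann", "New kid 1", "Ben", "Cal", "New kid 2", "Dee"]


def address_changes(old, new):
    """Map each kid from the old row to (old address, new address), 1-based."""
    new_address = {kid: seat for seat, kid in enumerate(new, start=1)}
    return {
        kid: (seat, new_address[kid])
        for seat, kid in enumerate(old, start=1)
        if kid in new_address
    }


changes = address_changes(old_row, new_row)
for kid, (before, after) in changes.items():
    note = "unchanged" if before == after else "moved"
    print(f"{kid}: seat {before} -> seat {after} ({note})")
```

Ann keeps seat 1, but Ben, Cal and Dee all get new addresses even though they are the same kids as before.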

This is a very simplified conceptual explanation of a complex process which isn’t simple at all.  In addition to addressing, this process has to deal with DNA insertions, deletions, STR markers which are repeats of segments, palindromic mutations as well as pseudo-autosomal regions of the Y chromosome. Additionally, not all reads or calls are valid, for a number of reasons. Due to all these factors, after the realignment is complete, analysis has to follow.

Suffice it to say that converting from one version to the next requires the data to be reanalyzed with a new filter which requires a massive amount of computational power.

Then, the wheat has to be sorted from the chaff.

Discovery

The conversion to hg38 has been a boon for discovery, already.  For example, Dr. Michael Sager, “Dr. Big Y” at Family Tree DNA has been busily working through the phylotree to see what the new alignment provides.

In November, he mentioned that he had discovered correct placement for a new haplogroup, high in the R1b tree, that joined together several subclades of U106.

In hg19, U106 had 9 subclades, all of which then branched downwards.

However, in hg38, utilizing the newly aligned genome, Michael can see that U106 has been reconfigured and looks like this instead.

Look at the difference!

  • Two new haplogroups have been placed in their proper location in the tree; Z2265 and BY30097.
  • A2150 has been repositioned.
  • Because of the placement of A2150 and Z2265, U106 now only has two direct branches.
  • S19589 has been moved beneath Z2265
  • The remaining 7 peach colored haplogroups in the old tree are now subclades of BY30097.

You may not know or realize that this shuffle occurred, but it has and it’s an important scientific discovery that corrects earlier versions of the phylotree.

Congratulations Dr. Sager!

So, how does the conversion to hg38 affect customers directly?

The Conversion

In or about October 2017, Family Tree DNA began their conversion to hg38. Keep in mind that no other vendor has to do this, because no other vendor provides testing at this level for Y DNA, combined with matching.

Not only that, but there is no funding for their investment in resources to do the conversion.  By that I mean that once you purchase the product, there is no annual subscription or anything else to fund development of this type.

Additionally, Family Tree DNA designed a new user interface for the enhanced Big Y which includes a new Big Y browser.

The initial conversion has been complete for some time, although tweaking is still occurring and some files are being reconverted when problems are discovered.  Now, the backlog of tests that accumulated during the conversion and during the holiday sale are being processed.

So, what does this mean to the consumer?  How do we work with the new results?  What has changed and what does all of this mean?

It’s an exciting time. We’re all waiting for new matches.

I’m going to step through the features and functions one at a time, explaining the new functionality and then what is different, and why.

First Look

On your personal page, you have Big Y Results and Big Y Matches.

Either selection takes you to the same page, but with a different tab highlighted.

Named Variants

Named variants are SNPs that are already known and have been given SNP names.

At the bottom of the page, you can see that this person has 946 SNPs out of 77,722 currently on the tree.  Many SNPs on the tree are equivalent to each other.

The information about each SNP on this page shows that it’s derived, meaning it’s a mutation and not ancestral which is the original state of the DNA.

If you look closely, you’ll see that some of the Reference and Genotype values are the same.  You would logically expect them to be different.  These are genuine mutations, but they are listed as the same because in hg19, the reference model, which is a composite, is skewed towards haplogroup R.  In haplogroup R, these values are the same as the person tested (who is R-BY490), so while these are valid mutations on the tree of humanity, they are derived and found in all of haplogroup R. The same thing happens to some extent with all haplogroups because the reference sequence is a composite of all haplogroups.

The next column indicates whether the SNP has or hasn’t yet been placed on the Y tree.

The Reference column refers to the value at this address shown in the hg38 reference model, and the Genotype column shows the tester’s result at that location.

The confidence column shows the confidence level that Family Tree DNA has in this call. Let’s talk about confidence levels for a minute, and what they mean.

Confidence Levels

The Big Y test scans the Y chromosome, looking for specific blips at certain addresses.  Every location has a “normal” blip for the Y chromosome as determined by the reference model.  Any blips that vary from the reference model are flagged for further evaluation.

Blips can be caused by a mutation, a read error or a complex area of DNA, which is why there is a threshold for a minimum number of scans to find that same anomaly at any single location.

The area considered the “gold standard” portion of the Y chromosome which is useful genealogically is scanned between 55 and 80 times.  Then the scans are aligned and compared to each other, with the blips at various locations being reported.

The relevance of blips can vary by location and what is known as density in various regions.  In general, blips are not considered to be relevant unless they are recorded a minimum of 5 to 8 times, depending on the region of the Y chromosome.  At that level, Family Tree DNA reports them as a medium confidence call. High confidence calls are reported a minimum of 10 times.
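As a rough illustration, the read-count thresholds just described could be sketched in Python like this. The function and parameter names are mine, and the real confidence value is a statistical calculation that also weighs the quality of the region, so treat this as a simplification rather than Family Tree DNA’s actual method.

```python
HIGH_MIN_READS = 10  # high confidence: the blip is seen at least 10 times


def confidence(derived_reads, medium_min_reads=5):
    """Classify a call from the number of reads showing the blip.

    medium_min_reads is 5 to 8 depending on the region of the Y chromosome.
    Returns "high", "medium", or None when the call would not be reported.
    """
    if not 5 <= medium_min_reads <= 8:
        raise ValueError("medium_min_reads should be between 5 and 8")
    if derived_reads >= HIGH_MIN_READS:
        return "high"
    if derived_reads >= medium_min_reads:
        return "medium"
    return None


print(confidence(12))                     # high
print(confidence(6))                      # medium
print(confidence(6, medium_min_reads=8))  # None
```

The same 6 reads can be a medium confidence call in one region and go unreported in a stricter one.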

Some individuals and third-party companies read the BAM files and offer analysis, often project administrators within haplogroup projects.  Depending on the circumstances, they may suggest that as few as 2 blips are enough to consider the blip a mutation and not a read error.  Therefore, some third-party analysis will suggest additional haplogroups not reported by Family Tree DNA. Project administrators often collaborate with Dr. Sager to coordinate the placement of SNPs on the tree.

Therefore, at Family Tree DNA:

  • You will see only medium and high confidence calls for SNPs.
  • Over time, your Unnamed Variants will disappear as they are named and become Named Variants with SNP names.
  • When Unnamed Variants become Named Variants, which are SNPs that have been named, they are eligible to be added to the Y tree.
  • If the SNP added to the Y tree is below your present terminal SNP, you may one day discover that you have a new terminal SNP, meaning new haplogroup, listed on your main page. If the new SNP is within 5 upstream of your terminal SNP, looking backward up the tree, you’ll see it appear in your mini-tree on your personal page and on your larger Haplogroup and SNP page.

Unnamed Variants

Unnamed variants are newer mutations that have not yet been named as SNPs.

In order for a mutation to be considered a SNP, in true genetics terms, it has to be found in over 1% of the population.  Otherwise, it’s considered a private, personal, family or clan mutation.

However, in reality, Family Tree DNA attempts to figure out which SNPs are being found often enough to warrant the assignment of a SNP number which means they can be placed on the haplotree of humanity, and which SNPs truly are going to be private “family mutations.”  Today, nearly all mutations found in 3 or more individuals that are considered high confidence calls are named as SNPs.
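The “3 or more individuals” rule of thumb can be sketched as a simple count across kits. This Python example uses invented kit names and positions, and it is only an illustration of the idea, not Family Tree DNA’s actual naming process.

```python
from collections import Counter

# Invented data: kit -> {Y chromosome position: confidence of the call}.
kits = {
    "kit1": {1001: "high", 2002: "high"},
    "kit2": {1001: "high", 2002: "medium"},
    "kit3": {1001: "high", 3003: "high"},
    "kit4": {2002: "high"},
}


def naming_candidates(kit_calls, min_individuals=3):
    """Positions with a high confidence call in at least min_individuals kits."""
    counts = Counter(
        position
        for calls in kit_calls.values()
        for position, level in calls.items()
        if level == "high"
    )
    return sorted(pos for pos, n in counts.items() if n >= min_individuals)


print(naming_candidates(kits))  # [1001]
```

Position 2002 shows up in three kits, but only two of those calls are high confidence, so it stays unnamed for now.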

Both named and unnamed variants are a good thing.  New SNPs help expand and grow the tree.  Personal or family SNPs can be utilized in the same fashion as STR markers.  Eventually, as new SNPs are categorized and named, they will be moved from your Unnamed Variants page and added to your Named Variants page.

If you had results in the hg19 version, your unnamed variants will have changed.  Just like those kids sitting on the bleachers, your old variants either:

  • Are still here but with a new name
  • Have been given SNP names and are now on your Named Variants list

The great news is that you’ll very probably have new variants too, resulting from the new hg38 reference model and more accurate alignment.

If you’re really a die-hard and want to know which hg19 locations are now hg38 locations, you can do the address conversion here.  I am a die-hard but not this much of a die-hard, plus, I didn’t record the previous novel variant locations for my kits.  Dr. Sager who has run this program tells me that you only need to pay attention to the two drop down menus specifying the “original” and “new” assemblies when utilizing this tool.

Y Chromosome Browser Tool

You’ve probably already noticed the really cool new browser tool, positioned tantalizingly to the right of both results tabs.

Go ahead and click on either a SNP name or an unnamed variant.

Either one will cause a pop up box to open displaying the location you’ve selected in the Big Y browser.

Utilizing the new Y chromosome browser tool, you can see the number of times that a specific SNP was called as positive or negative during the scan of your Y DNA at that specific location.

To see an example, click on any SNP on the list under the SNP Name column.

The Y chromosome browser tool opens up at the location of the SNP you selected.

The SNP you selected is displayed in pink with a downward arrow pointing to the position of the SNP. The other pink locations display other nearby SNP positions.

See that one single pink blip to the far right in the example above?  That’s a good example of just one call, probably noise.  You can see the difference between that one single call and high confidence reads, illustrated by the columns of pink SNP reads lined up in a row.

You can click on any of your SNP positions, named or unnamed, to see more information for that specific SNP.

Pink indicates that a mutation, or derived value, was found at that location as compared to the ancestral value found in the reference model.

Blue rows and green rows indicate that the forward (blue) or reverse (green) strand was being read.

The intensity of the colors indicates the relative strength of the read confidence, where the most intense is the highest confidence.

The value listed at the top, T, A, C or G is the abbreviation for the ancestral reference nucleobase value found in the reference population at that genetic location, and the value highlighted in pink is the derived (mutated) value that you carry.

Confidence is a statistical value calculated based upon the number of scans, the relative quality of that part of the Y chromosome and the number of times that derived value was found during scanning.

I love this new tool.

I hope that in the next version, Family Tree DNA will include the ability to look at additional locations not on the list.

For example, I was recently working on a Personalized DNA Report where the SNP below the tester’s terminal SNP was not called one way or another, positive or negative.  I would have liked to view his results for that SNP location to see if he has any blips, or if the location read at all.

Matching

The third tab displays your Big Y matches and a mini-tree of your 5 SNPs at the end of your own personal branch of the haplotree.

Your terminal SNP determines the terminal (final or lowest) subbranch (on the Y-DNA haplotree) to which you belong.

On your mini-tree, your terminal SNP (R-BY490 above) is labeled YOU.

The number of people you match on those SNPs utilizing the new matching algorithm is displayed at each branch of the tree.

The matches shown above are the matches for this person’s terminal SNP. To see the people matching on the next branch above the terminal SNP, click on R-BY482.

The number listed beside these SNPs on your 5 step mini-tree is NOT the total number of people you match on that branch, only the number you match on that branch AFTER the matching algorithm is applied.

I put this in bold red, because based on the previous matching algorithm that managed to include everyone on your terminal SNP, it’s easy to presume the new version shows everyone in the system who matches you on that SNP – and it doesn’t necessarily.  If you assume it does or expect that it will, you’re likely to be wrong. There is a significant amount of confusion surrounding this topic in the community.

New Matching Algorithm

The Family Tree DNA matching algorithm has changed substantially. It needed to be updated, as the old matching algorithm had been outgrown with the dramatic new number of SNPs discovered and placed on the phylotree. Family Tree DNA created the original matching software when the Big Y was new and it was time for a refresh. In essence, the Big Y testing and tree-building has been successful beyond anyone’s wildest dreams and the matching routine became a victim of its own success.

Previously, Family Tree DNA used a static list of somewhere around 6,000 SNPs as compared to over 350,000 today, of which more than 78,000 have been placed on the tree. By the way, this SNP number grows with every batch of Big Y results because new SNPs are always found.

The previous threshold for mismatches was 4 SNPs. As time went on, this combination of a growing tree and a static SNP list caused increasingly irrelevant matches.

For example, in some instances, haplogroup U106 people matched haplogroup P312 people, two main branches of the R1b haplotree, because when compared to the old SNP list, they had 4 or fewer SNP mismatches.

The new Big Y matching routine expands as the new tree grows, and isn’t limited.  This means that people who were shown as matches to haplogroups far upstream (e.g. P312/U106), whose common ancestor lived many thousands of years ago, won’t be shown as matches at that level anymore.

Many people had hundreds of matches and complained that they were being shown matches so distant in time that the information was useless to them.

The previous Big Y version match criteria were:

  • 4 or fewer differences in Known SNPs (now Named Variants.)
  • In addition, you could have unlimited differences in Unnamed Variants, then called Novel Variants.

Family Tree DNA has attempted to make the matching algorithm more genealogically relevant by applying a different type of threshold to matching.

In the current Big Y version, a person is considered a match to you if they have BOTH of the following:

  • 30 or fewer differences in total SNPs (named and unnamed variants combined.)
  • Their haplogroup is downstream from your terminal SNP haplogroup or downstream from your four closest parent haplogroups, meaning any of the 5 haplogroups shown on your 5 step mini-tree.
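For readers who think in code, the two-part rule above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Family Tree DNA's implementation: the `Tester` record, the `parent_of` lookup, the placeholder haplogroup names HG-3 through HG-6, and counting differences as a simple set difference are all hypothetical. I've also assumed that someone on exactly one of your 5 mini-tree haplogroups counts as "downstream" of it.

```python
from dataclasses import dataclass

MAX_DIFFERENCES = 30  # threshold from the text: named and unnamed variants combined


@dataclass
class Tester:
    name: str
    haplogroup: str
    variants: frozenset  # hypothetical encoding of every variant the tester carries


def is_downstream(haplogroup, ancestor, parent_of):
    """True if haplogroup is ancestor itself or sits anywhere below it on the tree."""
    node = haplogroup
    while node is not None:
        if node == ancestor:
            return True
        node = parent_of.get(node)  # walk one step up the tree
    return False


def is_match(you, other, mini_tree, parent_of):
    """Apply BOTH conditions: 30 or fewer differences AND on one of your 5 branches."""
    differences = len(you.variants ^ other.variants)  # variants only one of you carries
    on_mini_tree = any(
        is_downstream(other.haplogroup, branch, parent_of) for branch in mini_tree
    )
    return differences <= MAX_DIFFERENCES and on_mini_tree


# Hypothetical tree: only R-BY490 and R-BY482 come from the example in this article.
parent_of = {
    "R-BY490": "R-BY482", "R-BY482": "HG-3", "HG-3": "HG-4",
    "HG-4": "HG-5", "HG-5": "HG-6", "OTHER": "HG-6",
}
mini_tree = ["R-BY490", "R-BY482", "HG-3", "HG-4", "HG-5"]
```

The point of the sketch is that a tester who shares your terminal SNP still drops off the list if the first condition fails, which is exactly the behavior that surprises people.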

Here’s the logic behind the new matching algorithm threshold.

SNP mutations happen on the average of one every 100 years.  This number is still discussed and debated, but this estimate is as good as any.

If the common ancestor of two men had two sons 1500 years ago, and each son’s line incurred 1 mutation every hundred years, each line would have accumulated about 15 mutations at the end of 1500 years, so the number of mutations between the two men would be approximately 30.

Family Tree DNA felt that 1500 years was a reasonable cutoff for a genealogical timeframe, hence the new matching threshold of 30 mutations difference.
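The arithmetic behind that threshold is simple enough to write down. A minimal sketch, assuming the one-mutation-per-100-years average quoted above (which, as noted, is still debated) and two lines that each mutate independently; the function name is mine, not anything from Family Tree DNA.

```python
YEARS_PER_MUTATION = 100    # rough average quoted above; still debated
GENEALOGICAL_WINDOW = 1500  # years Family Tree DNA treats as genealogically relevant


def expected_differences(years_since_ancestor, years_per_mutation=YEARS_PER_MUTATION):
    """Expected SNP differences between two men descended from one ancestor."""
    mutations_per_line = years_since_ancestor / years_per_mutation
    return 2 * mutations_per_line  # both lines accumulate mutations separately


print(expected_differences(GENEALOGICAL_WINDOW))  # 30.0
```

Keep in mind this is an average. Any individual line can run well ahead of or behind it.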

The new match criteria are designed to reflect your matches that are most closely related to you. In other words, the people on your match list should be related to you within approximately the last 1500 years, and people not on your match list who have taken the Big Y are separated from you by more than 30 mutations.

There may be people in the data base that match you on your terminal SNP and any or all of the SNPs shown on your mini-tree, but if you and they are separated by more than 30 differences (including both named and unnamed variants) on the Y chromosome, they will not be shown as a match.  

By clicking on the SNP name on your mini-tree, at right, you can see all of the people who match you with 30 or fewer differences total at each level, and who carry that particular Named Variant (SNP). The example shown above shows this person’s matches on their terminal SNP. If they were to click on BY482, the next step up, they would then see everyone on their match list who is positive for that SNP.

On your match page, you can search for a specific surname, nonmatching variants or match date.

The Shared Variants column is the total number of shared variants you have with the match in question. According to the lab at Family Tree DNA, this number is very high because it is reflective of many ancient variants.

You can also download your data from this page into a spreadsheet.

The Biggest Differences

What you don’t receive today, that you did receive before, is a comprehensive list of who you match on your terminal and upstream SNPs.

For example, I was working with someone’s results this week.  They had no matches, as shown below.

However, when I went to the relevant haplogroup project page, I discovered that indeed, there are at least 4 additional individuals who do share the same terminal SNP, but the tester would never know that from their Big Y results alone, if they didn’t check the project results page.

Of course, it’s unlikely that every person who takes the Big Y test joins a Y DNA project, or the same Y DNA project.  Even though projects will show some matches, assuming that the administrator has the project grouped in this manner, there is no guarantee you are seeing all of your terminal SNP matches.

Project administrators, who have been instrumental in building the tree, can also no longer see who matches on terminal SNPs, at least not if they are separated by more than 30 mutations. This hampers their ability to build the Y tree.

This matching change makes it critical that people join projects AND make their results viewable to project members as well as publicly. Most people don’t realize that the default when joining projects is that ONLY project members can see their results in the project. In other words, the results are not available in the public project, like the screenshot above.

You can read more about Family Tree DNA’s privacy settings here.

Another result of the matching algorithm change is that in some cases, one man may match a second man, but the second man does not show up on the first man’s match list.

I know that sounds bizarre, but in the Estes project, we have that exact scenario.

The chart above shows that none of the Estes Big Y participants match kit number 166011, also an Estes male, but kit 166011 does show matches to all of those Estes men.

Kit 166011 is the one to the far right on the pedigree chart above, and he is descended from a different son of Robert born in 1555 than the rest of the men.  Counting from kit 166011 to Robert born in 1555 is 12 generations.  Counting from kits 244708 and 199378 to Robert is 10 generations, so a total of 22 generations between those men.

Kits 366707, 9993 and 13805 are 11 generations from the common ancestor, so a total of 23 generations.  Not only are these genealogically relevant, they carry the same surname.

The average of 30 mutations reaching to 1500 years doesn’t work in this case.  The cutoff was about 1555, or 462 years, not 1500 years – so the matching algorithm failed at 30% of the estimated time it was supposed to cover.  I guess this just goes to prove that mutations really don’t happen on any type of a reliable schedule – and the average doesn’t always pertain to individual family circumstances.
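Here is the same back-of-the-envelope math for the Estes men, using only the numbers given above. The 100-years-per-mutation rate is the assumed average from earlier in this article, and the "expected" figure is simply what that average predicts. It says nothing about what these particular kits actually carry.

```python
YEARS_PER_MUTATION = 100  # assumed average from earlier in the article
THRESHOLD_YEARS = 1500    # window the 30-mutation threshold is meant to cover

# Generations from each kit back to Robert, born 1555, as counted in the text.
generations_to_robert = {
    "166011": 12,
    "244708": 10, "199378": 10,
    "366707": 11, "9993": 11, "13805": 11,
}


def generations_between(kit_a, kit_b):
    """Total generations separating two kits that descend from different sons of Robert."""
    return generations_to_robert[kit_a] + generations_to_robert[kit_b]


elapsed_years = 462  # years back to about 1555, as stated in the text
fraction_of_window = elapsed_years / THRESHOLD_YEARS
expected_differences = 2 * elapsed_years / YEARS_PER_MUTATION

print(generations_between("166011", "244708"))  # 22
print(generations_between("166011", "9993"))    # 23
print(f"{fraction_of_window:.1%}")              # 30.8%
print(round(expected_differences, 1))           # 9.2
```

In other words, the average rate would predict somewhere around 9 differences for men this closely related, nowhere near 30, which is why the missing matches are so unexpected.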

If you’re wondering if these men match on STR markers, they do.

In this case, the Big Y doesn’t show matches in a timeframe that STR markers do – the exact opposite of what we would expect.

One of the benefits of the Big Y, previously, was the ability to view people of other surnames who matched your SNP results.  This ability to peer back into time informed us of where our ancestors may have been prior to where we found them.  While this isn’t genealogy, per se, it’s certainly family history.

A good case in point is the Scottish clans and how men with different surnames may be related.

As a family historian I want to know who I match on my terminal SNP and the direct upstream SNPs so I can walk this line back in time.

What’s Coming

At the conference in Houston in November, Elliott Greenspan discussed a new direction for the Big Y in 2018.  The new feature that all Big Y testers are looking forward to is the addition of STRs beyond the 111 marker panels, extracted from the Big Y as a standard product offering. Meaning free for Big Y testers.

The 111 and lower panels will continue to be tested on their current Sanger platform. Analysis of more than 3700 samples in the data base that have both the Big Y and 111 markers indicates that only 72 of the 111 STR markers can be reliably and consistently extracted from the Big Y NGS scan data. The last thing we want is unreliable NGS data being compared to our Sanger sequenced STR values. We need to be able to depend on those results as always being reliable and comparable to each other. Therefore, only STR markers above 111 will be extracted from the Big Y and the original 111 STR markers will continue to be sold in panels, the same as today.

However, because of the nature of scanning DNA as opposed to directly testing locations, all of the markers above 111 will not be available for everyone. Some marker locations will fail to read, or fail to read reliably.  These won’t necessarily be the same markers, but read failure will apply to some markers in just about every individual’s scan.  Therefore, these additional STR markers will be supplemental to the regular 111 STR markers. You get what you get.

How many additional markers will be available through Big Y?  That hasn’t been finalized yet.

Elliott said that in order to reliably obtain 289 additional markers, they need to attempt to call 315.  To get 489, they have to attempt more than 600, and many are less useful.
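To put Elliott’s numbers in perspective, here is a quick calculation of the success rate at each level. Because he said "more than 600" attempts for the larger set, I’ve plugged in 600 as a stand-in, so the second and third figures are best-case upper bounds, not reported values.

```python
def yield_rate(reliable_markers, attempted_markers):
    """Fraction of attempted STR calls that end up reliable."""
    return reliable_markers / attempted_markers


first_tier = yield_rate(289, 315)

# "More than 600" attempts were mentioned, so 600 gives a best-case figure.
second_tier_best_case = yield_rate(489, 600)

# Yield on just the extra markers added when going from 289 to 489 (also best case).
marginal_best_case = yield_rate(489 - 289, 600 - 315)

print(f"{first_tier:.1%}")             # 91.7%
print(f"{second_tier_best_case:.1%}")  # 81.5%
print(f"{marginal_best_case:.1%}")     # 70.2%
```

The drop-off on the additional markers helps explain why the final count hasn’t been settled.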

Therefore, speculating, I’d guess that we’ll see someplace between 289 and 489, the numbers Elliott mentioned.

Are you salivating yet?

Given that the webpage and display tools have to be redesigned for both individuals’ results, project pages and project administrators’ tools, I’d guess that we won’t see this addition until after they get the kinks worked out of the hg38 conversion and analysis.

It’s nice to know that it’s on the way though. Something to look forward to later in 2018.

In Summary

I know that the upgrade to hg38 had to be done, but I hated to see it.  These things never go smoothly, no matter who you are and this was a massive undertaking.

I’m glad that Family Tree DNA is taking this opportunity to innovate and provide the community with the nifty new Y DNA browser.

I’m also grateful that they listen to their customers and make an effort to implement changes to help us along the genealogy path.

However, sometimes things fall into the well of unintended consequences. I think that’s what’s happening with the new matching routine. I know that they are continuing to work to tweak the knobs and refine the results, so you’re likely to see changes over the next few months. It’s not like there was a pattern or recipe anyplace. This has never been done before.

Here’s a list of changes and updates I’d suggest to improve the new hg38 Big Y experience:

  • In addition to threshold matching, an option for direct SNP tree matching through the 5 SNPs shown on the participant’s 5 step mini-tree, purely based on haplotree matching. This second option would replace the functionality lost with the 30-mutation threshold matching today.
  • A matches map of the most distant ancestors at each level of matching for both threshold matching and SNP tree matching.
  • An icon indicating whether a Big Y match is an STR match and which level of STR panel testing the match has completed. This means that we could tell at a glance that a Big Y match has tested to 111 markers, but is only a match at 12.
  • An icon indicating if the Big Y match has also taken the Family Finder test, and if they are a match.
  • An icon on STR matches pages indicating that a match has taken a Big Y test and if they are a match.
  • Ability to query through the Big Y browser to SNP locations not on the list of named or unnamed variants.
  • Age estimates for haplogroups.

If you are seeing Big Y results that you find unusual or confusing, please notify Family Tree DNA support. There is a contact link with a form at the bottom of your personal page. Family Tree DNA needs to be aware of problems and also of customers’ desires.

Family Tree DNA has indicated that they are soliciting customer feedback on the new Big Y matching and tools.

Please also join a relevant haplogroup project as well as a surname project, if you haven’t already. Here’s an article, What Project Do I Join?, to help you find relevant projects.

If you think you have an unnamed variant that should be named and placed on the phylotree, your haplogroup project administrator is the person who will work with you to verify that the unnamed variant is a good candidate and submit the unnamed variant to Family Tree DNA for naming.

If you are a project administrator having issues, questions or concerns, you can contact the group projects team at groups@ftdna.com.  Be sure that this address is in the “to” field, not the “cc” field as the e-mail will bounce otherwise.

Don’t forget that you can reference the Family Tree DNA Learning Center about your Big Y results.

Thank you to Dr. Sager for his assistance with this article.

_____________________________________________________________________

Standard Disclosure

This standard disclosure appears at the bottom of every article in compliance with the FTC Guidelines.

Hot links are provided to Family Tree DNA, where appropriate.  If you wish to purchase one of their products, and you click through one of the links in an article to Family Tree DNA, or on the sidebar of this blog, I receive a small contribution if you make a purchase.  Clicking through the link does not affect the price you pay.  This affiliate relationship helps to keep this publication, with more than 900 articles about all aspects of genetic genealogy, free for everyone.

I do not accept sponsorship for this blog, nor do I write paid articles, nor do I accept contributions of any type from any vendor in order to review any product, etc.  In fact, I pay a premium price to prevent ads from appearing on this blog.

When reviewing products, in most cases, I pay the same price and order in the same way as any other consumer. If not, I state very clearly in the article any special consideration received.  In other words, you are reading my opinions as a long-time consumer and consultant in the genetic genealogy field.

I will never link to a product about which I have reservations or qualms, either about the product or about the company offering the product.  I only recommend products that I use myself and bring value to the genetic genealogy community.  If you wonder why there aren’t more links, that’s why and that’s my commitment to you.

Thank you for your readership, your ongoing support and for purchasing through the affiliate link if you are interested in making a purchase at Family Tree DNA, or one of the affiliate links below:

Affiliate links are limited to:

Concepts – DNA Recombination and Crossovers

What is a crossover anyway, and why do I, as a genetic genealogist, care?

A crossover on a chromosome is where the chromosome is cut and the DNA from two different ancestors is spliced together during meiosis as the DNA of the offspring is created when half of the DNA of the two parents combines.

Identifying crossover locations, and who the DNA that we received came from is the first step in identifying the ancestor further back in our tree that contributed that segment of DNA to us.

Crossovers are easier to see than conceptualize.

Viewing Crossovers

The crossover is the location on each chromosome where the orange and black DNA butt up against each other – like a splice or seam.

In this example, utilizing the Family Tree DNA chromosome browser, the DNA of a grandchild is compared to the DNA of a grandparent. The grandchild received exactly 50 percent of her father’s DNA, but only the average of 25% of the DNA of each of her 4 grandparents. Comparing this child’s DNA to one grandmother shows that she inherited about half of this grandmother’s DNA – the other half belonging to the spousal grandfather.

  • The orange segments above show the locations where the grandchild matches the grandmother.
  • The black sections (with the exception of the very tips of the chromosomes) show locations where the grandchild does not match the grandmother, so by definition, the grandchild must match the grandfather in those black locations (except chromosome tips).
  • The crossover location is the dividing line between the orange and black. Please note that the ends of chromosomes are notoriously difficult and inconsistent, so I tend to ignore what appear to be crossovers at the tips of chromosomes unless I can prove one way or the other. Of the 22 chromosomes, 16 have at least one black tip. In some cases, like chromosome 16, you can’t tell since the entire chromosome is black.
  • Ignore the grey areas – those regions are untested because they are SNP poor.

We know that the grandchild has her grandmother’s entire X chromosome, because the parent is a male who only inherited an X chromosome from his mother, so that’s all he had to give his daughter. The tips of the X chromosome are black, showing that the area is not matching the mother, so that region is unstable and not reported.

It’s also interesting to note that in 6 cases, other than the X chromosome, the entire chromosome is passed intact from grandparent to grandchild; chromosomes 4, 11, 16, 20, 21 and 22.

Twenty-six crossovers occurred between mother and son, at 5cM.  This was determined by comparing the DNA of mother to son in order to ascertain the actual beginning and end of the chromosome matching region, which tells me whether the black tips are or are not crossovers by comparing the grandchild’s DNA to the grandmother.

For more about this, you might want to read Concepts – Segment Survival – Three and Four Generation Phasing.

Before going on, let’s look at what a match between a parent and child looks like, and why.

Parent/Child Match

If you’re wondering why I showed a match between a grandchild and a grandparent, above, instead of showing a match between a child and a parent, the chromosome browser below provides the answer.

It’s a solid orange mass for each chromosome indicating that the child matches the parent at every location.

How can this be if the child only inherits half of the parent’s DNA?

Remember – the parent has two chromosomes that mix to give the child one chromosome. When comparing the child to the parent, the child’s single chromosome inherited from the parent matches one of the parent’s two chromosomes at every address location – so it shows as a complete match to the parent even though the child is only matching one of the parent’s two chromosomes at each location. This isn’t a bug; it’s just how chromosome browsers work. In other words, the “other” chromosome that your parents carry is the one you don’t match.

The diagram below shows the mother’s two copies of chromosome 1 she inherited from her father and mother and which section she gave to her child.

You can see that the mother’s father’s chromosome is blue in this illustration, and the mother’s mother’s chromosome is pink.  The crossover points in the child are between part B and C, and between part C and D.  You can clearly see that the child, when compared to the mother, does in fact match the mother in all locations, or parts, 3 blue and 1 pink, even though the source of the matching DNA is from two different parents.

This example shows the child compared to both parents, so you can see that the child does in fact match both parents on every single location.

This is exactly why two different matches may match us on the same location, but may not match each other because they are from different sides of our family – one from Mom’s side and one from Dad’s.

You can read more about this in the article, One Chromosome, Two Sides, No Zipper – ICW and the Matrix.

The only way to tell which “sides” or pieces of the parent’s DNA that the child inherited is to compare to other people who descend from the same line as one of the parents.  In essence, you can compare the child to the grandparents to identify the locations that the child received from each of the 4 grandparents – and by genetic subtraction, which segments were NOT inherited from each grandparent as well, if one grandparent happens to be missing.

In our Parental Chromosome pink and blue diagram illustration above, the child did NOT inherit the pink parts A, B and D, and did not inherit the blue part C – but did inherit something from the parent at every single location. They also didn’t inherit an equal amount of their grandparents pink and blue DNA. If they inherited the pink part, then they didn’t inherit the blue part, and vice versa for that particular location.
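The pink and blue diagram described above can be mimicked in a few lines of code. This is only a toy model with four labeled parts, A through D, and labels I made up for the illustration. Real chromosome browsers compare hundreds of thousands of SNP positions.

```python
PARTS = ["A", "B", "C", "D"]

# The mother's two copies of the chromosome: blue from her father, pink from her mother.
mother_blue = {part: f"blue-{part}" for part in PARTS}
mother_pink = {part: f"pink-{part}" for part in PARTS}

# The single copy the child received: blue A, B and D, with pink C spliced in.
child_copy = {
    "A": mother_blue["A"],
    "B": mother_blue["B"],
    "C": mother_pink["C"],
    "D": mother_blue["D"],
}


def matches_mother(child):
    """For each part, does the child match at least one of the mother's two copies?"""
    return {part: child[part] in (mother_blue[part], mother_pink[part]) for part in PARTS}


def crossover_points(child):
    """Pairs of neighboring parts where the color switches, meaning a crossover."""
    colors = [child[part].split("-")[0] for part in PARTS]
    return [
        (PARTS[i], PARTS[i + 1])
        for i in range(len(PARTS) - 1)
        if colors[i] != colors[i + 1]
    ]


def not_inherited(child):
    """The pieces of the mother's DNA that the child did NOT receive."""
    return [
        label
        for part in PARTS
        for label in (mother_blue[part], mother_pink[part])
        if label != child[part]
    ]


print(matches_mother(child_copy))    # every part is True: a solid match to the mother
print(crossover_points(child_copy))  # [('B', 'C'), ('C', 'D')]
print(not_inherited(child_copy))     # ['pink-A', 'pink-B', 'blue-C', 'pink-D']
```

The child matches the mother everywhere, yet half of the mother’s DNA still isn’t passed on, which is the whole point of the parent/child browser view.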

The parent to child chromosome browser view also shows us that the very tip ends of the chromosomes are not included in the matching reports – because we know that the child MUST match the parent on one of their two chromosomes, end to end. The download or chart view provides us with the exact locations.

This brings us to the question of whether crossovers occur equally between male and female children. We already know that the X chromosome has a distinctive inheritance pattern – meaning that males only inherit an X from their mothers. A father and son will NEVER match on the X chromosome. You can read more about X chromosome inheritance patterns in the article, X Marks the Spot.

Crossovers Differ Between Males and Females

In the paper Genetic Analysis of Variation in Human Meiotic Recombination by Chowdhury, et al, we learn that males and females experience a different average number of crossovers.

The authors say the following:

The number of recombination events per meiosis varies extensively among individuals. This recombination phenotype differs between female and male, and also among individuals of each gender.

Notably, we found different sequence variants associated with female and male recombination phenotypes, suggesting that they are regulated by different genes.

Meiotic recombination is essential for the formation of human gametes and is a key process that generates genetic diversity. Given its importance, we would expect the number and location of exchanges to be tightly regulated. However, studies show significant gender and inter-individual variation in genome-wide recombination rates. The genetic basis for this variation is poorly understood.

The Chowdhury paper provides the following graphs. These graphs show the average number of recombinations, or crossovers, per meiosis for each of two different studies, the AGRE and the FHS study, discussed in the paper.

The bottom line of this paper, for genetic genealogists, is that males average about 27 crossovers per child and females average about 42, with the AGRE study families reporting 41.1 and the FHS study families reporting 42.8.

I have been collaborating with statistician, Philip Gammon, and he points out the following:

Male, 22 chromosomes plus the average of 27 crossovers = an average of 49 segments of his parent’s DNA that he will pass on to his children. Roughly half will be from each of his parents. Not exactly half. If there are an odd number of crossovers on a chromosome it will contain an even number of segments and half will be from each parent. But if there are an even number of crossovers (0, 2, 4, 6 etc.) there will be an odd number of segments on the chromosome, one more from one parent than the other.

The average size of segments will be approximately:

  • Males, 22 + 27 = 49 segments at an average size of 3400 / 49 = 69 cM
  • Females, 22 + 42 = 64 segments at an average size of 3400 / 64 = 53 cM
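Philip’s arithmetic is easy to check for yourself. The sketch below uses only his numbers (22 autosomes, roughly 3400 cM, and the average crossover counts of 27 and 42). The function names are mine, and the second function illustrates his point about odd versus even numbers of crossovers on a single chromosome.

```python
GENOME_CM = 3400  # approximate total length used in the figures above
AUTOSOMES = 22


def average_segment_cm(average_crossovers):
    """Average segment size when each crossover adds one segment to the 22 chromosomes."""
    segments = AUTOSOMES + average_crossovers
    return GENOME_CM / segments


def segments_per_parent(crossovers):
    """Segments on ONE chromosome from (the parent it starts with, the other parent)."""
    segments = crossovers + 1
    starting_parent = (segments + 1) // 2  # gets the extra segment when the count is odd
    other_parent = segments // 2
    return starting_parent, other_parent


print(round(average_segment_cm(27)))  # 69 (males)
print(round(average_segment_cm(42)))  # 53 (females)
print(segments_per_parent(3))         # (2, 2) odd crossovers, even split
print(segments_per_parent(2))         # (2, 1) even crossovers, one parent gets an extra
```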

This means that cumulatively, over time, in a line of entirely females, versus a line of entirely males, you’re going to see bigger chunks of DNA preserved (and lost) in males versus females, because the DNA divides fewer times. Bigger chunks of DNA mean better matching more generations back in time. When males do have a match, it would be likely to be on a larger segment.

The article, First Cousin Match Simulations speaks to this as well.

Practically Speaking

What does this mean, practically speaking, to genetic genealogists?

Few lines actually descend from all males or all females. Most of our connections to distant ancestors are through mixtures of male and female ancestors, so this variation in crossover rates really doesn’t affect us much – at least not on the average.

It’s difficult to discern why we match some cousins and we don’t match others. In some cases, rather than random recombination being a factor, the actual crossover rate may be at play. However, since we only know who we do match, and not who tested and we don’t match, it’s difficult to even speculate as to how recombination affected or affects our matches. And truthfully, for the application of genetic genealogy, we really don’t care – we (generally) only care who we do match – unless we don’t match anyone (or a second cousin or closer) in a particular line, especially a relatively close line – and that’s a horse of an entirely different color.

To me, the burning question to be answered, which still has not been unraveled, is why a difference in recombination rates exists between males and females. What processes are in play here that we don’t understand? What else might this not-yet-understood phenomenon affect?

Until we figure those things out, I note whether or not my match occurred through primarily men or women, and simply add that information into the other data that I use to determine match quality and possible distance.  In other words, information that informs me as to how close and reasonable a match is likely to be includes the following information:

  • Total amount of shared DNA
  • Largest segment size
  • Number of matching segments
  • Number of SNPs in matching segment
  • Shared matches
  • X chromosome
  • mtDNA or Y DNA match
  • Trees – presence, absence, accuracy, depth and completeness
  • Primarily male or female individuals in path to common ancestor
  • Who else they match, particularly known close relatives
  • Does triangulation occur

It would be very interesting to see how the instances of matches to a certain specific cousin level – say 3rd cousins (for example), fare differently in terms of the average amount of shared DNA, the largest segment size and the number of segments in people descended from entirely female and entirely male lines. Blaine Bettinger, are you listening? This would be a wonderful study for the Shared cM Project which measures actual data.

Isn’t the science of genetics absolutely fascinating???!!!


Imputation Analysis Utilizing Promethease

We know in the genetics industry that imputation is either coming or already here for genetic genealogy. I recently wrote two articles, here and here, explaining imputation and its (apparent) effects on matching – or at least the differences between vendors who do and don’t utilize imputation on the segments that are set forth as matches.

I will be writing shortly about my experience utilizing DNA.Land, a vendor who encourages testers to upload their files to be shared with medical researchers. In return, DNA.Land provides matching information and ethnicity – but they do impute results that you don’t have based on “typical” DNA that is generally inherited with the DNA you do have.

Aside from my own curiosity and interest in health, I have been attempting to determine the relative accuracy of imputation.

Promethease is a third party site that provides consumers who upload their autosomal DNA files with published information about their SNPs, mutations, either bad, good or neither, meaning just information. This makes Promethease the perfect avenue for comparing the accuracy of the imputed data provided by DNA.Land compared against the data provided by Promethease generated from files from vendors who do not impute.

Even better, I can directly compare the autosomal file from Family Tree DNA that I uploaded to DNA.Land with my resulting DNA.Land file after DNA.Land imputed another 38 million locations. I can also compare the DNA.Land results to an extensive exome test that provided results for some 50 million locations.

Uploading all of the files from various testing vendors separately to Promethease allows me to see which of the mutations imputed by DNA.Land are accurate when compared to actual DNA tests, and if the imputed mutations are accurate when the same location was tested by any vendor.

In addition to the typical genetic genealogy vendors, I’ve also had my DNA exome sequenced, which includes the 50 million locations in humans most likely to mutate.  This means those locations should be the locations most likely to be imputed by DNA.Land.

Finally, at Promethease, I can combine my results from all the vendors where I actually tested to provide the greatest coverage of actually tested locations, and then compare to DNA.Land – providing the most comprehensive comparison.

I will utilize the testing vendors’ actual results to check the DNA.Land imputed results.

Let’s see what the results produce.

The Test Process

The method I used for this comparison was to upload my Family Tree DNA autosomal raw data file to DNA.Land. DNA.Land then took the 700,000+ locations that I did test for at Family Tree DNA, and imputed more than 38 million additional locations, raising my tested and imputed number of locations to about 39 million.

Then I downloaded my huge DNA.Land file and uploaded it to Promethease, following the Promethease instructions.

In order to do a comparison against the imputed data that DNA.Land provided, I uploaded files from the following vendors individually, one at a time, to Promethease to see which versions of the files provided which results – meaning which mutations the files produced by actual testing at vendors could confirm in the DNA.Land imputed results.

  • DNA.Land (imputed)
  • Genos – Exome testing of 50 million medically relevant locations
  • Ancestry V1 test
  • Ancestry V2 test
  • Family Tree DNA
  • 23andMe V3 test
  • 23andMe V4 test
  • Combined file of all non-imputed vendor files

Promethease provides a wonderful feature that enables users to combine multiple vendors’ files into one run. As a final test, I combined all of my non-imputed files into one run in order to compare all of my non-imputed results, together, with DNA.Land’s imputed results.

Promethease provides results that fall into 3 categories:

  • Bad – red
  • Good – green
  • Grey – “not set” – neither bad nor good, just information

Promethease does not provide diagnoses of any form, just information from the published literature about various mutations and genetic markers and what has been found in research, with links to the sources through SNPedia.

Results

I compiled the following chart with the results of each individual file, plus a combined file made up of all of the non-imputed files.

The results are quite interesting.

The combined run that included all of the vendors’ files except for DNA.Land provided more “bad” results than the imputed DNA.Land file.

I expected that the Genos exome test would have covered all of the locations tested by the three genetic genealogy vendors, but clearly not, given that the combined run provides more results than the Genos exome run by itself. In fact, the total locations reported is 80,607 for the combined run and the Genos run alone was only 45,595.

DNA.Land only imputed 34,743 locations that returned results.

Comparison for Accuracy

Now, the question is whether the DNA.Land imputed results are accurate.

Due to the sheer number of results, I focused only on the “bad” results, the ones that would be most concerning, to get an idea of how many of the DNA.Land results were tested in the original uploaded file (from FTDNA) and how many were imputed. Of the imputed locations, I determined how many are accurate by comparing the DNA.Land results to the combined testing results. My hope, of course, is that most of the locations found in the DNA.Land imputed file are also to be found in one of the files tested at the vendors, and therefore covered in the combined file run.

I combined my results from 3 runs (FTDNA, DNA.Land and the combined file) into a common spreadsheet, color coding each result differently, with two goals in mind:

  • First, I wanted to see the locations reported as “bad” that were actually tested at FTDNA. By comparing the FTDNA locations with the DNA.Land imputed file, we know that DNA.Land was NOT imputing those locations, and conversely, that they WERE imputing the rest of the locations.
  • Second, I wanted to know if locations imputed by DNA.Land and reported as “bad” had been tested by any testing company, and if DNA.Land’s imputation was accurate as compared to an actual test.

You can read more about how Promethease reports results, here.

I’m showing two results in the spreadsheet example, below.

White row=FTDNA test result
Yellow row =DNA.Land result
Blue row=combined test result

These two examples show two mutations that are ranked as “bad” for the same condition. This result really only tells me that I metabolize some things slower than other people. Reading the fine print tells me this as well:

The proportion of slow and rapid metabolizers is known to differ between different ethnic populations. In general, the slow metabolizer phenotype is most prevalent (>80%) in Northern Africans and Scandinavians, and lowest (5%) in Canadian Eskimos and Japanese. Intermediate frequencies are seen in Chinese populations (around 20% slow metabolizers), whereas 40 – 60% of African-Americans and most non-Scandinavian Caucasians are slow metabolizers.[PMID 16416399]

Many of you are probably slow metabolizers too.

I used this example to illustrate that not everything that is “bad” is going to keep you awake at night.

The first mutation, gs140, is found in the DNA.Land file, but there is no corresponding white row, representing the original Family Tree DNA report, meaning that DNA.Land imputed the result. However, gs140 is tested by some vendor in the combined file. The results do match (verified by actually comparing the results individually) and therefore, the DNA.Land imputation was accurate as noted in the DNA.Land Analysis column at far right.

In the second example, gs154 is reported by DNA.Land, but since it’s also reported by Family Tree DNA in the white row, we know that this value was NOT imputed by DNA.Land, because this was part of the originally uploaded file. Therefore, in the Analysis column, I labeled this result as “tested at FTDNA.”
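For readers who like to see the logic spelled out, the row-by-row check described above can be sketched in a few lines of Python. This is only an illustration, not the spreadsheet itself. The function name, the dictionary layout and the placeholder values "A" and "B" are all hypothetical; only the presence or absence of a row in each run matters.

```python
def classify_result(marker, dnaland, ftdna, combined):
    """Label one DNA.Land result the way the Analysis column does.

    Each argument after `marker` is a dict mapping a marker name to the
    value reported in that Promethease run (hypothetical layout).
    """
    if marker in ftdna:
        # Present in the uploaded FTDNA file, so DNA.Land did not impute it.
        return "tested at FTDNA"
    if marker not in combined:
        # Imputed, but no vendor actually tested this location.
        return "imputed, cannot be verified"
    if dnaland[marker] == combined[marker]:
        return "imputed correctly"
    return "imputed incorrectly"


# Placeholder values, not real genotypes.
dnaland = {"gs140": "A", "gs154": "B"}
ftdna = {"gs154": "B"}
combined = {"gs140": "A", "gs154": "B"}

for marker in dnaland:
    print(marker, "->", classify_result(marker, dnaland, ftdna, combined))
```

A conflict between vendors would need one more branch that compares the individual vendor runs to each other, which this sketch leaves out.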

Analysis

I analyzed each of the rows of “bad” results found in the DNA.Land file by comparing them first to the FTDNA file and then the Combined file. In some cases, I needed to return to the various vendor results to see which vendor had done the testing on a specific location in order to verify the result from the individual run.

So, how did DNA.Land do with imputing data as compared with actual tested results?

Category | # Results | % | Comment
Tested, not Imputed | 171 | 38.6 | This “bad” location was tested at FTDNA and uploaded, so we know it was reported accurately at DNA.Land and not imputed.
Total Imputed* | 272 | 61.4 | Meaning total of “bad” results not tested at FTDNA, so not uploaded to DNA.Land, therefore imputed.
Imputed Correctly | 259 | 95.22 | This result was verified to match a tested location in the combined run.
Imputed, but not tested elsewhere | 6 | 2.21 | Accuracy cannot be confirmed.
Conflict | 3 | 1.10 | DNA.Land results cannot be verified due to an error of some sort – two of these three are probably accurate.
Imputed Incorrectly | 4 | 1.47 | Confirmed by the combined run where the location was actually tested at multiple vendor(s).
Not reported, and should have been | 1 | 0.37 | 4 other vendor tests showed this mutation, including FTDNA which was uploaded to DNA.Land. Therefore these locations should have been reported by the DNA.Land file.

*The total number of “bad” results was 443, 171 that were tested and 272 that were imputed. Note that the percentages of imputations shown below the “Total Imputed” number of 272 are calculated based on the number of locations imputed, not on the total number of locations reported.
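The percentages in the chart can be re-derived from the counts. Here is a minimal sketch of that arithmetic, using only the numbers reported above. The variable and function names are my own for illustration.

```python
tested = 171
imputed = 272
total_bad = tested + imputed  # 443 "bad" results in total

# Breakdown of the imputed results only.
imputed_breakdown = {
    "Imputed Correctly": 259,
    "Imputed, but not tested elsewhere": 6,
    "Conflict": 3,
    "Imputed Incorrectly": 4,
}


def pct(count, base):
    """Percentage of `base` represented by `count`, rounded to 2 places."""
    return round(100 * count / base, 2)


# These two rows are calculated against all 443 "bad" results.
print("Tested, not Imputed:", pct(tested, total_bad))
print("Total Imputed:", pct(imputed, total_bad))

# These rows are calculated against the 272 imputed results.
for label, count in imputed_breakdown.items():
    print(label + ":", pct(count, imputed))
```

Note that the four imputed categories add up to 272, while the one location that was not reported and should have been sits outside that total.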

Concerns, Conflicts and Errors

It’s worth noting that my highest imputed “bad” risk from DNA.Land was not tested elsewhere, so cannot be verified, which concerns me.

On the three results where a conflict exists, all 3 locations were tested at multiple other vendors, and the results at the other vendors where the results were actually tested show different results from each other, which means that the DNA.Land result cannot be verified as accurate. Clearly, an error exists in at least one of the other tests.

In one conflict case, this error has occurred at 23andMe on either their V3 or V4 chip, where the results do not match each other.

In a second conflict case, two of the other vendors agree and the DNA.Land imputation is likely accurate, as it matches 2 of the three other vendor tests.

In the third conflict case, the Ancestry V2 test confirms one of the 23andMe results, which matches the DNA.Land results, so the DNA.Land result is likely accurate.

Of the 4 results that were confirmed to be imputed incorrectly, all locations were tested at multiple vendors. In two cases, the location was confirmed on two other tests and in the other two cases, the location was tested at three vendors. The testing vendors’ results all matched each other.

Summary

Overall, given the problems found with both DNA.Land and MyHeritage, who both impute, relative to genetic genealogy matching, I was surprised to find that the DNA.Land imputed health results were relatively accurate.

I expected the locations reported in the FTDNA file to be reported accurately by DNA.Land, because that data was provided to them. In one case, it was not.

Of the 272 “bad” results imputed, 259, or 95.22% could be verified as accurate.

Six could not be verified, and three were in conflict, but of those, it’s likely that two of the three were imputed accurately by DNA.Land. The third can’t be verified. This totals 3.31% of the imputed results that are ambiguous.

Only 1.47% were imputed incorrectly. If you add the 0.37% for the location that was not reported and should have been, and make the leap of assumption that the one of three in conflict is in error, DNA.Land is still at just over a 2% confirmed error rate.

I can see why Illumina would represent to the vendors that imputation technology is “very accurate.” “Very” of course is relative, pardon the pun, in genetic genealogy, to how well matching occurs, not only when the new GSA chip is compared to another GSA chip, but when the new GSA version is compared to the older OmniExpress version. For backward compatibility between the chip versions, imputation must be utilized. Thanks a lot Illumina (said in my teenage sarcastic voice).

Since DNA.Land accepts files from all the vendors on all chips, for DNA.Land to be able to compare all locations in all vendors’ files against each other, the “missing” data in each file must be imputed. MyHeritage is doing something similar (having hired one of the DNA.Land developers), and both vendors have problems with genetic genealogy matching.

This begs the question of why the matching is demonstrably so poor for genetic genealogy. I’ve written about this phenomenon here, Kitty Cooper wrote about it here and Leah Larkin here.

Based on this comparison, each individual DNA.Land imputed file would contain about a 2% error rate of incorrectly imputed data, assuming the error rate is the same across the entire file, so a combined total of 4% for two individuals, if you’re just looking at individual SNPs. Perhaps entire segments are being imputed incorrectly, given that we know that DNA is inherited in segments. If that is the case, and these individual SNPs are simply small parts of entire segments that are imputed incorrectly, they might account for an equal number of false positive matches. In other words, if 10 segments are imputed incorrectly for me, that’s 10 segments reporting false positive matches I’ll have when paired against anyone who receives the same imputed data. However, that doesn’t explain the matches that are legitimate (on tested segments) and aren’t found by the imputing vendors, and it doesn’t explain an erroneous match rate that appears to be significantly higher than the 2-4% found in this comparison.
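As a rough check on the 4% figure, here is a small sketch of the arithmetic. It assumes, purely for illustration, that imputation errors in the two files are independent of each other, which may not hold if both people receive the same incorrectly imputed segment. The variable names are my own.

```python
per_file_error = 0.02  # about 2% per imputed file, from the comparison above

# Simply adding the two error rates together.
simple_sum = 2 * per_file_error

# Chance that at least one of the two files is wrong at a given SNP,
# under the independence assumption stated above.
at_least_one_wrong = 1 - (1 - per_file_error) ** 2

print(f"Simple sum: {simple_sum:.2%}")
print(f"At least one wrong: {at_least_one_wrong:.2%}")
```

Either way of counting lands at roughly 4% for a pair of imputed files when looking at individual SNPs.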

I’ll be writing about the DNA.Land matching comparison experience shortly.

I would strongly prefer that medical research be performed on fully tested individuals. I realize that encouraging consumers to upload their data, and then imputing additional information, is much less expensive than actual testing. However, accuracy is an issue, and a 2% error rate, if someone is dealing with life-saving and life-threatening research, could be a huge margin of error from the beginning of the project, based on faulty imputation – which could be eliminated by simply testing people. This seems like an unnecessary risk and faulty research just waiting to happen. This error rate is on top of the actual sequencing error rate, but sequencing errors will be found in different locations in individuals, not on the same imputed segment assigned to multiple people in population groups. Imputation errors could be cumulative in one location, appearing as a hot spot when in reality, it’s an imputation error.

As related to genetic genealogy, I don’t think imputation and genetic genealogy are good bedfellows. DNA.Land’s matching was even worse when it was initially introduced, which is one reason I’ve waited so long to upload and write about the service.

Unfortunately, with Illumina obsoleting the OmniExpress chip, we’re not going to have a choice, sooner rather than later. All vendors who utilized the OmniExpress chip are being forced off, either onto the GSA chip or to an Exome or full sequence chip. The cost of sequencing for anything other than the GSA chip is simply more than the genetic genealogy market will stand, not to mention even larger compatibility issues. My Genos Exome test cost $499 just a few months ago and still sells for that price today.

The good news is that utilizing imputation, we will still receive matches, just less accurate matches when comparing the new chip to older versions, and when using imputation.

New testers will never know the difference. Testers not paying close attention won’t notice or won’t realize either. That leaves the rest of us “old timers” who want increased accuracy and specificity, not less, flapping in the wind along with the vendors who don’t sell our test results into the medical arena and have no reason to move to the new GSA platform other than Illumina obsoleting the OmniExpress chip.

Like I said, thanks Illumina.

Imputation Matching Comparison

In a future article, I’ll be writing about the process of uploading files to DNA.Land and the user experience, but in this article, I want to discuss only one topic, and that’s the results of imputation as it affects matching for genetic genealogy. DNA.Land is one of three companies known positively to be using imputation (DNA.Land, MyHeritage and LivingDNA), and one of two that allows transfers and does matching for genealogy.

This is the second in a series of three articles about imputation.

Imputation, discussed in the article, Concepts – Imputation, is the process whereby your DNA that is tested is then “expanded” by inferring results you don’t have, meaning locations that haven’t been tested, by using information from results you do have. Vendors have no choice in this matter, as Illumina, the maker of the DNA chip widely utilized in the genetic genealogy marketspace, has obsoleted the prior chip and moved to a new chip with only about 20% overlap in the locations previously tested. Imputation is the methodology utilized to attempt to bridge the gap between the two chips for genetic genealogy matching and ethnicity predictions.

Imputation is built upon two premises:

1 – that DNA locations are inherited together

2 – that people from common populations share a significant amount of the same DNA

An example of imputation that DNA.Land provides is the following sentence.

I saw a blue ca_ on your head.

There are several letters that are more likely than others to be found in the blank and some words would be more likely to be found in this sentence than others.

A less intuitive sentence might be:

I saw a blue ca_ yesterday.

DNA.Land doesn’t perform DNA testing, but instead takes a file that you upload from a testing vendor that has around 700,000 locations and imputes another 38.3 million variants, or locations, based on what other people carry in neighboring locations. These numbers are found in the SNPedia instructions for uploading DNA.Land information to their system for usage with Promethease.

I originally wrote about Promethease here, and I’ll be publishing an updated article shortly.

In this article, I want to see how imputation affects matching between people for genetic genealogy purposes.

Genetic Genealogy Matching

In order to be able to do an apples to apples comparison, I uploaded my Family Tree DNA autosomal file to DNA.Land.

DNA.Land then processed my file, imputed additional values, then showed me my matches to other people who have also uploaded and had additional locations imputed.

DNA.Land has just over 60,000 uploads in their database today. Of those, I match 11 at a high confidence level and one at a speculative level.

My best match, meaning my closest match, Karen, just happened to have used her GedMatch kit number for her middle name. Smart lady!

Karen’s GedMatch number provided me with the opportunity to compare our actual match information at DNA.Land, then also at GedMatch, then compare the two different match results in order to see how much of our matching was “real” from portions of our tested kits that actually match, and what portion of our DNA matches as a result of the DNA.Land imputation.

At DNA.Land, your match information is presented with the following information:

  • Relationship degree – meaning estimated relationship
  • # shared segments – although many of these are extremely small
  • Total shared cM
  • Total recent shared length in cM
  • Longest recent shared segment in cM
  • Relationship likelihood graph
  • Shared segments plotted on chromosome display
  • Shared segments in a table

Please note that you can click on any graphic to enlarge.

DNA.Land provides what they believe to be an accurate estimate of recent and anciently shared DNA segments.

The match table is a dropdown underneath the chromosome graphic at far right:

For this experiment, I copied the information from the match table and dropped it into a spreadsheet.

DNALand Match Locations

My match information is shown at DNA.Land with Karen as follows:

Matching segments are identified by DNA.Land as either recent or ancient, which I find to be over-simplified at best and misleading or inaccurate at worst. I guess it depends on how you perceive recent and ancient. I think they are trying to convey the concept that larger segments tend to be more recent, and smaller segments tend to be older, but ancient in the genetics field often refers to DNA extracted from exhumed burials from thousands of years ago. Furthermore, smaller segments can be descended from the same ancestor as larger segments.

GedMatch Match

Since Karen so kindly provided her GedMatch kit number, I signed in to GedMatch and did a one-to-one match with this same kit.

Since all of the segments are 3 cM and over at DNA.Land, I utilized a GedMatch threshold of 3 cM and dropped the SNP count to 100, since a SNP count of 300 gave me few matches. For this comparison, I wanted to see all my matches to Karen, no matter how few SNPs are involved, in an attempt to obtain results similar to DNA.Land. I normally would not drop either of these thresholds this low. My typical minimum is 5cM and 500 SNPs, and even if I drop to 3cM, I still maintain the 500 SNP threshold.

Let’s see how the data from GedMatch and DNA.Land compares.

In my spreadsheet, below, I pasted the segment match information from DNA.Land in the first 5 columns with a red header. Note that DNA.Land does not provide the number of shared SNPs.

At right, I pasted the match information from GedMatch, with a green header. We know that GedMatch has a history of accurately comparing segments, and we can do a cross platform comparison. I originally uploaded my FTDNA file to DNA.Land and Karen uploaded an Ancestry file. Those are the two files I compared at GedMatch, because the same actual matching locations are being compared at both vendors, DNA.Land (in addition to imputed regions) and GedMatch.

I then copied the matching segments from GedMatch (3cM, 100 SNPs threshold) and placed them in the middle columns in the same row where they matched corresponding DNA.Land segments. If any portion of the two vendors’ segments overlapped, I copied them as a match, although two are small and partial and one is almost negligible. As you can see, there are only 10 segments with any overlap at all in the center section. Please note that I am NOT suggesting these are valid or real matches. At this point, it’s only a math/match exercise, not an analysis.
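The “any portion overlaps” rule is easy to express in code. The sketch below is a hypothetical illustration of that rule, not the process I followed in the spreadsheet. The function names and the segment positions are made up and are not my actual match data with Karen.

```python
def overlaps(seg_a, seg_b):
    """True if two segments on the same chromosome share any positions.

    A segment is a (chromosome, start, end) tuple.
    """
    chrom_a, start_a, end_a = seg_a
    chrom_b, start_b, end_b = seg_b
    return chrom_a == chrom_b and start_a <= end_b and start_b <= end_a


def pair_segments(dnaland_segments, gedmatch_segments):
    """For each DNA.Land segment, list the GedMatch segments that overlap it."""
    return {
        seg: [g for g in gedmatch_segments if overlaps(seg, g)]
        for seg in dnaland_segments
    }


# Made-up positions for illustration only.
dnaland_segments = [
    (8, 1_000_000, 9_000_000),
    (8, 9_500_000, 15_000_000),
    (3, 2_000_000, 4_000_000),
]
gedmatch_segments = [(8, 900_000, 15_200_000)]

pairs = pair_segments(dnaland_segments, gedmatch_segments)
unmatched = [seg for seg, hits in pairs.items() if not hits]
print(len(unmatched), "DNA.Land segment(s) with no GedMatch overlap")
```

In this made-up example, one GedMatch segment on chromosome 8 overlaps two DNA.Land segments, which is the same pattern as a single tested segment being reported in two pieces.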

The match comparison column (yellow header) is where I commented on the match itself. In some cases, the lack of the number of SNPs at DNA.Land was detrimental to understanding which vendor was a higher match. Therefore, when possible, I marked the higher vendor in the Match Comparison column with the color of their corresponding header.

Analysis

Frankly, I was shocked at the lack of matching between GedMatch and DNA.Land. Trying to understand the discrepancy, I decided to look at the matches between Karen, who has been very helpful, and me at other vendors.

I then looked at our matches at Ancestry, 23andMe, MyHeritage and at Family Tree DNA.

The best comparison would be at Family Tree DNA where Karen loaded her Ancestry file.  Therefore, I’m comparing apples to apples, meaning equivalent to the comparison at GedMatch and DNA.Land (before imputation).

It’s impossible to tell much without a chromosome browser at Ancestry, especially after Timber processing which reduces matching DNA.

DNA.Land categorized my match to Karen as “high certainty.” My match with Karen appears to be a valid match based on the longest segment(s) of approximately 30cM on chromosome 8.

  • Of the 4 segments that DNA.Land identifies as “recent” matches, 2 are not reflected at all in the GedMatch or Family Tree DNA matching, suggesting that these regions were imputed entirely, and incorrectly.
  • Of the 4 segments that DNA.Land identifies as “recent” matches, the 2 on chromosome 8 are actually one segment that imputation apparently divided. According to DNA.Land, imputation can increase the number of matching segments. I don’t think it should break existing segments, meaning segments actually tested, into multiple pieces. In any event, the two vendors do agree on this match, even though DNA.Land breaks the matching segment into two pieces where GedMatch and Family Tree DNA do not. I’m presuming (I hate that word) that this is the one segment that Ancestry calls as a match as well, because it’s the longest, but Ancestry’s Timber algorithm downgrades the match portion of that segment by removing 11cM (according to DNA.Land) from 29cM to 18cM or removes 13cM (according to both GedMatch and Family Tree DNA) from 31cM to 18cM. Both GedMatch and Family Tree DNA agree and appear to be accurate at 31cM.
  • Of the total 39 matching segments of any size, utilizing the 3cM threshold and 100 SNPs, which I set artificially very low, GedMatch only found 10 matching segments with any portion of the segment in common, meaning that at least 29 were entirely erroneous matches.
  • Resetting the GedMatch match threshold to 3 cM and 300 SNPs, a more reasonable SNP threshold for 3cM, GedMatch only reports 3 matching segments, one of which is chromosome 8 (undivided). Because DNA.Land splits that chromosome 8 segment in two, those 3 GedMatch segments account for 4 DNA.Land segments, which means at this threshold, 35 of the 39 matching DNA.Land segments are entirely erroneous. Setting the threshold to a more reasonable 5cM or 7cM and 500 SNPs would result in only the one match on chromosome 8.

  • If 29 of 39 segments (at 3cM 100 SNPs) are erroneously reported, that equates to 74.36% erroneous matches due to imputation alone, without considering identical by chance (IBC) matches.
  • If 35 of 39 segments (at 3cM 300 SNPs) are erroneously reported, that equates to 89.74% erroneous matches, again without considering those that might be IBC.
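The two percentages above come straight from the segment counts. A minimal sketch of the calculation, with a function name of my own choosing:

```python
def erroneous_share(erroneous, total):
    """Percentage of DNA.Land segments with no corresponding GedMatch segment."""
    return 100 * erroneous / total


total_dnaland_segments = 39

for label, erroneous in [("3cM / 100 SNPs", 29), ("3cM / 300 SNPs", 35)]:
    share = erroneous_share(erroneous, total_dnaland_segments)
    print(f"{label}: {erroneous} of {total_dnaland_segments} = {share:.2f}%")
```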

Predicted vs Actual

One additional piece of information that I gathered during this process is the predicted relationship.

Vendor | Total cM | Total Segments | Longest Segment | Predicted Relationship
DNA.Land | 162 to 3 cM | 39 to 3 cM | 17.3 & 12, split | 3C
GedMatch | 123 to 3 cM | 27 to 3 cM | 31.5 | 5.1 gen distant
Family Tree DNA | 40 to 1 cM | 12 to 1 cM | 32 | 3-5C
MyHeritage | No match | No match | No match | No match
Ancestry | 18.1 | 1 | 18.1 | 5-8C
23andMe | 26 | 1 | 26 | 3-6C

Karen utilized her Ancestry file and I used my Family Tree DNA file for all of the above matching except at 23andMe and Ancestry where we are both tested on the vendors’ platform. Neither 23andMe nor Ancestry accepts uploads. I included the 23andMe and Ancestry comparisons as additional reference points.

The lack of a match at MyHeritage, another company that implements imputation, is quite interesting. Karen and I, even with a significantly sized segment are not shown as a match at MyHeritage.

If imputation actually breaks some matching segments apart, like the chromosome 8 segment at DNA.Land, it’s possible that the resulting smaller individual segments simply didn’t exceed the MyHeritage matching threshold. It would appear that the MyHeritage matching threshold is probably 9cM, given that my smallest segment match of all my matches at MyHeritage is 9cM. Therefore, a 31 or 32 cM segment would have to be broken into 4 roughly equally sized pieces (32/4=8) for the match to Karen not to be detected because all segment pieces are under 9cM. MyHeritage has experienced unreliable matching since their rollout in mid 2016, so their issue may or may not be imputation related.
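The 32/4=8 reasoning can be generalized a little. This sketch assumes the segment is broken into equal pieces, and it treats the 9cM threshold as my guess based on my own match list, not a figure published by MyHeritage. The function name is hypothetical.

```python
import math


def pieces_needed(segment_cm, threshold_cm):
    """Smallest number of equal pieces that each fall below the threshold."""
    return math.floor(segment_cm / threshold_cm) + 1


for segment_cm in (31, 32):
    n = pieces_needed(segment_cm, 9)
    print(f"{segment_cm} cM needs {n} equal pieces of {segment_cm / n:.2f} cM each")
```

Unequal pieces would change the picture, since any single piece of 9cM or more would presumably still be reported as a match.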

The Common Ancestor

At Family Tree DNA, Karen does not match my mother, so I can tell positively that she is related through my father’s line. She and I triangulate on our common segment with three other individuals who descend from Abraham Estes 1647-1720.

Utilizing the chromosome browser, we do indeed match on chromosome 8 on a long segment, which is also our only match over 5cM at Family Tree DNA.

Based on our trees as well as the trees of our three triangulated Estes matches, Karen and I are most probably either 8th cousins, or 8th cousins once removed, assuming that is our only common line. I am 8th cousins with the other three triangulated matches on chromosome 8. Karen’s line has yet to be proven.

Imputation Matching Summary

I like the way that DNA.Land presents some of their features, but as for matching accuracy, you can view the match quality in various ways:

  1. DNA.Land did find the large match on chromosome 8. Of course, in terms of matching, that’s pretty difficult to miss at roughly 30cM, although MyHeritage managed. Imputation did split the large match into two, somehow, even though Karen and I match on that same segment as one segment at other vendors comparing the same files.
  2. Of the 39 DNA.Land total matches, other than the chromosome 8 match, two other matches are partial matches, according to GedMatch. Both are under 7cM.
  3. Of DNA.Land’s total 39 matches, 35 are entirely wrong, in addition to the two that are split, including two inaccurate imputed matches at over 5cM.
  4. At DNA.Land, I’m not so concerned about discerning between “real” and “false” small segment matches, as compared to both FTDNA and GedMatch, as I am about incorrectly imputed segments and matches. Whether small matches in general are false positives or legitimate can be debated, each smaller segment match based on its own merits. Truthfully, with larger segments to deal with, I tend to ignore smaller segments anyway, at least initially. However, imputation adds another layer of uncertainty on top of actual matching, especially, it appears, with smaller matches. Imputing entire segments of incorrect DNA concerns me.
  5. Having said that, I find it very concerning that MyHeritage who also utilizes imputation missed a significant match of over 30cM. I don’t know of a match of this size that has ever been proven to be a false match (through parental phasing), and in this case, we know which ancestor this segment descends from through independent verification utilizing multiple other matches. MyHeritage should have found that match, regardless of imputation, because that match is from portions of the two files that were both tested, not imputed.

Summary

To date, I’m not impressed with imputation matching relative to genetic genealogy at either DNA.Land or MyHeritage.

In one case, that of DNA.Land, imputation shows matches for segments that are not shown as matches at either Family Tree DNA or GedMatch who are comparing the same two testers’ files, but without imputation. Since DNA.Land did find the larger segment, and many of their smaller segments are simply wrong, I would suggest that perhaps they should only show larger segments. Of course, anyone who finds DNA.Land is probably an experienced genetic genealogist and probably already has files at both GedMatch and Family Tree DNA, so hopefully savvy enough to realize there are issues with DNA.Land’s matching.

In the second imputation case, that of MyHeritage, the match with Karen is missed entirely, although that may not be a function of imputation. It’s hard to determine. MyHeritage is also comparing the same two files that Karen and I uploaded to the other vendors who found that match, both those that do and those that don’t utilize imputation.

Regardless of imputing additional locations, MyHeritage should have found the matching segment on chromosome 8 because that region does NOT need to be imputed. Their failure to do so may be a function of their matching routine and not of imputation itself. At this point, it’s impossible to discern the cause. We only know, based on matching at other vendors, that the non-match at MyHeritage is inaccurate.

Here’s what DNA.Land has to say about the imputed VCF file, which holds all of your imputed values, when you download the file. They pull no punches about imputation.

“Noisey and probabilistic.” Yes, I’d say they are right, and problematic as well, at least for genetic genealogists.

Extrapolating this even further, I find it more than a little frightening that my imputed data at DNA.Land will be utilized for medical research.

Quoting now from Promethease, a medical reference site that allows the consumer to upload their raw data files, providing consumers with a list of SNPs having either positive or negative research in academic literature:

DNA.land will take a person’s data as produced by such companies and impute additional variants based on population frequency statistics. To put this in concrete terms, a person uploading a typical 23andMe file of ~700,000 variants to DNA.land will get back an (imputed) file of ~39 million variants, all predicted to be present in the person. Promethease reports from such imputed files typically contain about 50% more information (i.e. 50% more genotypes) than the corresponding reports from raw (non-imputed) data.

Translated, this means that a report from your imputed data provides about half again as much “genetic information” as a report from your actual tested data. The question remains, of course, how much of this imputed data is accurate.

That will be the topic of the third imputation article. Stay tuned.

_____________________________________________________________________

Standard Disclosure

This standard disclosure appears at the bottom of every article in compliance with the FTC Guidelines.

Hot links are provided to Family Tree DNA, where appropriate. If you wish to purchase one of their products, and you click through one of the links in an article to Family Tree DNA, or on the sidebar of this blog, I receive a small contribution if you make a purchase. Clicking through the link does not affect the price you pay. This affiliate relationship helps to keep this publication, with more than 850 articles about all aspects of genetic genealogy, free for everyone.

I do not accept sponsorship for this blog, nor do I write paid articles, nor do I accept contributions of any type from any vendor in order to review any product, etc. In fact, I pay a premium price to prevent ads from appearing on this blog.

When reviewing products, in most cases, I pay the same price and order in the same way as any other consumer. If not, I state very clearly in the article any special consideration received. In other words, you are reading my opinions as a long-time consumer and consultant in the genetic genealogy field.

I will never link to a product about which I have reservations or qualms, either about the product or about the company offering the product. I only recommend products that I use myself and bring value to the genetic genealogy community. If you wonder why there aren’t more links, that’s why and that’s my commitment to you.

Thank you for your readership, your ongoing support and for purchasing through the affiliate link if you are interested in making a purchase at Family Tree DNA.

Using Spousal Surnames and DNA to Unravel Male Lines

When Y DNA matching at Family Tree DNA, it’s not uncommon for men to match other males of the same surname who share the same ancestor. In fact, that’s what we hope for, fervently!

However, if you’re stuck downstream, you may need to figure out which of several male children you descend from.

If you’re staring at a brick wall working your way back in time, you may need to try working forward, utilizing various types of information, including wives’ surnames.

For all intents and purposes, this is my Vannoy line, in Wilkes County, NC, so let’s use it as an example, because it embodies both the promise and the peril of this approach.

So, there you sit, disconnected from the Vannoy line. That little yellow box is just so depressing. So close, but yet so far. And yes, we’ve already exhausted the available paper trail records, years ago.

We know the lineage back through Elijah Vannoy, who was born between 1784-1786 in Wilkes County, or vicinity. We know my Vannoy cousin Y DNA matches with other men from the Vannoy line upstream of John Francis Vannoy, the known father of four sons in Wilkes County, NC and the first (and only) Vannoy to move from New Jersey to that part of North Carolina.

Therefore, we know who the candidates are to be Elijah’s father, but the connection in the yellow box is missing. Many Wilkes County records have gone missing over the years and births were not recorded in that timeframe.  The records from neighboring Ashe County where Daniel Vannoy lived burned during the Civil War, although some records did survive. In other words, the records are rather like Swiss cheese. Welcome to genealogy in the south.

Which of John Francis Vannoy’s four sons does Elijah descend from?

Let’s see what we can discover.

Contact Matches and Ask for Help

The first thing I would do is to ask for assistance from your surname matches.

Let’s say that you match a known descendant of each of these four men, meaning each of John Francis Vannoy’s sons. Ask each person if they know where the male Vannoy descendants of each son went, and ask for any documentation they might have. If your ancestor, Elijah in this case, is not found in the same location as the sons, geography may be your friend.

In our case, we know that Francis Vannoy migrated to Knox County, Kentucky, but that was after he signed for his daughter’s marriage in Wilkes Co., NC in 1812. It was also about this time that Elijah Vannoy migrated to Claiborne County, TN, in the same direction, but not the same location. The two locations are an hour away by car today, separated by mountains and the Cumberland Gap, a nontrivial barrier.

We also know that Nathaniel Vannoy left a Bible that did not list Elijah as one of his children, but with a gap large enough to possibly encompass another child.  If you’re thinking to yourself, “Who would leave a child’s birth out of the Bible?,” I thought the same thing until I encountered it personally in another line.  However, the Bible record does make Nathaniel a less likely father candidate, despite a persistent rumor that Nathaniel was Elijah’s father.

Our only other clues are some tax records recording the number of children in the household of various ages, but none are conclusive. None of these men had wills.

Y DNA Genetic Distance

Your Y DNA matches will show how many mutations you are from them at a particular marker level.

The number of mutations between two men is called the genetic distance.

The rule of thumb is that the more mutations, the further back in time the common ancestor. The problem is, the rule of thumb doesn’t always work. DNA mutates when it darned well pleases, not on any clock that we can measure with that degree of accuracy – at least not accurately enough to tell which of 4 sons a man descends from – unless that line has incurred a defining mutation between the ancestor and the current generation. We call those line marker mutations. To determine the mutation history, you need multiple men from each line to have tested.
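
For readers who like to see the arithmetic, here is a minimal sketch of how a genetic distance could be counted. It assumes a simple step-wise model, where every one-repeat difference at a marker counts as one mutation. The vendor’s actual calculation may treat some markers differently, and the marker values below are made up purely for illustration.

```python
def genetic_distance(markers_a, markers_b):
    """Sum the repeat-count differences across the markers both men tested.

    Assumption: a simple step-wise model, one mutation per repeat difference.
    """
    shared = markers_a.keys() & markers_b.keys()
    return sum(abs(markers_a[m] - markers_b[m]) for m in shared)


# Hypothetical marker values, not taken from any real kit
cousin = {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS391": 11}
match_a = {"DYS393": 13, "DYS390": 25, "DYS19": 14, "DYS391": 11}
match_d = {"DYS393": 13, "DYS390": 26, "DYS19": 14, "DYS391": 10}

print(genetic_distance(cousin, match_a))  # 1
print(genetic_distance(cousin, match_d))  # 3
```

The count alone can’t tell you which son you descend from. It only becomes useful when several tested men from the same son’s line share a difference that men from the other lines lack.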

You can read more about Y DNA matching in the article, Concepts – Y DNA Matching and Connecting with your Paternal Ancestor.

Check Autosomal DNA Tests

Next, check to see if your Y DNA matches from all Vannoy lines have also taken the autosomal Family Finder test, noted as FF, which shows matches from all ancestral lines, not just the paternal line.

You can see in the match list above that not many have taken the Family Finder test. Ask if they would be willing to upgrade. Be prepared to pay if need be – because you are, after all, the one with the “problem” to solve.

Generally, I simply offer to pay. It’s well worth it to me, and given that paper records don’t exist to answer the question – a DNA test under $100 is cheap. Right now, Family Finder tests are on sale for $69 until the end of the month.

Check for Intermarriage

While you’re waiting for autosomal DNA results, check the pedigrees for all four lines involved to see if you are otherwise related to these men or their wives.

For example, in Andrew Vannoy’s wife’s line and Elijah Vannoy’s wife’s line, we have a common ancestor. George Shepherd and Elizabeth Mary Angelique Daye are common to both lines, and John Shepherd’s wife is unknown, so we have one known problem and one unknown surname.

You can tell already that this could be messy. We can’t really use Andrew Vannoy’s wife’s line to search for matches, because Elijah’s line is likely to match through Andrew’s wife regardless, since Susannah Shepherd and Lois McNiel share a common lineage. Rats!

We’ll mark these in red to remind ourselves.

Check Advanced Matching

Family Tree DNA provides a wonderful tool that allows you to compare matches of different kinds of DNA. The Advanced Matching tab is found under “Tools and Apps” under the myFTDNA tab at the upper left.

In this case, I’m going to use the Advanced Match feature to see which of my Vannoy cousin’s Y matches at 37 markers, within the Vannoy DNA project, also match him autosomally.

This report is particularly nice, because it shows the number of Y mutations, often indicating distance to a common ancestor, as well as the estimated autosomal relationship range.

You can see in this case that the first Vannoy male, “A,” is a close match both on Y DNA and autosomally, with 1 mutation difference and falling in the 2nd to 4th cousin range, as compared to the second Vannoy male, “D,” who is 3 mutations different and falls into the 4th to remote cousin range.

Not every Vannoy male may have joined the Vannoy project, so you’ll want to run this report a second time, replacing the Vannoy project search criteria with “The Entire Database.”

Unfortunately, not everyone that I need has taken the Family Finder test, so I’ll be contacting a few men, asking if I can sponsor their upgrades.

Let’s move on to our next tactic, using the wives’ surnames.

Search Utilizing the Wife’s Surname

We already know that we can’t rely on the Shepherd surname, so we’ll have to utilize the surnames of the other three wives:

  • Millicent Henderson – parents Thomas Henderson born circa 1730 Virginia, died 1806 Laurens, SC, wife Frances, surname unknown
  • Elizabeth Ray (Raye) – parents William Ray born circa 1725/1730 Herdford, England, died 1783 Wilkes Co., NC (the portion now Ashe Co.), wife Elizabeth Gordon born circa 1783 Amherst Co., VA and died 1804 Surry Co., NC
  • Sarah Hickerson – parents Charles Hickerson born circa 1725 Stafford Co., VA, died before 1793 Wilkes Co., NC, wife Mary Lytle

Utilizing the Family Finder match search function, I’m going to search for matches that include the wives’ surnames, but are NOT descended from the Vannoy line.

Hickerson produced no non-Vannoy matches utilizing the matches of my first Vannoy cousin, but Henderson is another matter entirely.

Since the Henderson line would be on my cousin’s father’s side, the matches that are most relevant are the ones phased to his paternal line, those showing the blue person icon.

The surname that you have entered as the search criteria will show as blue in the Ancestral Surname list, at far right, and other matching surnames will show as black. Please note that this includes surnames from ANY person in the match’s tree if they have uploaded a Gedcom file, not just surnames of direct ancestral lines. Therefore, if the match has a tree, it’s important to click on the pedigree icon and search for the surname in question. Don’t assume.

Altogether, there are 76 Henderson matches, of which 17 are phased to his paternal line. You’ll need to review each one of at least the 17. Personally, I would painstakingly review each one of the 76. You never know where a shred of information will be found.

Please note, finding a match with a common surname DOES NOT MEAN THAT YOU MATCH THIS PERSON THROUGH THAT SURNAME. Even finding a person with a common ancestor doesn’t mean that you both descend from that ancestor. You may have a second common ancestor. It means that you have more work to do, as proof, but it’s the beginning you need.

Of course, the first thing we need to do is eliminate any matches who also descend from a Vannoy, because there is no way to know if the matching DNA is through the Vannoy or Henderson lines. However, first, take note of how that person descends from the Vannoy line.
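
If you keep your match list in a spreadsheet or export, the elimination step can be sketched in a few lines. This is only an illustration under my own assumptions: the list-of-dictionaries layout, the function name and the sample matches are all hypothetical, not anything the vendor provides.

```python
def candidate_matches(matches, target, exclude):
    """Keep matches who list `target` but not `exclude` among their surnames."""
    target, exclude = target.lower(), exclude.lower()
    keep = []
    for match in matches:
        surnames = {s.lower() for s in match["surnames"]}
        if target in surnames and exclude not in surnames:
            keep.append(match["name"])
    return keep


# Hypothetical match list for illustration only
matches = [
    {"name": "Match 1", "surnames": ["Henderson", "Vannoy"]},
    {"name": "Match 2", "surnames": ["Henderson", "Smith"]},
    {"name": "Match 3", "surnames": ["Hickerson", "Lytle"]},
]

print(candidate_matches(matches, "Henderson", "Vannoy"))  # ['Match 2']
```

A filter like this only narrows the pile. Each remaining match still needs the tree review described above, because a shared surname is not proof of a shared ancestor.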

You can see your match’s entire surname list by clicking on their profile picture.

The surname, Ray, is more difficult, because the search for Ray also returns names like Bray and Wray, as well as Ray.
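
The reason is easy to see in a small sketch. I’m assuming here that the search behaves like a “contains” lookup, which is consistent with what it returns, though I don’t know how it is actually implemented. Comparing whole surnames against a short list of accepted spellings avoids the noise. The function names and sample surnames are hypothetical.

```python
def substring_search(surnames, query):
    """Return every surname containing the query anywhere (the noisy way)."""
    q = query.lower()
    return [s for s in surnames if q in s.lower()]


def variant_search(surnames, variants):
    """Return only surnames that equal one of the accepted spellings."""
    accepted = {v.lower() for v in variants}
    return [s for s in surnames if s.lower() in accepted]


# Hypothetical surnames pulled from match lists
names = ["Ray", "Bray", "Wray", "Raye", "Murray"]

print(substring_search(names, "Ray"))         # ['Ray', 'Bray', 'Wray', 'Raye', 'Murray']
print(variant_search(names, {"Ray", "Raye"}))  # ['Ray', 'Raye']
```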

But Wait – There’s a Happy Ending!

If you’re thinking, “this is a lot of work,” yes, it is.

Yes, you are absolutely going to do the genealogy of the wives’ lines so you can recognize if and how your matches might connect.

I enter the wives’ lines into my genealogy software and then I search for the ancestors found in my matches’ trees to see if they descend from that line.

One tip to make this easier is to test multiple people in the same line – regardless of whether they are males or carry the desired surname. They simply need to be descendants – that’s the beauty of autosomal DNA and why I carry kits with me wherever I go.  And yes, I’m really serious about that!

When you have multiple testers from the same line, you can utilize each test independently, searching for each surname in the Family Finder results.  Then, from the surname match list, select a sibling or other close relative with that same surname in their list, then choose the ICW feature. This allows you to see the people whom both testers match and who also carry the Henderson surname in their surname list.
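
Conceptually, an ICW comparison is a set intersection: the people on both testers’ match lists. Here is a minimal sketch of that idea combined with the surname filter. The match names, the surname lookup and the function name are all hypothetical, and this is my own illustration rather than how the vendor’s tool works internally.

```python
def icw_with_surname(matches_a, matches_b, surnames, target):
    """Return people on both match lists who list the target surname."""
    shared = matches_a & matches_b
    return sorted(m for m in shared if target in surnames.get(m, set()))


# Hypothetical match lists for two cousins from the same line
cousin_1 = {"Match 1", "Match 2", "Match 3"}
cousin_2 = {"Match 2", "Match 3", "Match 4"}

# Hypothetical surname lists for each match
surnames = {
    "Match 1": {"Henderson"},
    "Match 2": {"Henderson", "Smith"},
    "Match 3": {"Jones"},
    "Match 4": {"Henderson"},
}

print(icw_with_surname(cousin_1, cousin_2, surnames, "Henderson"))  # ['Match 2']
```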

Not successful with that initial cousin’s match results – like I wasn’t with Hickerson?

Rinse and repeat, with every single person who you can find who has descended from the line in question. I started the process over again with a second cousin and a Hickerson search.

About the time you’re getting really, really tired of looking at all of those trees, extending the branches of other people’s lines, and are about to give up and go to bed because it’s 3 AM and you’re discouraged, you see something like this:

Yep, it’s good old Charles Hickerson and Mary Lytle.  I could hardly believe my eyes!!! This Hickerson match to a cousin in my Vannoy line descends from Charles Hickerson’s son, Joshua.

All of a sudden…it’s all worthwhile! Your fatigue is gone, replaced by adrenalin and you couldn’t sleep now if your life depended on it!

Using the ICW (in common with) feature to find additional known cousins who match the person with Charles Hickerson and Mary Lytle in their tree, I found a total of three Vannoy cousins with significant matches.

Using the chromosome browser to compare, I’ve confirmed that one segment is a triangulated match of 12.69 cM (blue) on chromosome 2.

You can read more about triangulation in the article, Concepts – Why Genetic Genealogy and Triangulation? as well as the article, Concepts – Match Groups and Triangulation.
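
For the spreadsheet-minded, the core check behind a triangulated segment can be sketched as finding the stretch that all three pairwise matches share on the same chromosome. The positions below are invented for illustration and are not the actual segment from my results. Converting positions to cM requires a genetic map, which this sketch does not attempt.

```python
def common_overlap(segments):
    """Return (chromosome, start, end) shared by all segments, or None."""
    chromosomes = {chrom for chrom, _, _ in segments}
    if len(chromosomes) != 1:
        return None  # segments on different chromosomes cannot overlap
    start = max(s for _, s, _ in segments)
    end = min(e for _, _, e in segments)
    return (chromosomes.pop(), start, end) if start < end else None


# Hypothetical pairwise matching segments: (chromosome, start, end)
a_vs_b = (2, 10_000_000, 25_000_000)
a_vs_c = (2, 12_000_000, 27_000_000)
b_vs_c = (2, 11_000_000, 24_000_000)

print(common_overlap([a_vs_b, a_vs_c, b_vs_c]))  # (2, 12000000, 24000000)
```

All three pairs need to match each other on the overlapping stretch. Two people who each match you on the same segment, but not each other, may be matching you on opposite parental sides.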

Do I wish I had more than three people in my triangulation group? Yes, of course, but with a match of this size triangulated between cousins and a Hickerson descendant who is a 30 year genealogist, sporting a relatively complete tree and no other common lines, it’s a great place to begin digging deeper! This isn’t the end, but a new beginning!

After obsessively digging through the matches of every Elijah Vannoy descended cousin I can find (sleep is overrated anyway) and whose account I have access to, I have now discovered matches with four additional people who have no other common lines with the Vannoy cousins and who descend from Charles Hickerson and Mary Lytle through sons David and Joseph Hickerson. I can’t tell if they triangulate without access to their accounts, so I’ve sent e-mails requesting additional information.

WooHoo Happy Day!!! There’s a really big crack in the brick wall and I’ve just witnessed the sunrise of a beautiful, amazing day.

I think Elijah’s parents are…drum roll…Daniel Vannoy and Sarah Hickerson!

Which walls do you need to fall and how can you use this technique?
