Matchmaker, matchmaker, make me a match!
One of the questions I often receive about autosomal DNA is, “What, EXACTLY, is a match?” The answer at first glance seems evident, meaning when you and someone else are shown on each other’s match lists, but it really isn’t that simple.
What I’d like to discuss today is what actually constitutes a match – and the difference between legitimate or real matches and false matches, also called false positives.
Let’s look at a few definitions before we go any further.
- A Match – when you and another person are found on each other’s match lists at a testing vendor. You may match that person on one or more segments of DNA.
- Matching Segment – when a particular segment of DNA on a particular chromosome matches to another person. You may have multiple segment matches with someone, if they are closely related, or only one segment match if they are more distantly related.
- False Match – also known as a false positive match. This occurs when you match someone on a segment that is not identical by descent (IBD), but identical by chance (IBC). In other words, your DNA and theirs just happened to match because your mother’s and father’s DNA aligned in such a way that you match the other person, even though neither your mother nor your father matches that person on that segment.
- Legitimate Match – meaning a match that is a result of the DNA that you inherited from one of your parents. This is the opposite of a false positive match. Legitimate matches are identical by descent (IBD). Some IBD matches are considered to be identical by population (IBP) because they are a result of a particular DNA segment being present in a significant portion of a given population from which you and your match both descend. Ideally, legitimate matches are not IBP and are instead indicative of a more recent genealogical ancestor that can (potentially) be identified.
You can read about Identical by Descent and Identical by Chance here.
- Endogamy – an occurrence in which people intermarry repeatedly with others in a closed community, effectively passing the same DNA around and around in descendants without introducing different/new DNA from non-related individuals. People from endogamous communities, such as Jewish and Amish groups, will share more DNA and more small segments of DNA than people who are not from endogamous communities. Fully endogamous individuals have about three times as many autosomal matches as non-endogamous individuals.
- False Negative Match – a situation where someone who should match doesn’t. False negatives are very difficult to discern. We most often see them when a match is hovering at a match threshold and, by lowering the threshold slightly, the match is exposed. False negative segments can sometimes be detected when comparing DNA of close relatives and can be caused by read errors that break a segment in two, resulting in two segments that are too small to be reported individually as a match. False negatives can also be caused by population phasing, which strips out segments that are deemed to be “too matchy” by Ancestry’s Timber algorithm.
- Parental or Family Phasing – utilizing the DNA of your parents or other close family members to determine which side of the family a match derives from. Actual phasing means to determine which parts of your DNA come from which parent by comparing your DNA to at least one, if not both parents. The results of phasing are that we can identify matches to family groups such as the Phased Family Finder results at Family Tree DNA that designate matches as maternal or paternal based on phased results for you and family members, up to third cousins.
- Population Based Phasing – In another context, phasing can refer to academic phasing where some DNA that is population based is removed from an individual’s results before matching to others. Ancestry does this with their Timber program, effectively segmenting results and sometimes removing valid IBD segments. This is not the type of phasing that we will be referring to in this article and parental/family phasing should not be confused with population/academic phasing.
IBD and IBC Match Examples
It’s important to understand the definitions of Identical by Descent and Identical by Chance.
I’ve created some easy examples.
Let’s say that a match is defined as any 10 DNA locations in a row that match. To keep this comparison simple, I’m only showing 10 locations.
In the examples below, you are the first person, on the left, and your DNA strands are showing. You have a pink strand that you inherited from Mom and a blue strand inherited from Dad. Mom’s 10 locations are all filled with A and Dad’s locations are all filled with T. Unfortunately, Mother Nature doesn’t keep your Mom’s and Dad’s strands on one side or the other, so their DNA is mixed together in you. In other words, you can’t tell which parts of your DNA are whose. However, for our example, we’re keeping them separate because it’s easier to understand that way.
Legitimate Match – Identical by Descent from Mother
In the example above, Person B, your match, has all As. They will match you and your mother, both, meaning the match between you and person B is identical by descent. This means you match them because you inherited the matching DNA from your mother. The matching DNA is bordered in black.
Legitimate Match – Identical by Descent from Father
In this second example, Person C has all T’s and matches both you and your Dad, meaning the match is identical by descent from your father’s side.
You can clearly see that you can have two different people match you on the same exact segment location, but not match each other. Person B and Person C both match you on the same location, but they very clearly do not match each other because Person B carries your mother’s DNA and Person C carries your father’s DNA. These three people (you, Person B and Person C) do NOT triangulate, because B and C do not match each other. The article, “Concepts – Match Groups and Triangulation” provides more details on triangulation.
Triangulation is how we prove that individuals descend from a common ancestor.
If Person B and Person C both descended from your mother’s side and matched you, then they would both carry all As in those locations, and they would match you, your mother and each other. In this case, they would triangulate with you and your mother.
False Positive or Identical by Chance Match
This third example shows that Person D does technically match you, because they have all As and Ts, but they match you by zigzagging back and forth between your Mom’s and Dad’s DNA strands. Of course, there is no way for you to know this without matching Person D against both of your parents to see if they match either parent. If your match does not match either parent, the match is a false positive, meaning it is not a legitimate match. The match is identical by chance (IBC.)
One clue as to whether a match is IBC or IBD, even without your parents, is whether the person matches you and other close relatives on this same segment. If not, then the match may be IBC. If the match also matches close relatives on this segment, then the match is very likely IBD. Of course, the segment size matters too, which we’ll discuss momentarily.
If a person triangulates with 2 or more relatives who descend from the same ancestor, then the match is identical by descent, and not identical by chance.
False Negative Match
This last example shows a false negative. The DNA of Person E had a read error at location 5, meaning that there are not 10 locations in a row that match. This causes you and Person E to NOT be shown as a match, creating a false negative situation, because you actually would match if Person E hadn’t had the read error.
Of course, false negatives are by definition very hard to identify, because you can’t see them.
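For readers who like to see the logic spelled out, here is a minimal Python sketch of the four examples above. It assumes the toy rule from this section (10 locations in a row must match) and treats each person as a single string of 10 letters. The function name and the exact letter patterns for Person D and Person E are illustrative assumptions, not anything a testing vendor actually runs.

```python
# Toy model of the 10-location example. Strand contents follow the article;
# the function name and the sample strings for Persons D and E are illustrative.
MOM_STRAND = "A" * 10   # the pink strand inherited from Mom
DAD_STRAND = "T" * 10   # the blue strand inherited from Dad


def classify_match(person, mom=MOM_STRAND, dad=DAD_STRAND):
    """Classify a 10-location comparison as IBD, IBC, or not reported."""
    if person == mom:
        return "IBD from mother"
    if person == dad:
        return "IBD from father"
    # Your two strands are mixed together in your results, so a match only
    # needs to agree with EITHER strand at each location.
    if all(p in (m, d) for p, m, d in zip(person, mom, dad)):
        return "IBC (false positive)"
    return "not reported as a match"


people = {
    "Person B": "AAAAAAAAAA",
    "Person C": "TTTTTTTTTT",
    "Person D": "ATATTAATTA",   # zigzags between Mom's and Dad's strands
    "Person E": "AAAAGAAAAA",   # read error at location 5
}
for name, strand in people.items():
    print(name, "->", classify_match(strand))
```

Person E comes back as "not reported as a match" even though the underlying DNA is Mom's, which is exactly the false negative situation described above.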
Comparisons to Your Parents
Legitimate matches will phase to your parents – meaning that you will match Person B on the same amount of a specific segment, or a smaller portion of that segment, as one of your parents.
False matches mean that you match the person, but neither of your parents matches that person, meaning that the segment in question is identical by chance, not by descent.
Comparing your matches to both of your parents is the easiest litmus paper test of whether your matches are legitimate or not. Of course, the caveat is that you must have both of your parents available to fully phase your results.
Many of us don’t have both parents available to test, so let’s take a look at how often false positive matches really do occur.
False Positive Matches
How often do false matches really happen?
The answer to that question depends on the size of the segments you are comparing.
Very small segments, say at 1cM, are very likely to match randomly, because they are so small. You can read more about SNPs and centiMorgans (cM) here.
As a rule of thumb, the larger the matching segment as measured in cM, with more SNPs in that segment:
- The stronger the match is considered to be
- The more likely the match is to be IBD and not IBC
- The closer in time the common ancestor, facilitating the identification of said ancestor
Just in case we forget sometimes, identifying ancestors IS the purpose of genetic genealogy, although it seems like we sometimes get all geeked out by the science itself and process of matching! (I can hear you thinking, “speak for yourself, Roberta.”)
It’s Just a Phase!!!
Let’s look at an example of phasing a child’s matches against those of their parents.
In our example, we have a non-endogamous female child (so they inherit an X chromosome from both parents) whose matches are being compared to her parents.
I’m utilizing files from Family Tree DNA. Ancestry does not provide segment data, so Ancestry files can’t be used. At 23andMe, coordinating the security surrounding three individuals’ results and trying to make sure that the child and both parents all have access to the same individuals through sharing would be a nightmare, so the only vendor whose results you can reasonably utilize for phasing is Family Tree DNA.
You can download the matches for each person by chromosome segment by selecting the chromosome browser and the “Download All Matches to Excel (CSV Format)” at the top right above chromosome 1.
All segment matches 1cM and above will be downloaded into a CSV file, which I then save as an Excel spreadsheet.
I downloaded the files for both parents and the child. I deleted segments below 3cM.
About 75% of the rows in the files were segments below 3cM. In part, I deleted these segments due to the sheer size and the fact that the segment matching was a manual process. In part, I did this because I already knew that segments below 3 cM weren’t terribly useful.
|< 3 cM removed||20,461||15,025||17,784|
Because I have the ability to phase these matches against both parents, I wanted to see how many of the matches in each category were indeed legitimate matches and how many were false positives, meaning identical by chance.
How does one go about doing that, exactly?
Downloading the Files
Let’s talk about how to make this process easy, at least as easy as possible.
Step one is downloading the chromosome browser matches for all 3 individuals, the child and both parents.
First, I downloaded the child’s chromosome browser match file and opened the spreadsheet.
Second, I downloaded the mother’s file, colored all of her rows pink, then appended the mother’s rows into the child’s spreadsheet.
Third, I did the same with the father’s file, coloring his rows blue.
After I had all three files in one spreadsheet, I sorted the columns by segment size and removed the segments below 3cM.
Next, I sorted the remaining items on the spreadsheet, in order, by column, as follows:
My resulting spreadsheet looked like this. Sorting in the order prescribed provides you with the matches to each person in chromosome and segment order, facilitating easy (OK, relatively easy) visual comparison for matching segments.
I then colored all of the child’s NON-matching segments green so that I could see (and eventually filter the matchname column by) the green color indicating that they were NOT matches. Do this only for the child, or the white (non-colored) rows. The child’s matchname only gets colored green if there is no corresponding match to a parent for that same person on that same chromosome segment.
All of the child’s matches that DON’T have a corresponding parent match in pink or blue for that same person on that same segment will be colored green. I’ve boxed the matches so you can see that they do match, and that they aren’t colored green.
In the above example, Donald and Gaff don’t match either parent, so they are all green. Mess does match the father on some segments, so those segments are boxed, but the rest of Mess doesn’t match a parent, so is colored green. Sarah doesn’t match any parent, so she is entirely green.
Yes, you do manually have to go through every row on this combined spreadsheet.
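If you are comfortable with a little scripting, the same row-by-row comparison can be approximated in code. The sketch below is only one possible approach, not the process I used: it keeps segments of 3 cM and above, then marks each of the child’s rows as phased when a parent has an overlapping segment with the same person on the same chromosome. The column names in `load_rows` are assumptions about the download format, so check them against the header row of your own file. The sample rows reuse names from the example above, but their coordinates and cM values are made up.

```python
import csv


def load_rows(path):
    """Read one chromosome browser CSV into a list of dicts.

    The column names below are assumptions; check the header row of your file.
    """
    with open(path, newline="", encoding="utf-8-sig") as handle:
        return [
            {
                "matchname": row["MATCHNAME"].strip(),
                "chromosome": row["CHROMOSOME"].strip(),
                "start": int(row["START LOCATION"]),
                "end": int(row["END LOCATION"]),
                "cm": float(row["CENTIMORGANS"]),
            }
            for row in csv.DictReader(handle)
        ]


def flag_child_rows(child_rows, parent_rows, min_cm=3.0):
    """Return the child's rows >= min_cm, each tagged phased True/False."""
    # Index the parents' segments by (match name, chromosome).
    parent_segments = {}
    for row in parent_rows:
        if row["cm"] >= min_cm:
            key = (row["matchname"], row["chromosome"])
            parent_segments.setdefault(key, []).append((row["start"], row["end"]))

    flagged = []
    for row in child_rows:
        if row["cm"] < min_cm:
            continue  # same cutoff as the manual process
        candidates = parent_segments.get((row["matchname"], row["chromosome"]), [])
        # Phased = some parent segment overlaps the child's segment at all.
        phased = any(min(row["end"], end) > max(row["start"], start)
                     for start, end in candidates)
        flagged.append({**row, "phased": phased})
    return flagged


# Hypothetical rows for illustration only.
child = [
    {"matchname": "Mess", "chromosome": "11", "start": 1000, "end": 5000, "cm": 8.2},
    {"matchname": "Mess", "chromosome": "3", "start": 2000, "end": 4000, "cm": 3.5},
    {"matchname": "Sarah", "chromosome": "7", "start": 100, "end": 900, "cm": 4.1},
    {"matchname": "Sarah", "chromosome": "9", "start": 100, "end": 300, "cm": 1.2},
]
father = [
    {"matchname": "Mess", "chromosome": "11", "start": 1000, "end": 5000, "cm": 8.2},
]
for row in flag_child_rows(child, father):
    print(row["matchname"], row["chromosome"], "phased" if row["phased"] else "green")
```

Rows that come back unphased correspond to the rows colored green in the spreadsheet. A script like this treats any overlap as a match, so borderline cases still deserve a look by eye.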
If you’re going to phase your matches against your parent or parents, you’ll want to know what to expect. Just because you’ve seen one match does not mean you’ve seen them all.
What is a Match?
So, finally, the answer to the original question, “What is a Match?” Yes, I know this was the long way around the block.
In the exercise above, we weren’t evaluating matches, we were just determining whether or not the child’s match also matched the parent on the same segment, but sometimes it’s not clear whether they do or do not match.
In the case of the second match with Mess on chromosome 11, above, the starting and ending locations, and the number of cM and segments are exactly the same, so it’s easy to determine that Mess matches both the child and the father on chromosome 11. Not all matches are so straightforward.
This looks like your typical match for one person, in this case, Cecelia. The child (white rows) matches Cecelia on three segments that don’t also match the child’s mother (pink rows.) Those non-matching child’s rows are colored green in the match column. The child matches Cecelia on two segments that also match the mother, on chromosome 20 and the X chromosome. Those matching segments are boxed in black.
The segments in both of these matches have exact overlaps, meaning they start and end in exactly the same location, but that’s not always the case.
And for the record, matches that begin and/or end in the same location are NOT more likely to be legitimate matches than those that start and end in different locations. Vendors use small buckets for matching, and if you fall into any part of the bucket, even if your match doesn’t entirely fill the bucket, the bucket is considered occupied. So what you’re seeing are the “fuzzy” bucket boundaries.
In this case, Chad’s match overhangs on each end. You can see that Chad’s match to the child begins at 52,722,923 before the mother’s match at 53,176,407.
At the end location, the child’s matching segment also extends beyond the mother’s, meaning the child matches Chad on a longer segment than the mother. This means that the segment sections before 53,176,407 and after 61,495,890 are false positive matches, because Chad does not also match the child’s mother on these portions of the segment.
This segment still counts as a match though, because on the majority of the segment, Chad does match both the child and the mother.
This example shows a nested match, where the parent’s match to Randy begins before the child’s and ends after the child’s, meaning that the child’s matching DNA segment to Randy is entirely nested within the mother’s. In other words, pieces got shaved off of both ends of this segment when the child was inheriting from her mother.
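The exact, overhanging and nested cases can be told apart from nothing more than the start and end locations. Here is a small Python sketch of that comparison. The category labels and the function name are my own shorthand for this illustration. The mother’s locations for Chad come from the example above, but the child’s end location of 62,000,000 is a made-up placeholder, because only the child’s start location is quoted in the text.

```python
def compare_to_parent(child, parent):
    """Describe how a child's segment (start, end) sits against a parent's.

    Returns a label plus the fraction of the child's segment, measured in
    base pair positions, that the parent's segment also covers.
    """
    c_start, c_end = child
    p_start, p_end = parent
    overlap = min(c_end, p_end) - max(c_start, p_start)
    if overlap <= 0:
        return "no overlap", 0.0
    share = overlap / (c_end - c_start)
    if (c_start, c_end) == (p_start, p_end):
        kind = "exact"
    elif p_start <= c_start and c_end <= p_end:
        kind = "nested"      # child's segment sits inside the parent's
    elif c_start <= p_start and p_end <= c_end:
        kind = "overhang"    # child's segment extends past the parent's
    else:
        kind = "partial"     # the segments overlap on one end only
    return kind, share


# Chad: mother's locations are from the article; the child's end is hypothetical.
kind, share = compare_to_parent((52_722_923, 62_000_000), (53_176_407, 61_495_890))
print(kind, round(share, 2))
```

The share is measured in chromosome positions, not cM, so treat it as a rough guide to how much of the child’s segment is backed up by the parent.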
No Common Matches
Sometimes, the child and the parent will both match the same person, but there are no common segments. Don’t read more into this than what it is. The child’s matches to Mary are false matches. We have no way to judge the mother’s matches, except for segment size probability, which we’ll discuss shortly.
Look Ma, No Parents
In this case, the child matches Don on 5 segments, including a reasonably large segment on chromosome 9, but there are no matches between Don and either parent. I went back and looked at this to be sure I hadn’t missed something.
This could, possibly, be an instance of an unseen false negative, meaning perhaps there is a read issue in the parent’s file on chromosome 9, precluding a match. However, in this case, since Family Tree DNA does report matches down to 1cM, it would have to be an awfully large read error for that to occur. Family Tree DNA does have quality control standards in place and each file must pass the quality threshold to be put into the matching database. So, in this case, I doubt that the problem is a false negative.
Just because there are multiple IBC matches to Don doesn’t mean any of those are incorrect. It’s just the way that the DNA is inherited and it’s why this type of a match is called identical by chance – the key word being chance.
This split match is very interesting. If you look closely, you’ll notice that Diane matches Mom on the entire segment on chromosome 12, but the child’s match is broken into two. However, the number of SNPs adds up to the same, and the number of cM is close. This suggests that there is a read error in the child’s file forcing the child’s match to Diane into two pieces.
If the segments broken apart were smaller, under the match threshold, and there were no other higher matches on other segments, this match would not be shown and would fall into the False Negative category. However, since that’s not the case, it’s a legitimate match and just falls into the “interesting” category.
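One way to spot a split like Diane’s is to stitch together segments for the same person on the same chromosome when the gap between them is tiny. The Python sketch below does just that. The gap size is a hypothetical setting chosen for the illustration, not a threshold any vendor publishes, and the locations in the example are made up.

```python
def merge_split_segments(segments, max_gap=1_000_000):
    """Join (start, end) segments separated by no more than max_gap positions.

    max_gap is an illustrative setting, not a vendor threshold.
    """
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous piece: treat it as one segment.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical child segments broken in two by a possible read error.
print(merge_split_segments([(10_000_000, 14_000_000), (14_050_000, 19_000_000)]))
```

If the stitched segment lines up with a single segment in the parent’s file, a read error in the child’s file becomes the most likely explanation.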
The Deceptive Match
Don’t be fooled by seeing a family name in the match column and deciding it’s a legitimate match. Harrold is a family surname and Mr. Harrold does not match either of the child’s parents, on any segment. So not a legitimate match, no matter how much you want it to be!
Suspicious Match – Probably not Real
This technically is a match, because part of the DNA that Daryl matches between Mom and the child does overlap, from 111,236,840 to 113,275,838. However, if you look at the entire match, you’ll notice that not a lot of that segment overlaps, and the number of cMs is already low in the child’s match. There is no way to calculate the number of cMs and SNPs in the overlapping part of the segment, but suffice it to say that it’s smaller, and probably substantially smaller, than the 3.32 cM total match for the child.
It’s up to you whether you actually count this as a match or not. I just hope this isn’t one of those matches you REALLY need. However, in this case, the Mom’s match at 15.46 cM is 99% likely to be a legitimate match, so you really don’t need the child’s match at all!!!
So, Judge Judy, What’s the Verdict?
How did our parental phasing turn out? What did we learn? How many segments matched both the child and a parent, and how many were false matches?
In each cM Size category below, I’ve included the total number of child’s match rows found in that category, the number of parent/child matches, the percent of parent/child matches, the number of matches to the child that did NOT match the parent, and the percent of non-matches. A non-match means a false match.
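The tally itself is simple arithmetic once each of the child’s rows has been marked as matching a parent or not. Here is a minimal Python sketch of that tally, grouping rows into whole-number cM brackets (3-3.99, 4-4.99 and so on). The sample rows are invented purely to show the mechanics and are not the results of this study.

```python
import math
from collections import defaultdict


def bracket_summary(rows):
    """Summarize (cm, phased) pairs by whole-number cM bracket."""
    counts = defaultdict(lambda: [0, 0])   # bracket -> [total rows, phased rows]
    for cm, phased in rows:
        bracket = math.floor(cm)           # 7.5 cM falls in the 7-7.99 bracket
        counts[bracket][0] += 1
        counts[bracket][1] += int(phased)

    summary = {}
    for bracket in sorted(counts):
        total, matched = counts[bracket]
        unmatched = total - matched
        summary[bracket] = {
            "total": total,
            "matched": matched,
            "pct_matched": round(100 * matched / total, 1),
            "unmatched": unmatched,
            "pct_unmatched": round(100 * unmatched / total, 1),
        }
    return summary


# Invented rows: (segment size in cM, did it match a parent?)
sample = [(3.2, False), (3.9, True), (7.1, False), (7.5, True), (10.5, True)]
for bracket, stats in bracket_summary(sample).items():
    print(f"{bracket}-{bracket}.99 cM", stats)
```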
So, what’s the verdict?
It’s interesting to note that we just approach the 50% mark for phased matches in the 7-7.99 cM bracket.
The bracket just beneath that, 6-6.99 shows only a 30% parent/child match rate, as does 5-5.99. At 3 cM and 4 cM few matches phase to the parents, but some do, and could potentially be useful in groups of people descended from a known common ancestor and in conjunction with larger matches on other segments. Certainly segments at 3 cM and 4 cM alone aren’t very reliable or useful, but that doesn’t mean they couldn’t potentially be used in other contexts, nor are they always wrong. The smaller the segment, the less confidence we can have based on that segment alone, at least below 9-15cM.
Above the 50% match level, we quickly reach a match rate in the 90% range in the 9-9.99 cM bracket, and above 10 cM, we’re virtually assured of a phased match, but not quite 100% of the time.
It isn’t until we reach the 16cM category that we actually reach the 100% bracket, and there is still an outlier found in the 18-18.99 cM group.
I went back and checked all of the 10 cM and over non-matches to verify that I had not made an error. If I made errors, they were likely counting too many as NON-matches, and not the reverse, meaning I failed to visually identify matches. However, with almost 6000 spreadsheet rows for the child, a few errors wouldn’t affect the totals significantly or even noticeably.
I hope that other people in non-endogamous populations will do the same type of double parent phasing and report on their results in the same type of format. This experiment took about 2 days.
Furthermore, I would love to see this same type of experiment for endogamous families as well.
If you can phase your matches to either or both of your parents, absolutely, do. This exercise shows why, if you have only one parent to match against, you can’t just assume that anyone who doesn’t match you on your one parent’s side automatically matches you from the other parent. At least, not below about 15 cM.
Whether you can phase against your parent or not, this exercise should help you analyze your segment matches with an eye towards determining whether or not they are valid, and what different kinds of matches mean to your genealogy.
If nothing else, at least we can quantify the relative likelihood, based on the size of the matching segment, that a match in a non-endogamous population would also match a parent, if we had one to match against, meaning that it is a legitimate match. Did you get all that?
In a nutshell, we can look at the Parent/Child Phased Match Chart produced by this exercise and say that our 8.5 cM match has about a 66% chance of being a legitimate match, and our 10.5 cM match has a 95% chance of being a legitimate match.