Recently, Blaine Bettinger published V4 of the Shared cM Project, and along with that, Jonny Perl at DNAPainter updated the associated interactive tool as well, including histograms. I wrote about that, here.
The goal of the Shared cM Project was and remains to document how much DNA can be expected to be shared by various individuals at specific relationship levels. This information allows matches to at least minimally “position” themselves in a general location in their trees or, conversely, to eliminate specific potential relationships.
Shared cM Project match data is gathered by testers submitting their match information through the submission portal, here.
I’ve done the same thing this year, adding the new data to the previous release’s table.
Compiled Comparison Table
I initially compiled this table for myself, then decided to update it and share with my readers. This chart allows me to view various perspectives on shared data and relationships and in essence has all the data I might need, including multiple versions, in one place. Feel free to copy and save the table.
In the comparison table below, the relationship rows with data from various sources are shown as follows:
- White – Shared cM Project 2016
- Peach – Shared cM Project 2017
- Purple – Shared cM Project 2020
- Green – DNA Detectives chart
I don’t know if DNA Detectives still uses the “green chart” or if they have moved to the interactive DNAPainter tool. I’ve retained the numbers for historical reference regardless.
Additionally, in some places, you’ll see references to the “degree of relationship,” as in “third degree relatives always match each other.” I’ve included a “Degree of Relationship” column to the far right, but I don’t come across those “relationship degree” references often anymore either. However, it’s here for reference if you need it.
23andMe still gives relationships in percentages, so I’ve included the expected shared percent of DNA for each relationship and the actual shared range from the DNA Detectives Green Chart.
One column shows the expected shared cM amount, assuming that 50% of the DNA from each ancestor is passed on in each generation. Clearly, we know that inheritance doesn’t happen that cleanly because recombination is a random event and children do NOT inherit exactly half of each ancestor’s DNA carried by their parents, but the average should be someplace close to this number.
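As a rough sketch of how that expected column can be derived for direct-line ancestors, the snippet below halves the shared amount for each generation. The 6800 cM total is an assumption, taken as twice the roughly 3400 cM parent/child figure used elsewhere in this article, and the function names are only illustrative.

```python
# Assumption: total is twice the ~3400 cM parent/child figure in this article.
TOTAL_CM = 6800


def expected_fraction(generations: int) -> float:
    """Expected share of DNA from a direct ancestor (parent = 1 generation)."""
    if generations < 1:
        raise ValueError("generations must be 1 or greater")
    return 0.5 ** generations


def expected_cm(generations: int) -> float:
    """Expected shared cM, assuming exactly 50% is passed on each generation."""
    return TOTAL_CM * expected_fraction(generations)


for label, gens in [("parent", 1), ("grandparent", 2), ("great-grandparent", 3)]:
    print(f"{label}: {expected_fraction(gens):.1%} = {expected_cm(gens):.0f} cM")
```

These are averages only. Real results will vary around them because recombination is random.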
The first thing I noticed about V4 is that there is a LOT more data which means that the results are likely more accurate. V4 increased by 32K data points, or 147%. Bravo to everyone who participated, to Blaine for the analysis and to Jonny for automating the results at DNAPainter.
Blaine provided his white paper, here, which includes “everything you need to know” about the project, and I strongly encourage you to read it. Not only does this document explain the process and methods, it’s educational in its own right.
On the first page, Blaine discusses issues. Any time you are crowd sourcing information, you’re going to encounter challenges and errors. Blaine did remove any entries that were clearly problematic, plus an additional 1% of all entries for each category – .5% from each end meaning the largest and smallest entries. This was done in an attempt to remove the results most likely to be erroneous.
Known issues include:
- Data entry errors – I refer to these as “clerical mutations,” but they happen and there is no way, unless the error is egregious, to know what is a typo and what is real. Obviously, a parent sharing only a 10 cM segment with a child is not possible, but other data entry errors are well within the realm of possible.
- Incorrect relationships – Misreported or misunderstood relationships will skew the numbers. Relationships may be believed to be one type, but are actually something else. For example, a half vs full sibling, or a half vs full aunt or uncle.
- Misunderstood Relationships – People sometimes become confused about the difference between “half” and “removed.” I wrote a helpful article titled Quick Tip – Calculating Cousin Relationships Easily.
- Endogamy – Endogamy occurs when a population intermarries within itself, meaning that the same ancestral DNA is present in many members of the community. The genetic result is that you may share more DNA with those cousins than you would otherwise share with cousins at the same distance without endogamy.
- Pedigree Collapse – Pedigree collapse occurs when you find the same ancestors multiple times in your tree. The closer to current those ancestors appear, the more DNA you will potentially carry from those repeat ancestors. The difference between endogamy and pedigree collapse is that endogamy is a community event and pedigree collapse has only to do with your own tree. You might just have both, too.
- Company Reporting Differences – Different companies report DNA in different ways in addition to having different matching thresholds. For example, Family Tree DNA includes in your match total all DNA to 1 cM that you share with a match over the matching threshold. Conversely, Ancestry has a lower matching threshold, but often strips out some matching DNA using Timber. 23andMe counts fully identical segments twice and reports the X chromosome in their totals. MyHeritage does not report the X chromosome. There is no “right” or “wrong,” or standardization, simply different approaches. Hopefully, the variances will be removed or smoothed in the averages.
- Distant Cousin Relationships – While this isn’t really an issue, per se, it’s important to understand what is being reported beyond 2nd cousin relationships: the only data used to calculate these averages comes from people who DO share DNA with their more distant cousins. In other words, if you do NOT match your 3rd cousin, then your “0” shared DNA is not included in the average. Only those who do match have their matching amounts included. This means that the average is only the average of people who match, not the average of all 3rd cousins.
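To see why the last point matters, here is a small sketch comparing the two kinds of average. The ten shared cM values are invented purely for illustration and are not taken from the Shared cM Project.

```python
from statistics import mean

# Hypothetical shared cM for ten 3rd-cousin pairs; 0 means the pair did not match.
shared_cm = [0, 0, 35, 48, 60, 74, 81, 95, 110, 127]

# Only pairs who match are counted, as described above for distant cousins.
matches_only = [cm for cm in shared_cm if cm > 0]

avg_of_matches = mean(matches_only)  # what gets reported
avg_of_all_pairs = mean(shared_cm)   # what it would be with the zeroes included

print(f"Average of matching pairs: {avg_of_matches} cM")
print(f"Average of all pairs:      {avg_of_all_pairs} cM")
```

With these made-up numbers, the average of matching pairs is higher than the average of all pairs, which is the direction of the effect described above.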
Challenges aside, the Shared cM Project provides genealogists with a wonderful opportunity to use the combined data of tens of thousands of relationships to estimate and better understand the relationship range of our matches.
When analyzing the data, one of the first things I noticed was a very unusual entry for parent/child relationships.
We all know that children each inherit exactly half of their parent’s DNA. We expect to find an amount in the ballpark of 3400, give or take a bit for normal variances like read errors or reporting differences.
I did not expect to see a minimum shared cM amount for a child/parent relationship at 2376, fully 1024 cM below the expected value of 3400 cM. Put bluntly, that’s simply not possible. You cannot live without almost one third of one parent’s DNA. If this data is actually accurate from someone’s account, please contact me because I want to actually see this phenomenon.
I reached out to Blaine, knowing this result is not actually possible, wondering how this would ever get through the quality control cycle at any vendor.
After some discussion, here’s Blaine’s reply:
If you look at the histogram, you’ll see that those are most likely outliers. One of my lessons for the ScP (Shared cM Project) lately is that people shouldn’t be using the data without the histograms.
People get frustrated with this, but I can’t edit data without a basis even if I think it doesn’t make sense. I have to let the data itself decide what data to remove. So I removed 1% from each relationship, the lowest 0.5% and the highest 0.5%. I could have removed more, but based on the histograms, [removing] more appeared to be removing too much valid data. As people submit more parent/child relationships these outliers/incorrect submissions will be removed. But thankfully using the histograms makes it clear.
Indeed, if you look on page 23 on Blaine’s white paper, you’ll see the following histogram of parent/child relationships submitted.
Keep in mind that Blaine already removed any obvious errors, plus 1% of the total, 0.5% from each end of the spectrum. In this case, he utilized 2412 submissions, so he would have removed about 24 entries that were even further out on the data spectrum.
On the chart above, we can see that a total of about 14 are still really questionable. It’s not until we get to 3300 that these entries seem feasible. My speculation is that these people meant to type 3400 instead of 2400, and so forth.
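The trimming step itself is simple to sketch. This version assumes the cut is a rounded 0.5% of entries at each end. The white paper may handle rounding differently, and the function name and placeholder data are hypothetical.

```python
def trim_extremes(values, fraction_per_end=0.005):
    """Drop the lowest and highest `fraction_per_end` of the sorted values."""
    ordered = sorted(values)
    # Assumption: the number cut from each end is rounded to a whole entry.
    k = round(len(ordered) * fraction_per_end)
    return ordered[k:len(ordered) - k]


# Placeholder data standing in for 2412 parent/child submissions.
submissions = list(range(2412))
trimmed = trim_extremes(submissions)

print(f"Removed {len(submissions) - len(trimmed)} of {len(submissions)} entries")
```

Because the cut is a fixed percentage, it removes the most extreme entries whether or not they are actually errors, and it leaves behind any errors that fall just inside the cutoff. That is why the histograms still matter.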
The great news is that Jonny Perl at DNAPainter included the histograms so you can judge for yourself if you are in the weeds on the outlier scale by clicking on the relationship.
Other relationships, like this niece/nephew relationship fit the expected bell shaped curve very nicely.
Of course, this means that if you match your niece or nephew at 900 cM instead of the range shown above, that person is probably not your full niece or nephew – a revelation that may be difficult because of the implications for you, your parent and sibling. This would suggest that your sibling is a half sibling, not a full sibling.
Entering specific amounts of shared DNA and outputting probabilities of specific relationships is where the power of DNAPainter enters the picture. Let’s enter 900 cM and see what happens.
That 900 cM match is likely your half niece or nephew. Of course, this example illustrates perfectly why some relationships are entered incorrectly – especially if you don’t know that your niece or nephew is a half niece or nephew – because your sibling is a half-sibling instead of a full sibling. Some people, even after receiving results don’t realize there is a discrepancy, either because their data is on the boundary, with various relationships being possible, or because they don’t understand or internalize the genetic message.
This phenomenon probably explains the low minimum value for full siblings, because many of those full siblings aren’t. Let’s enter 1613 and see what DNAPainter says.
You’ll notice that DNAPainter shows the 1613 cM relationship as a half-sibling.
And the histogram indeed shows that 1613 would be the outlier. Being larger than 1600, it would appear in the 1700 category.
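If you want to work out which histogram category a value lands in, the sketch below assumes the categories are 100 cM wide and labeled by their upper edge. That is inferred from the 1613 example and may not match exactly how the published histograms are built.

```python
import math


def histogram_category(cm: float, width: int = 100) -> int:
    """Return the upper-edge label of the bin containing `cm` (assumed scheme)."""
    return math.ceil(cm / width) * width


print(histogram_category(1613))  # falls in the 1700 category
```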
Accurately discerning close relationships is often incredibly important to testers. In the histogram chart above, you can see that the blue and orange histograms plotted on the same chart show that there is only a very small amount of overlap between the two histograms. This suggests that some people, those in the overlap range, who believe they are full siblings are in reality half-siblings, and possibly, a few in the reverse situation as well.
What Else is Noteworthy?
First, some relationships cannot be differentiated or sorted out by using the cM data or histogram charts alone.
For example, you cannot tell the difference between half-siblings and an aunt/uncle relationship. In order to make that determination, you would need to either test or compare to additional people or use other clues such as genealogical research or geographic proximity.
Second, the ranges of many relationships are wider than they were before. Often, we see the lows being lower and the highs being higher as a result of more data.
For example, take a look at grandparents. The expected relationship is 1700 cM, the average is 1754 which is very close to the previous average numbers of 1765 and 1766. However, the minimum is now 984 and the new maximum is 2462.
Why might this be? Are ranges actually wider?
Blaine removed 1% each time, which means that in V3, 6 results would have been removed, 3 from each end, while 11 would be removed in V4. More data means that we are likely to see more outliers as entries increase, with the relationship ranges increasingly likely to overlap on the minimum and maximum ends.
Third, it’s worth noting that several relationships share an expected amount of DNA that is equal, 12.5% which equals 850 cM, in this example.
These four relationships appear to be exactly the same, genetically. The only way to tell which one of these relationships is accurate for a given match pair, aside from age (sometimes) and opportunity, is to look at another known relationship. For example, how closely might the tester be related to a parent, sibling, aunt, uncle or first cousin, or one of their other matches. Occasionally, an X chromosome match will be enlightening as well, given the unique inheritance path of the X chromosome.
Fourth, it’s been believed for several years that all relatives of the 5th degree and closer match, and the V4 data confirms that.
There are no zeroes in the column for minimum DNA shared, 4th column from right.
5th degree relatives include:
- 2nd cousins
- 1st cousins twice removed
- Half first cousins once removed
- Half great-aunt/uncle
Fifth, some of your more distant cousins won’t match you, beginning with 6th degree relationships.
At the 6th degree level, the following relationships may share no DNA above the vendor matching threshold:
- First cousins three times removed
- Half first cousins twice removed
- Half second cousins
- Second cousins once removed
You’ll notice that the various reporting models and versions don’t always agree, with earlier versions of the Shared cM Project showing zeroes in the minimum amount of DNA shared.
Sixth, at the 7th degree level, some number of people in every relationship class don’t share DNA, as indicated by the zeros in the Shared cM Minimum column.
The more generations back in time that you move, the fewer cousins can be expected to match.
This chart from the ISOGG Wiki Cousin statistics page shows the probability of matching a cousin at a specific level based on information provided by testing companies.
Quick Reference Chart Summary
In summary, V4 of the Shared cM Project confirms that all 2nd cousins can expect to match, but beyond that in your trees, cousins may or may not match. I suspect, without evidence, that the further back in time that people are related, the less likely that the proper “cousinship level” is reported. For example, it would be easier to confuse 7th and 8th cousins as compared to 1st and 2nd cousins. Some people also confuse 8th cousins with 8 generations back in your tree. It’s not equivalent.
It’s interesting to note that Degree 17 relatives, 8th cousins, 9 generations removed from each other (counting your parents as generation 1), still match in some cases. Note that some companies and people count you as generation 1, while others count your parents as generation 1.
The estimate that autosomal matching reaches 5 or 6 generations back in time, meaning descendants of common 4 times great-grandparents will sometimes match, is accurate as far as it goes, although 5-6 generations is certainly not a line in the sand.
It would be more accurate to state that:
- 2nd cousins, people descended from common great-grandparents, 3 generations back in time will always match
- 4th cousins, people descended from common 3 times great grandparents, 5 generations back in time, will match about half of the time
- 8th cousins, people descended from 7 times great grandparents, 9 generations back in time still match a small percentage of the time
- Cousins from more distant ancestors can possibly match, but it’s unlikely and may result from a more recent unknown ancestor
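The cousin levels in the list above follow a simple pattern, sketched here for full cousins with no removes. The function name is hypothetical, and the degree formula is inferred from the examples in this article, where 2nd cousins are 5th degree and 8th cousins are Degree 17.

```python
def common_ancestors(cousin_level: int) -> tuple[int, str, int]:
    """Return (generations back, ancestor label, degree) for full Nth cousins."""
    if cousin_level < 1:
        raise ValueError("cousin_level must be 1 or greater")
    # Counting your parents as generation 1.
    generations_back = cousin_level + 1
    greats = cousin_level - 1
    if greats == 0:
        label = "grandparents"
    elif greats == 1:
        label = "great-grandparents"
    else:
        label = f"{greats} times great-grandparents"
    # Inferred pattern: 2nd cousins -> 5th degree, 8th cousins -> 17th degree.
    degree = 2 * cousin_level + 1
    return generations_back, label, degree


for n in (2, 4, 8):
    gens, label, degree = common_ancestors(n)
    print(f"Cousin level {n}: common {label}, {gens} generations back, degree {degree}")
```

This also shows why 8th cousins are not the same thing as 8 generations back in your tree: the common ancestors of 8th cousins are 9 generations back.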
I created this summary chart, combining information from the ISOGG chart and the Shared cM Project as a handy quick reference. Enjoy!