DNAeXplained – Genetic Genealogy

Concepts – Relationship Predictions

One of the ways people utilize autosomal DNA for genealogical matching is by looking for common segments of DNA that match with known, or unknown, relatives.  When the relationship to the person is unknown, we attempt to utilize how much DNA we share with that person as a predictor of how, or at what level, we’re related to them – so in essence where or how far back we might look in our tree for a common ancestor.

Until recently, the best estimate we had of how much DNA two people of a particular relationship (like first cousins) could be expected to share, both as a percentage and in cMs (centiMorgans), was the table on the ISOGG wiki page.  Often, these expected averages didn’t mesh well with what the community was seeing in reality.

Recently, Blaine Bettinger’s crowdsourced Shared cM Project reported the averages for each relationship level, plus the range represented from lowest to highest in a project where more than 10,000 people participated by providing match information.

Additionally, before publication, Blaine worked with a statistician to remove outliers in each category that might represent data entry errors, etc. Not only did Blaine write a nice blog article about this latest data release, he also wrote a corresponding downloadable paper that includes tables and histograms not in his blog.

I am constantly looking between the two sources, meaning the ISOGG table and Blaine’s paper, so as an effort in self-preservation, I combined the information I use routinely from the two tables – and did some analysis in the process.  Let’s take a look.

The Combined Expected cM and Actual Shared cM Chart

On the chart below, the two yellow-headed columns are the Expected Shared cMs from the ISOGG table and Blaine’s Shared cM Average, which is the average amount of DNA that was actually found. These, along with the percent of shared DNA, are the columns I use most often, followed by Blaine’s minimum and maximum, which are the ranges of matching DNA found for each category.  As it turns out, the range is incredibly important – perhaps more important than the averages expected or reported – because the ranges are what we actually see in real life.

I’ve also included the number of respondents, because categories with a larger number of respondents are more likely to be more accurate than categories with only a few, like great-great-aunt/uncle with only 6.

It’s interesting that the greatest number of respondents fell into the aunts/uncles niece/nephew category with second cousins once removed a very close contender.

The next largest categories were, in order: first cousins, second cousins, first cousins once removed, second cousins once removed and third cousins.

Note:  If you downloaded this chart on August 4, 2016, there was an error in the maximum number for first cousins thrice removed.  On August 5, 2016, it was corrected to read 413.

You can see that in reality, all categories except two produced an actual average larger than the expected cM value. One category was equal and one was smaller (yes, I checked to be sure I hadn’t transcribed incorrectly). Actual values that are higher than expected are peach colored, lower values are green and equal values are white.

Most averages aren’t dramatically different for close relationships, but as you move further out, the difference in the averages is significantly greater.  Beginning with third cousins once removed, and every category below that in the chart, the actual average is more than twice that of the expected average.  In addition, the ranges for all categories are wider than expected, especially the further out you go in terms of relationships.

We often wonder why the relationship predictions, especially beyond first or second cousins, vary so widely at the testing companies and GEDmatch. In the chart above, you can see that beyond first cousins, the ranges begin to overlap.
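To make the overlap concrete, here is a minimal sketch that lists every relationship whose observed range contains a given amount of shared DNA. The function name and the tiny lookup table are hypothetical. The two maximums (413 and 91 cM) are figures quoted in this article, while the zero minimums are an assumption based on the observation that the minimum drops to zero for more distant relationships. A real lookup would use every row of the combined chart.

```python
from typing import Dict, List, Tuple

# (minimum cM, maximum cM) observed for a relationship
Range = Tuple[float, float]


def candidate_relationships(shared_cm: float, ranges: Dict[str, Range]) -> List[str]:
    """Return every relationship whose observed range includes shared_cm."""
    return [
        name
        for name, (low, high) in ranges.items()
        if low <= shared_cm <= high
    ]


# Illustrative subset only. Maximums are quoted in the article;
# the zero minimums are an assumption (see the lead-in).
observed_ranges: Dict[str, Range] = {
    "first cousin thrice removed": (0, 413),
    "fourth cousin": (0, 91),
}

# A 91 cM match fits both rows, so the amount alone cannot decide between them.
print(candidate_relationships(91, observed_ranges))
```

With the full chart loaded, most matches beyond first cousins would return several candidates, which is the overlap problem in a nutshell.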

Relationships of the same degree should share the same percentage and, theoretically, the same amount of DNA, but they don’t. You can see that the cells marked in red are all 4th degree relatives.  However, half first cousins show a maximum of 580, with the two following rows showing 704 and 580 – all 4th degree relatives. There’s a pretty significant difference between 580 and 704.

Through 5th degree relatives, everyone matched at some level, meaning the minimum is above zero, but beginning with 6th degree relatives, row highlighted in yellow, some people did not match relatives at that level, meaning the minimum is zero.

In the last 4 rows on the chart, 15th, 16th and 17th degree relatives, marked with light aqua, where academically we “should” share 0% of our DNA, we see that the observed average is from 7 to 11 cM and the range is up to 29 cM.
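The "academic" expectation comes from the fact that the expected share halves with each additional degree of relationship. The sketch below shows that arithmetic. The function names are hypothetical, and the 6,800 cM total is an assumption that is not stated in this article (it corresponds to a parent/child match of about 3,400 cM); swap in whatever total your reference table uses.

```python
# Assumption: total autosomal cM counted across both parental copies.
# This value is not stated in the article; adjust it to match your table.
TOTAL_CM = 6800.0


def expected_fraction(degree: int) -> float:
    """Expected fraction of shared DNA for a relative of the given degree."""
    if degree < 1:
        raise ValueError("degree must be 1 or greater")
    return 0.5 ** degree


def expected_cm(degree: int, total_cm: float = TOTAL_CM) -> float:
    """Expected shared cM for a relative of the given degree."""
    return expected_fraction(degree) * total_cm


for degree in (1, 4, 9, 15, 17):
    percent = expected_fraction(degree) * 100
    print(f"degree {degree:>2}: {percent:.4f}% expected, {expected_cm(degree):.2f} cM")
```

Under these assumptions a 4th degree relative is expected to share 6.25%, or 425 cM, while a 15th degree relative is expected to share about 0.003%, which rounds to the 0% shown in the table.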

An example of why predictions are so difficult is that if you are on the high end of the 4th cousin range, meaning a 9th degree relative with 91 shared cMs of DNA, you are also right at the average between 6th and 7th degree relatives, which fall into the half second cousin or third cousin range.

Without relationship knowledge, the vendor, based on averages, is going to call this relationship a 2nd or 3rd cousin, when in reality, it’s a 4th cousin. Most vendor relationship predictions are based on a combination of total shared cMs and longest block, but still, it’s easy to be outside the norm.  In other words, not only does one size not fit all, it probably doesn’t fit most.
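As a rough illustration of how an averages-only prediction goes wrong, the sketch below inverts the halving rule to turn a shared cM total into an estimated degree. This is not any vendor's actual algorithm (as noted, vendors also consider the longest block). The function name is hypothetical, and the 6,800 cM total is an assumption that does not come from this article.

```python
import math

# Assumption: total autosomal cM counted across both parental copies.
TOTAL_CM = 6800.0


def estimated_degree(shared_cm: float, total_cm: float = TOTAL_CM) -> float:
    """Estimate relationship degree from shared cM by inverting the halving rule."""
    if shared_cm <= 0 or shared_cm > total_cm:
        raise ValueError("shared_cm must be greater than 0 and no more than total_cm")
    # expected_cm = total_cm * 0.5 ** degree, solved for degree
    return math.log2(total_cm / shared_cm)


estimate = estimated_degree(91)
print(f"91 cM looks like degree {estimate:.2f}, which rounds to {round(estimate)}")
```

For the 91 cM fourth cousin match described above, this naive estimate lands at roughly the 6th degree, three degrees closer than the true 9th degree relationship.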

Graphs

For me, graphs help make information understandable because I can see the visual comparison.

These overlapping ranges are much easier to visualize using charts.  Please note that you can click on any image for a larger view.

The values and ranges for 1st, 2nd and 3rd degree relatives are so much larger than more distant relatives, that you can’t effectively see the information for more distant relatives, so I’ve broken the charts apart, below.

This first chart, above, shows third degree relatives and closer.  Note that the purple maximum for aunts/uncles, nieces/nephews is larger than the minimum for full siblings and greater than the red average or blue expected for half-siblings.

This second chart shows the more distant relationships, meaning 4th degree through 17th degree relatives, but the more distant relationships are still difficult to see, so let’s switch to bar charts and smaller groups.

This first bar chart includes parent/child through first cousin relationships, or 1st through 3rd degree relatives. You can see that the first cousin maximum range (purple) overlaps the aunt/uncle, grandparent and half-sibling minimum ranges (green). The half-sibling maximum and full sibling minimum are very close.

The balance of relationships are a bit small to view in one chart, but the ranges do overlap significantly.  Unfortunately, Excel does us the favor of skipping some labels on the left side of the chart.

Removing the legend helps a bit, but not much.  Please refer to the color legend in the same graph above.  I’ve further divided the groups below.

The chart above shows 4th degree to 9th degree relatives, meaning great-great-aunts or uncles through 4th cousins.

The same chart, above, with the legend removed to allow for more viewing space.  You’ll notice that at the half second cousins level, and more distant, the green minimum disappears, which means that some people have no matches, so the minimum cM shared is zero for some people with this relationship level.  However, based on the average and maximum, many people do share DNA with people at that relationship level.

The chart above begins with 7th degree relatives, half second cousins, where you share less than 1% of your DNA.

In many cases, the purple maximum range for one relationship category overlaps Blaine’s average and the expected values for other categories.  For example, in the chart below, you can see that the maximum purple bar for the various 5th cousin ranges is higher than the third cousin, twice removed red shared cM average, and significantly higher than the blue expected shared cM value.  In fact, the 6th cousin purple max is nearly the same as the blue expected cM for third cousins once removed.  Note that Excel showed only every other category on the left hand axis, so you’ll need to refer to the actual data chart from time to time.

I’ve removed the legend again so you can see the actual stacked ranges more clearly.  All of the 7th degree relatives have a minimum of zero, so there is no green bar.  Furthermore, at 5th cousins twice removed, the expected shared cMs drop to below 1, so the blue bar is nearly indistinguishable.

This last chart shows the smallest group, 9th through 17th degrees, or 4th through 8th cousins.

On this final chart, we clearly see that Blaine’s actual red shared cMs and the purple maximum are significantly more pronounced than the blue expected shared cMs.  Some people share no DNA at this level, which is to be expected, but a non-trivial number of people share significantly more than is mathematically expected.  There are no absolutes.

Summary

DNA is not always inherited in the fashion or amount expected, and that wide variance is why we see what people believe are “false positive” relationship predictions. In reality, the best the vendors can do is to work with the averages.  This also explains why it’s so difficult for us to estimate or determine how a person might be connected based just on the relationship or generational prediction.  It’s just that: a prediction based on averages, which may or may not reflect reality.

There’s a lot we don’t know yet about inheritance – why certain segments are passed on, often intact, sometimes for many generations, and some segments are not.  We don’t know how segments are “selected” for inheritance and we don’t yet know why some segments appear to be “sticky” meaning they show up more in descendants than other segments.

Close relationships are relatively easy, or easier, to predict, at least by relationship degree, but further distant ones are almost impossible to predict accurately based on either academic inheritance models or Blaine’s crowdsourced average cM information.

Here’s a clean copy of the combined chart for your use.


