Concepts – Relationship Predictions

One of the ways people use autosomal DNA for genealogical matching is by looking for segments of DNA they share with known, or unknown, relatives.  When the relationship to a match is unknown, we use the amount of DNA we share with that person as a predictor of how, or at what level, we’re related to them – in essence, where or how far back in our tree we might look for a common ancestor.

Until recently, the best estimate of how much DNA people in a particular relationship (like first cousins) could be expected to share, in both percentages and cMs (centiMorgans), was the table on the ISOGG wiki page.  Often, these expected averages didn’t mesh well with what the community was seeing in reality.

Recently, Blaine Bettinger’s crowdsourced Shared cM Project reported the averages for each relationship level, plus the range represented from lowest to highest in a project where more than 10,000 people participated by providing match information.

Additionally, before publication, Blaine worked with a statistician to remove outliers in each category that might represent data entry errors, etc. Not only did Blaine write a nice blog article about this latest data release, he also wrote a corresponding downloadable paper that includes tables and histograms not in his blog.

I am constantly looking between the two sources, meaning the ISOGG table and Blaine’s paper, so as an effort in self-preservation, I combined the information I use routinely from the two tables – and did some analysis in the process.  Let’s take a look.

The Combined Expected cM and Actual Shared cM Chart

On the chart below, the two yellow-headed columns are the Expected Shared cMs from the ISOGG table and Blaine’s Shared cM Average – which is the average amount of DNA that was actually found. These, along with the percent of shared DNA, are the columns I use most often, followed by Blaine’s minimum and maximum, which give the range of matching DNA found for each category.  As it turns out, the range is incredibly important – perhaps more important than the averages expected or reported – because the ranges are what we actually see in real life.

I’ve also included the number of respondents, because categories with a larger number of respondents are likely to be more accurate than categories with only a few, like great-great-aunt/uncle with only 6.

It’s interesting that the greatest number of respondents fell into the aunts/uncles niece/nephew category with second cousins once removed a very close contender.

The next largest categories were, in order: first cousins, second cousins, first cousins once removed, second cousins once removed and third cousins.

Note:  If you downloaded this chart on August 4, 2016, there was an error in the maximum value for first cousins thrice removed.  On August 5, 2016, it was corrected to read 413.

Expected vs shared cM 4

You can see that in reality, all categories except two produced a larger average than the expected cM value. One category was equal and one was smaller (yes, I checked to be sure I hadn’t transcribed incorrectly). Categories where the actual number is higher are peach colored, lower are green and equal are white.

Most averages aren’t dramatically different for close relationships, but as you move further out, the difference in the averages is significantly greater.  Beginning with third cousins once removed, and every category below that in the chart, the actual average is more than twice that of the expected average.  In addition, the ranges for all categories are wider than expected, especially the further out you go in terms of relationships.

We often wonder why the relationship predictions, especially beyond first or second cousins, vary so widely at the testing companies and GedMatch. In the chart above, you can see that beyond first cousins, the ranges begin to overlap.

Relationships of the same degree should share the same percentage and, theoretically, the same amount of DNA, but they don’t. You can see that the cells marked in red are all 4th degree relatives.  However, half first cousins show a maximum of 580, with the two following rows showing 704 and 580 – all 4th degree relatives. There’s a pretty significant difference between 580 and 704.

Through 5th degree relatives, everyone matched at some level, meaning the minimum is above zero, but beginning with 6th degree relatives, row highlighted in yellow, some people did not match relatives at that level, meaning the minimum is zero.

In the last 4 rows on the chart, 15th, 16th and 17th degree relatives, marked with light aqua, where academically we “should” share 0% of our DNA, we see that the observed average is from 7 to 11 cM and the range is up to 29 cM.
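For readers who like to see the arithmetic, here is a minimal Python sketch of where expected values like these come from. It assumes the expected share simply halves with each relationship degree, and that the expected cM column is scaled from a total of 6800 cM (so that parent/child at 50% works out to 3400 cM). The 6800 figure and the function names are assumptions made for illustration, so check them against the table you actually use.

```python
# Assumption: 6800 cM represents "100%", so a parent/child pair (50%) is 3400 cM.
# Change TOTAL_CM if your reference table uses a different total.
TOTAL_CM = 6800.0


def expected_fraction(degree: int) -> float:
    """Expected fraction of shared autosomal DNA, halving with each degree."""
    if degree < 1:
        raise ValueError("degree must be 1 or greater")
    return 0.5 ** degree


def expected_cm(degree: int, total_cm: float = TOTAL_CM) -> float:
    """Expected shared cM for a relationship of the given degree."""
    return total_cm * expected_fraction(degree)


# Print a few degrees to show how quickly the expected values shrink.
for degree in (1, 3, 9, 13, 15, 17):
    pct = 100 * expected_fraction(degree)
    print(f"degree {degree:>2}: {pct:8.4f}%  {expected_cm(degree):9.2f} cM")
```

Under these assumptions the expected value falls below 1 cM at the 13th degree and rounds to 0% by the 15th, which is why observed averages of 7 to 11 cM at those levels stand out.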

An example of why predictions are so difficult: if you are at the high end of the 4th cousin range, a 9th degree relative sharing 91 cMs of DNA, you are also right at the average between 6th and 7th degree relatives, which fall into the half second cousin or third cousin range.

Without relationship knowledge, the vendor, based on averages, is going to call this relationship a 2nd or 3rd cousin, when in reality, it’s a 4th cousin. Most vendor relationship predictions are based on a combination of total shared cMs and longest block, but still, it’s easy to be outside the norm.  In other words, not only does one size not fit all, it probably doesn’t fit most.
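The overlap problem can be sketched as a simple lookup: given a shared cM total, list every relationship whose observed minimum-to-maximum range contains it. This is only an illustration, not how any vendor actually makes predictions. The function name is hypothetical, and the ranges in the table are placeholders, not Shared cM Project figures.

```python
from typing import Dict, List, Tuple

# Placeholder (min cM, max cM) ranges for illustration only. These are NOT the
# Shared cM Project figures; substitute the minimum and maximum columns from
# the combined chart.
EXAMPLE_RANGES: Dict[str, Tuple[float, float]] = {
    "relationship A": (200.0, 600.0),
    "relationship B": (40.0, 300.0),
    "relationship C": (0.0, 100.0),
}


def candidate_relationships(
    shared_cm: float, ranges: Dict[str, Tuple[float, float]]
) -> List[str]:
    """Return every relationship whose observed range contains shared_cm."""
    return [
        name
        for name, (low, high) in ranges.items()
        if low <= shared_cm <= high
    ]


# A single total can fall inside more than one range at once.
print(candidate_relationships(91.0, EXAMPLE_RANGES))
```

With real ranges filled in, a single value beyond first cousins will typically return several candidates, which is the whole difficulty in a nutshell.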

Graphs

For me, graphs help make information understandable because I can see the visual comparison.

These overlapping ranges are much easier to visualize using charts.

Expected full range 4

The values and ranges for 1st, 2nd and 3rd degree relatives are so much larger than more distant relatives, that you can’t effectively see the information for more distant relatives, so I’ve broken the charts apart, below.

Expected to third degree

This first chart, above, shows third degree relatives and closer.  Note that the purple maximum for aunts/uncles, nieces/nephews is larger than the minimum for full siblings and greater than the red average or blue expected for half-siblings.

expected 4th to 17th 4

This second chart shows the more distant relationships, meaning 4th degree through 17th degree relatives, but the more distant relationships are still difficult to see, so let’s switch to bar charts and smaller groups.

Expected stacked to third

This first bar chart includes parent/child through first cousin relationships, or 1st through 3rd degree relatives. You can see that the first cousin maximum range (purple) overlaps the aunt/uncle, grandparents and half-sibling minimum ranges (green). Half sibling max and full sibling minimum are very close.

Expected 4th to 17th stacked 4

The remaining relationships are a bit small to view in one chart, but the ranges do overlap significantly.  Unfortunately, Excel does us the favor of skipping some labels on the left side of the chart.

Expected 4th to 17th no legend 4

Removing the legend helps a bit, but not much.  Please refer to the color legend in the same graph above.  I’ve further divided the groups below.

Expected 4th to 9th 4

The chart above shows 4th degree to 9th degree relatives, meaning great-great-aunts or uncles through 4th cousins.

expected 4th to 9th no legend 4

The same chart, above, with the legend removed to allow for more viewing space.  You’ll notice that at the half second cousins level, and more distant, the green minimum disappears, which means that some people have no matches, so the minimum cM shared is zero for some people with this relationship level.  However, based on the average and maximum, many people do share DNA with people at that relationship level.

Expected 7th to 17th 4

The chart above begins with 7th degree relatives, half second cousins, where you share less than 1% of your DNA.

In many cases, the purple maximum range for one relationship category overlaps Blaine’s average and the expected values for other categories.  For example, in the chart below, you can see that the maximum purple bar for the various 5th cousin ranges is higher than the third cousin, twice removed red shared cM average, and significantly higher than the blue expected shared cM value.  In fact, the 6th cousin purple max is nearly the same as the blue expected cM for third cousins once removed.  Note that Excel showed only every other category on the left hand axis, so you’ll need to refer to the actual data chart from time to time.

expected 7th to 17th no legend

I’ve removed the legend again so you can see the actual stacked ranges more clearly.  All of the 7th degree relatives have a minimum of zero, so there is no green bar.  Furthermore, at 5th cousins twice removed, the expected shared cMs drop to below 1, so the blue bar is nearly indistinguishable.

expected 9th to 17th

This last chart shows the smallest group, 9th through 17th degrees, or 4th through 8th cousins.

expected 9th to 17th no legend

On this final chart, we clearly see that Blaine’s actual red shared cMs and the purple maximum are significantly more pronounced than the blue expected shared cMs.  Some people share no DNA at this level, which is to be expected, but a non-trivial number of people share significantly more than is mathematically expected.  There are no absolutes.

Summary

DNA is not always inherited in the fashion or amount expected, and that wide variance is why we see what people believe are “false positive” relationship predictions. In reality, the best the vendors can do is to work with the averages.  This also explains why it’s so difficult for us to estimate or determine how a person might be connected based just on the relationship or generational prediction.  It’s just that, a prediction based on averages which may or may not reflect reality.

There’s a lot we don’t know yet about inheritance – why certain segments are passed on, often intact, sometimes for many generations, and some segments are not.  We don’t know how segments are “selected” for inheritance and we don’t yet know why some segments appear to be “sticky” meaning they show up more in descendants than other segments.

Close relationships are relatively easy, or easier, to predict, at least by relationship degree, but further distant ones are almost impossible to predict accurately based on either academic inheritance models or Blaine’s crowdsourced average cM information.

Here’s a clean copy of the combined chart for your use.

Note:  If you downloaded this chart on August 4, 2016, there was an error in the maximum value for first cousins thrice removed.  On August 5, 2016, it was corrected to read 413.

Expected vs shared clean 4

Demystifying Ancestry’s Relationship Predictions Inspires New Relationship Estimator Tool

Today, I’m extremely pleased to bring you a wonderful guest article written by Karin Corbeil as spokesperson for a very fine group of researchers at www.dnaadoption.com.

I love it when citizen science really works, pushes the envelope, makes discoveries and then the scientists develop new tools!  This is a win-win for everyone in the genetic genealogy community – not just adoptees!  I want to say a very big thank you to this wonderful team for their fine work.

Take it away Karin….

As genetic genealogists we are always looking for a better “mousetrap”.  Tools and analyses that can better help us understand what we are actually looking at with our DNA results.  For adoptees and those with unknown ancestors it can be even more important.

When Ancestry came out with their “New Amount of Shared DNA” an explanation was necessary to understand what we were seeing.

We at DNAAdoption are asked to explain over and over again why your half-sibling was predicted as a 1st cousin, or that predicted Close Family – 1st cousin could actually be a half-nephew, or a predicted 3rd cousin could be a 4th cousin.  Ancestry doesn’t provide the detailed information needed to support their predicted relationship categories so providing the explanations was often a struggle.

We knew that you cannot draw or correlate any relationship inferences from either the total amount of shared DNA or the number of segments from the typical tools utilized by genetic genealogists because Ancestry’s totals will be lower and their segments will be broken into more pieces due to the removal of segments identified by the Timber algorithm as invalid matches.[1]

So in order to get a better reference to how predictions are set by Ancestry, we at DNAAdoption gathered data from 1,122 matches of different testers who had confirmed these matches as specific relationships. A collaborative effort was led by Richard Weiss of the DNAAdoption team.  Richard worked his magic with the data and the results are presented here.

A clip of the Pivot table from the data input:

Ancestry relationship table

The full data spreadsheet can be downloaded here:

Ancestry Predictions vs. Actual Relationships

Ancestry Predictions vs actual relationships

The most interesting thing about the predicted vs. actual relationships was seeing how greatly the more distant relationships can vary. Look at the 4th cousin prediction, for example. This varies from a half 1st cousin once removed to an 8th cousin once removed. (Obviously, this confirmed 8th cousin once removed probably has an intact segment that, due to the randomness of DNA inheritance, persisted for many generations.) This makes it extremely difficult to assess any predicted relationship at the 4th cousin level. Even 1st, 2nd and 3rd cousin predictions had wide variances.

The only conclusion we can draw from this is to use Ancestry predictions with extreme caution.

With this data we were then able to take the numbers and add to our DNA Prediction Chart that we use in our DNA classes at DNAAdoption.

DNA Prediction Chart

DNA Prediction Chart 2

The full Excel spreadsheet can be downloaded here.

We then incorporated this data into our Relationship Estimator Tool created by Jon Masterson.

Jon explains, “This small program is intended to make the DNA Prediction Chart Spreadsheet a bit easier to use. It is based entirely on the data in this spreadsheet plus some interpolation of missing values. The algorithm to determine the most likely relationship(s) is very simple and based on summing the score of valid entries in the table for a given input. It is very much an experiment and test. It is likely to be less accurate with close relationships where there is missing data in the spreadsheet. You can also save the match information that you generate.”

First, download the zip file RelationshipEstimator.zip here.

Extract the files from the zip file and run RelationshipEstimator.exe.

relationship estimator

The following results are for the same person who has been confirmed as a 3rd cousin. The first set of data is from Gedmatch, the second set is from Ancestry. With this match the actual total cMs over 5 cMs are 122.9 with 5 segments; the same person shows Ancestry Shared DNA of 112 cMs with 7 segments.

For 23andMe/FTDNA/Gedmatch, enter the individual segment lengths in the first box, using a slash “/” between each number.

At the “Source” box select 23andMe/FTDNA/Gedmatch, then click the “Process” button. Several possible estimated relationships will show.
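For anyone curious what that slash-separated input amounts to, the following Python sketch parses such a string and reports the total cMs and the number of segments. It is not the Relationship Estimator’s code. The function name, the example segment lengths and the choice to ignore segments of 5 cMs or less are all assumptions made for illustration.

```python
from typing import Tuple


def summarize_segments(text: str, min_cm: float = 5.0) -> Tuple[float, int]:
    """Parse slash-separated segment lengths; return (total cM, segment count).

    Segments at or below min_cm are ignored. The 5 cM default is an assumption
    based on the "over 5 cMs" totals quoted in the text.
    """
    lengths = [float(part) for part in text.split("/") if part.strip()]
    kept = [cm for cm in lengths if cm > min_cm]
    return round(sum(kept), 1), len(kept)


# Made-up segment lengths, not taken from any real match.
total_cm, segment_count = summarize_segments("50.5/30.5/20/4.5")
print(f"{total_cm} cM across {segment_count} segments")
```

Raising or lowering `min_cm` shows how much the total depends on which small segments are counted, which is one reason totals differ from one source to another.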

Relationship estimator 2

For Ancestry, enter the total cMs and the number of segments.  At the “Source” box select “Ancestry”, then click “Process”.

Relationship estimator 3

More information about this tool can be found here.

By seeing the larger variances with the Ancestry data (6 estimated relationships vs 3 for the actual Gedmatch data) we can only encourage those on Ancestry to upload their raw data file to Gedmatch. Of course, we still hope that one day Ancestry will release the full segment data in a chromosome browser.

We at DNAAdoption continue to try and provide analyses and tools, many times in cooperation with DNAGedcom, to give those searching for their roots better information. But we are “not for adoptees only” and provide this information for the genetic genealogy community as a whole.  We plan to add more data to these analyses in the near future.  We hope you will find it useful.

Your questions and comments are welcome.

Karin Corbeil (karincorbeil@gmail.com)

Diane Harman-Hoog (harmanhoog@gmail.com)

Richard Weiss (rnlweiss@gmail.com)

Jon Masterson (jon@scruffyduck.co.uk) 

[1] Roberta Estes, paraphrased from  https://dna-explained.com/2015/11/06/ancestrys-new-amount-of-shared-dna-what-does-it-really-mean/