The Autosomal Me – Extracting Data Segments and Clustering

This is Part 8 of a multi-part series, “The Autosomal Me.”

Part 1 was “The Autosomal Me – Unraveling Minority Admixture” and Part 2 was “The Autosomal Me – The Ancestors Speak.”  Part 1 discussed the technique we are going to use to unravel minority ancestry, and why it works.  Part 2 gave an example of the power of fragmented chromosomal mapping and the beauty of the results.

Part 3, “The Autosomal Me – Who Am I?,” reviewed using our pedigree charts to gauge expected results and how autosomal results are put into population buckets.  Part 4, “The Autosomal Me – Testing Company Results,” shows what to expect from all of the major testing companies, past and present, along with Dr. Doug McDonald’s analysis.  In Part 5, “The Autosomal Me – Rooting Around in the Weeds Using Third Party Tools,” we looked at 5 different third party tools and what they can tell us about our minority admixture that is not reported by the major testing companies because the segments are too small and fragmented.

In Part 6, “The Autosomal Me – DNA Analysis – Splitting Up,” we began the analysis part of the data we’ve been gathering.  We looked at how to determine which parent the minority admixture on specific chromosomes came from.

Part 7, “The Autosomal Me – Start, Stop, Go – Identifying Native Chromosomal Segments”, took a deeper dive and focused on the two chromosomes with proven Native heritage and began by comparing those chromosome segments using the 4 GedMatch admixture tools.

In this segment, Part 8, we’ll be extracting all of the Native and Blended Asian segments on all 22 chromosomes, but I’ll only be using chromosomes 1 and 2 for illustration purposes.  We will then be clustering the resulting data to look for trends.  If you’re following along and using this methodology, you’ll be extracting the Native segment start and stop locations from all 22 chromosomes.

I apologize in advance for the length of this article, but there was just no good place to break it into pieces.

So, let’s get started.  As a reminder, we are using the admixture tools at www.gedmatch.com.

I experimented with several types of extractions to see which ones best reflected the results found by both 23andMe and Dr. McDonald and confirmed by the start and stop segments in the highly Native segments of chromosomes 1 and 2 in Part 7 of this series.  We verified that all 4 tools accurately reflected and corroborated the segments listed as Native, so now we’re going to apply that same methodology to the rest of our chromosomal data.

Initially, I tried to use the information from chromosomes 1 and 2 to extract the Native segments using only the “best” tool, but when I looked at all 4 tools, I quickly realized that there was no single “best” choice.  A couple of crucial points came to light.

  • Some of the geographic colors are almost impossible to tell apart.
  • None of the tools are universally best.
  • When looking at all 4 tools, a “best 3 out of 4” approach generally allowed for one of the tools to be wrong, to reference a slightly different database that called the segment differently, or for the colors to be indistinguishable.  In other words, if three tools called a segment Native and one did not, it’s Native; conversely, if fewer than 3 call it Native, then for this comparison, it’s not.

Unfortunately, this created an awful lot of work.  This is probably the best example of where automation tools could and would make a huge difference in this process.
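To give a feel for what that automation might look like, here is a minimal Python sketch of the “best 3 out of 4” rule.  It assumes you have already typed in, by hand, the start and stop locations (in mb) that each tool painted as Native.  The function names and the sample numbers are hypothetical, made up purely for illustration; they are not my actual results and this is not a GedMatch feature.

```python
from collections import Counter

BIN_MB = 0.5  # half-megabase resolution, the finest division used in this article


def to_bins(segments, bin_mb=BIN_MB):
    """Return the set of bin indexes covered by (start_mb, stop_mb) segments."""
    bins = set()
    for start, stop in segments:
        first = int(round(start / bin_mb))
        last = int(round(stop / bin_mb))
        bins.update(range(first, last))
    return bins


def consensus_segments(calls_by_tool, min_tools=3, bin_mb=BIN_MB):
    """Keep only the stretches called Native by at least `min_tools` tools."""
    votes = Counter()
    for segments in calls_by_tool.values():
        votes.update(to_bins(segments, bin_mb))  # one vote per tool per bin
    kept = sorted(b for b, n in votes.items() if n >= min_tools)

    # Stitch neighboring bins back together into start/stop segments.
    merged = []  # each entry is [first_bin, last_bin_exclusive]
    for b in kept:
        if merged and merged[-1][1] == b:
            merged[-1][1] = b + 1
        else:
            merged.append([b, b + 1])
    return [(lo * bin_mb, hi * bin_mb) for lo, hi in merged]


# Hypothetical, hand-recorded calls for one chromosome (NOT real data).
example_calls = {
    "MDLP": [(8.0, 14.0)],
    "Eurogenes": [(9.0, 13.5)],
    "Dodecad": [(10.0, 14.5)],
    "HarappaWorld": [(20.0, 21.0)],
}

print(consensus_segments(example_calls))  # [(10.0, 13.5)]
```

In this made-up example, three tools agree only between 10 and 13.5, so that is the only stretch that survives.  The fourth tool's lone call at 20-21 is dropped, which is exactly the forgiveness the 3 of 4 rule is meant to provide.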

I did two separate extracts.  The first one is what I refer to as the “Strong Native” extract and the second is the “Blended Asian.”  In part, I did these separately as a check and balance to be sure that my first extraction was accurate.

In the first extract, I selected only one category, the one best fitted to “Native American” for each tool.  I used the following categories for each admixture tool:

  • MDLP – Amerind
  • Eurogenes – North Amerindian
  • Dodecad – NE Asian
  • HarappaWorld – American

I completed this process for every chromosome, but I’m only showing the first two chromosomes in this article.

By way of example, using the first tool, MDLP, North Amerind looks black, but is actually very dark grey.  It is, fortunately, distinctive.

On the chromosome painting below, my results for the first part of chromosome 1 are shown in the first band, and mother’s for the same segment are shown as the second band.  The bottom band represents common segments and the black is non-matching segments, meaning those I obtained from my father.  Sometimes this third band can help you determine what you are really seeing in terms of colors and blending, but it’s not always useful.  In this case, trying to spot a small amount of dark gray against black is almost impossible, so not terribly helpful.  But if you were looking for red, that would be another story.  As you move through this process, remember, it’s not exact and utilizing best 3 of 4 will help you recover from any major errors.

You can see that my grey segments show up from about 12-13 and then again at about 14.5.  Sometimes it’s difficult to know how to count something.  For example, my Native at 14.5 is actually more like 14.25-14.5, but I chose not to divide further than half mb segments.  As long as you are consistent in whatever methodology you select, it will work out.

step 8 - 1

Please note that when reading these charts, the small hash mark is the indicator for the measure.  In other words, the small hash mark above 10M means that is the 10M location.  It’s obvious here, but on some charts, the hash mark and the location legend look to be off by one.  Again, as long as you’re consistent, it really doesn’t matter.
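If you are keeping your readings in a spreadsheet or script, consistency is easier when the rounding is done the same way every time.  The sketch below shows one possible convention, which is an assumption for illustration and not necessarily how every number in my tables was recorded: widen each eyeballed segment outward to the nearest half mb.  The function name is hypothetical.

```python
import math


def snap_to_half_mb(start_mb, stop_mb, step=0.5):
    """Widen a segment outward so both ends land on a half-mb boundary.

    Assumed convention: round the start down and the stop up, so a
    partial half-mb block is counted as a whole block.
    """
    snapped_start = math.floor(start_mb / step) * step
    snapped_stop = math.ceil(stop_mb / step) * step
    return snapped_start, snapped_stop


# The small segment discussed above, read off the painting as 14.25-14.5.
print(snap_to_half_mb(14.25, 14.5))  # (14.0, 14.5)
```

Rounding inward instead, or to the nearest boundary, would work just as well.  What matters is picking one rule and applying it to every chromosome and every tool.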

Mother’s Native segments are more pronounced and obvious.  They range from about 8-14.  Using the actual tools, you would record this and then continue scrolling to the right until you reach the end of the chromosome.  On chromosomes 1 and 2, I found the strong Native segments for the four admixture tools, as shown below.

The boxed numbers show the areas that were found “in common” between 23andMe, Dr. McDonald and the admixture tools, as determined in Part 7 of this series.  Highlighted segments show segments where at least 3 of 4 admixture tools reported Native heritage.  As you can see, there were clearly additional Native segments not reported by 23andMe and Dr. McDonald.

Strong Native Chromosomal Detail Table

step 8 - 2

step 8 - 3

Because we have both my and mother’s results, we can infer my father’s contribution.  Clearly, some of his will wind up being some amount of “noise” and some IBS segments, but not all, by any means, and this is the only way to get a “read” on Dad.  This is one form of phasing data.  Phasing refers to various methodologies of figuring out which DNA comes from what source, meaning which parental line.
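As a rough illustration of that inference, the Python sketch below subtracts mother's Native segments from mine.  Whatever is left over is a candidate for Dad's contribution.  The function name and the sample segments are hypothetical, and the result is only a first pass: it still contains the noise and IBS segments mentioned above, and anywhere mother and I both show Native, this simple subtraction cannot say which parent my copy came from.

```python
def subtract_segments(mine, mothers):
    """Return the parts of `mine` not covered by any segment in `mothers`.

    Both arguments are lists of (start_mb, stop_mb) tuples for one chromosome.
    """
    remaining = []
    mothers_sorted = sorted(mothers)
    for start, stop in sorted(mine):
        cursor = start  # left edge of the part not yet accounted for
        for m_start, m_stop in mothers_sorted:
            if m_stop <= cursor or m_start >= stop:
                continue  # no overlap with what is left of this segment
            if m_start > cursor:
                remaining.append((cursor, m_start))  # gap before mother's segment
            cursor = max(cursor, m_stop)
            if cursor >= stop:
                break
        if cursor < stop:
            remaining.append((cursor, stop))
    return remaining


# Hypothetical segments, chosen only to show the mechanics (NOT real data).
my_native = [(12.0, 13.0), (14.0, 14.5)]
mothers_native = [(8.0, 12.5)]

print(subtract_segments(my_native, mothers_native))
# [(12.5, 13.0), (14.0, 14.5)]
```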

While the strongest Native segments are the ones individually most likely to indicate Native American ancestry, that really isn’t the whole story.  I discovered that many of these Native segments are actually embedded in other segments that are indicative of Native heritage too.  In other words, it’s not a line in the sand, yes or no, but more of a sliding scale.

On the chromosome painting below, this one using Eurogenes, with my results shown above and mother’s below, you can see two excellent examples.  Regions relevant to Native ancestry include:

  • Red – South Asian
  • Brown – Southwest Asian
  • Yellow – North Amerindian and Arctic
  • Putty – Siberian
  • Emerald – East Asian

You can see that while mine is almost universally yellow, or Native, with a little Siberian (putty) mixed in for good measure between 169-170, a hint of East Asian (emerald) plus a little South Asian (red), mother’s isn’t.  In fact, hers is a mixture of Native American (yellow) and South Asian (red), with more red than yellow, plus Siberian (putty) and a large segment of East Asian (emerald green).

step 8 - 4A

While her yellow Native segments alone would be staggered across this entire segment in 7 different pieces, when taken together as a whole, the “blended Asian” segment reaches entirely across the screen with the exception of 1 mb between 161.5-162.5, roughly.

The following Blended Asian Chromosomal Detail Table shows all of the blended Asian segments using all four of the admixture tools for chromosomes 1 and 2.

It’s clear that these regions are not solely “Native American” but reach back in time genetically into Asia, particularly Northeast Asia.

Again, the boxed numbers show the “in common” segments between all tools and the yellow highlighted segments are common between at least three of the four admixture tools.

Please note that there were some issues distinguishing colors, as follows:

  • For the MDLP comparison, Mesoamerican and Paleo Siberian are both putty colored and indistinguishable on the chart.  Also, the apple green for Arctic Amerind is very similar to the Austronesian.
  • When using Dodecad, Southeast Asian (light green) and South Asian (apple green) are nearly impossible to distinguish from each other on the graphs.
  • When using HarappaWorld, the apple green for Siberian was very similar to the light forest green for Papua New Guinea and was very difficult to distinguish.  The South Asian putty appears often with the other Native markers, and I considered including this group, but it too was difficult to distinguish from other regions so in the end, I opted not to include this category.
  • If you are colorblind – get help as this is impossible otherwise.

Blended Asian Chromosomal Detail Table

On the Blended Asian Chromosomal Detail Table, I added yellow highlighting where segments in the other Asian geographies match segments that appeared in the Strong Native table.  In each column, the Strong Native category is the last one at the bottom of the list.

The blue highlighting shows other common segments found that were not included in the Strong Native segments.  For a Strong Native yellow segment to be highlighted, it had to be present in 3 of 4 tools, or 75%.  In the Blended Asian group, there are a total of 15 categories between the 4 admixture tools, so for a segment to be shaded blue, it must be found in at least 8 of the categories, or just over half.  There are many segments that are found in several categories across the tools.  For example, segment 192-193 on chromosome 1 is found five times.  This isn’t to say you should discount this segment, only that it isn’t one of the strongest, most universal.  Surprisingly, there really weren’t too many that were close to the cutoff.  Several, but not a majority, were in the 4 or 5 range, and only one was at 7.
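The 8 of 15 tally is tedious to do by eye, so here is a small Python sketch of the counting.  It assumes each of the 15 tool categories has been typed in as a list of start and stop locations, and it counts exact matches only.  The category labels are placeholders and the segment lists are hypothetical; only the thresholds (15 categories, 8 for blue) come from the description above.

```python
from collections import Counter

TOTAL_CATEGORIES = 15  # categories across the 4 admixture tools
BLUE_THRESHOLD = 8     # at least 8 of 15, so just over half


def tally(segments_by_category):
    """Count in how many categories each (start_mb, stop_mb) segment appears."""
    counts = Counter()
    for segments in segments_by_category.values():
        counts.update(set(segments))  # set(): one count per category per segment
    return counts


def is_blue(count, threshold=BLUE_THRESHOLD):
    """True when a segment is common enough to be shaded blue."""
    return count >= threshold


# Placeholder categories with made-up contents (NOT real data): segment
# 192-193 appears in five categories, as in the example in the text.
example = {f"category_{i}": [(192.0, 193.0)] for i in range(1, 6)}
example["category_6"] = [(55.5, 58.0)]

counts = tally(example)
for segment, n in sorted(counts.items()):
    print(segment, f"{n} of {TOTAL_CATEGORIES}", "blue" if is_blue(n) else "not shaded")
```

A segment found five times, like 192-193, falls short of blue but is still visible in the tally, so you can decide for yourself how much weight to give it.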

step 8 - 4

step 8 - 5

step 8 - 6

step 8 - 7

step 8 - 8

step 8 - 9

step 8 - 10

step 8 - 11

step 8 - 12

Clustering

The third step in data extraction is to look at all of the data together.  In this step, we are removing the geographic boundaries of Siberian, N. Amerindian, etc. and combining all of our data.  I have only combined the data within columns, not between columns, so we can get a feel for which tool or tools performed best or maybe not so well.  Each chromosome in each column has its data ordered numerically, and yes, this is a manual cut and paste process.  Sorry.  I warned you, this is a very manual, labor-intensive process.

After I put each column in numerical order, I arranged them so that the numbers were approximately in a line, or a row, with each other.  For example, in the first group below, you can clearly see that the first cluster of results is found using all 4 tools.  When looked at individually, only the blue results were noted as common (at least 8 of 15 for blue), but when viewed as a cluster, you can see between the tools that the cluster itself runs from about 7.5, with a small break from 8-9, and then to about 14.5.  As you would expect, the beginning and end points of the cluster trail off and are not uniform between tools, but the main part of the cluster is found in all the tools.

This introduces the question of how to measure a cluster.  In this case, there is a clean break using all tools between 8 and 9, but that is only 1 mb, which is rather difficult to measure accurately.  You could record this as two distinct clusters, but since it’s so closely adjacent to the rest of the cluster, I’m inclined to include it as one large cluster and use the starting and ending locations for the cluster as a whole.  In other words, the cluster runs from 7.5 through 14.5.  The alternate, more conservative methodology would be to use the “in common” numbers, but in this case, that would be only 10-11.5 and I think you would miss a great deal of useful data.  So, for clusters, I’m recording the full extent of the cluster.  In some cases, you may need to make a judgment call.
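Both ways of measuring a cluster can be sketched in a few lines of Python.  The first function pools segments and bridges any break of 1 mb or less, which gives the full extent; the second takes the overlap of each tool's extent, which gives the conservative “in common” core.  The function names, the 1 mb gap setting and the per-tool numbers are assumptions chosen so the output lines up with the 7.5-14.5 and 10-11.5 figures above; they are not copied from my tables.

```python
def merge_into_clusters(segments, max_gap_mb=1.0):
    """Pool (start_mb, stop_mb) segments, bridging breaks up to `max_gap_mb`."""
    clusters = []
    for start, stop in sorted(segments):
        if clusters and start - clusters[-1][1] <= max_gap_mb:
            # Close enough to the previous cluster: extend it.
            clusters[-1][1] = max(clusters[-1][1], stop)
        else:
            clusters.append([start, stop])
    return [tuple(c) for c in clusters]


def common_core(extent_by_tool):
    """Return the stretch shared by every tool's extent, or None if there is none."""
    start = max(s for s, _ in extent_by_tool.values())
    stop = min(e for _, e in extent_by_tool.values())
    return (start, stop) if start < stop else None


# Hypothetical pooled segments from all tools (NOT real data).
pooled = [(9.0, 14.5), (7.5, 8.0), (10.0, 11.5), (55.5, 58.0)]
print(merge_into_clusters(pooled))  # [(7.5, 14.5), (55.5, 58.0)]

# Hypothetical extent of the first cluster as seen by each tool.
extents = {
    "MDLP": (7.5, 14.0),
    "Eurogenes": (9.0, 11.5),
    "Dodecad": (10.0, 14.5),
    "HarappaWorld": (7.5, 12.0),
}
print(common_core(extents))  # (10.0, 11.5)
```

Setting `max_gap_mb` to 0.5 instead would record the 8-9 break as a boundary and produce two clusters, so the judgment call described above becomes a single number you can change.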

Let’s look at the second group of numbers, beginning with 18.5 in HarappaWorld.  This grouping runs through about 28.  Eurogenes found some blended Asian between 27-28.5 as well in two of the geographies, but overall, across the 15 categories, we don’t see much.  This could be a result of a number of things.  I could have had problems with the colors, or there may be only a very small amount that is categorized as something else by the other tools.  I would not consider this a cluster, and using our best 3 of 4 methodology eliminates it from consideration.  This also holds true for 43-43.5.

However, the next cluster, from 55.5 to 58 is found in the Strong Native comparison, indicated by the yellow highlighting and is found using all 4 tools.  This is definitely a cluster.
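The keep-or-drop decision for a candidate cluster can be sketched the same way: count how many of the 4 tools have any segment overlapping it, and keep it only if at least 3 do.  In the Python sketch below, the 18.5-28 and 27-28.5 readings come from the discussion above, while the function names and the per-tool boundaries inside 55.5-58 are hypothetical.

```python
def tools_supporting(cluster, segments_by_tool):
    """List the tools with at least one segment overlapping the cluster."""
    c_start, c_stop = cluster
    return sorted(
        tool
        for tool, segments in segments_by_tool.items()
        if any(start < c_stop and stop > c_start for start, stop in segments)
    )


def is_cluster(cluster, segments_by_tool, min_tools=3):
    """Apply the best 3 of 4 rule to a candidate cluster."""
    return len(tools_supporting(cluster, segments_by_tool)) >= min_tools


# Per-tool segments; the boundaries within 55.5-58 are made up for illustration.
by_tool = {
    "MDLP": [(55.5, 57.5)],
    "Eurogenes": [(27.0, 28.5), (55.5, 58.0)],
    "Dodecad": [(56.0, 58.0)],
    "HarappaWorld": [(18.5, 28.0), (55.5, 58.0)],
}

print(tools_supporting((18.5, 28.5), by_tool))  # ['Eurogenes', 'HarappaWorld']
print(is_cluster((18.5, 28.5), by_tool))        # False
print(is_cluster((55.5, 58.0), by_tool))        # True
```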

step 8 - 13

step 8 - 14

step 8 - 15

step 8 - 16

step 8 - 17

step 8 - 18

step 8 - 19

Step 8 - 20

step 8 - 21

step 8 - 22

step 8 - 23

step 8 - 24

I’ve synthesized the cluster information into a list.  From the clusters above, I’ve created a list that I will be using in the next segment for data input into my spreadsheet of matches.  The blended segments below that include Strong Native segments are shown with yellow.

step 8 - 25

Using the GedMatch admixture applications, we’ve isolated the strongest Native and the Blended Asian segments and clusters in preparation for identifying specific Native family lines within our group of matches.

This process shows that, for the most part, the Strong Native segments picked up the strongest signals, about half of the segments that will be useful in determining Native admixture, although it does miss some.

When we use the clustering technique to view our results across all the admixture tools, we see a somewhat different picture emerge, adding several Blended Asian clusters.

In Part 9 of this series, we will use the highlighted Strong Native segments and the Blended Asian clusters, both of which suggest Native chromosomal “hotspots,” to begin comparing against our genetic matches for genealogical relevance.  In other words, using this information, we will determine which genealogical lines carry Native ancestry.

Part 9 may be somewhat delayed.  The good news is that Family Tree DNA is finishing work on their Build 36 to Build 37 conversion.  The bad news is that it fell right in the middle of writing this series.  When they finish Build 37, I’ll finish Part 9 of this series.  In the meantime, you can be extracting your minority segments using the tools and techniques that we have covered in Parts 1-8.
