This article covers both sorting in Excel and how to identify an overlapping segment, and what that means to you as a genetic genealogist.
I swore I wasn’t going to teach Excel, but there have been so many questions about sorting Excel spreadsheets that I am going to a very basic “how to sort and not hurt yourself” article. This does NOT replace actually understanding how to use Excel, but it will at least get you through the knothole of sorting for genetic genealogy.
I wrote more about sorting and filtering in the concepts article about assigning parental sides.
There are some advanced ways to accomplish the same thing, and I’m not discussing those. If you already know how to use Excel those are fine, but this article provides the basics for those who don’t.
I am going to use, as an example, my matches to only a few people which gives us enough information to sort, but isn’t overwhelming.
When you download your results from Family Tree DNA, your spreadsheet will be in match name order, like the spreadsheet below.
I want you to notice that while the primary order is by match, there is a secondary order too (chromosome), and a third (start location) and fourth (end location) as well.
Within each match, the order is by chromosome, and then by start and end location.
What this means that you can look at Alice and see that chromosome 1 is first, and that the lowest value start location is shown first within chromosome order.
That’s not the order you’ll likely be working with all the time, so let’s take a look at how to sort the spreadsheet in a different way.
The row highlighted in red contain column headers.
When you sort an individual column you will select the header for that column, shown below, if you’re going to sort the Matching SNPs column.
The cell on your spreadsheet won’t be red, but I’ve colored it red here so you can see that I’m selecting this column header and only this column header.
When you select a column header, you put the cursor on that cell and click once.
The cell you’ve selected will be bordered in black. A screen shot of my spreadsheet is shown above.
I want you to watch what happens to these two rows colored green when I sort in Matching SNP order.
At this point, you will click on the sort and filter button on the upper right hand side of the toolbar.
Here’s a closeup.
Selecting the “Sort A to Z” option sorts the contents of the entire spreadsheet in Matching SNP order, smallest to largest, because that’s the column header and sort option combination you selected. I use lowest to highest (A-Z) but you can also sort in reverse order, highest to lowest (Z-A) but that isn’t terribly useful for what we will be doing.
Notice that all of the rows are sorted into smallest to larger order by the Matching SNP column. So while the two green rows were originally together, now the rows all appear in order by the Matching SNPs column values.
The first green row match to Alice on chromosome 3 with 1300 cMs falls between the SNP value of 850 and 1458. The second green row with a value of 2000 falls between 1638 and 2355. This is exactly as it should be. The contents of the entire spreadsheet are sorted by the values in the Matching SNPs column.
The statement “sorts the contents of the entire spreadsheet” is very important, because if you perform this task incorrectly, you will bollux up your entire spreadsheet, as in irrecoverably and forever. What follows is an example of what NOT TO DO.
DO NOT DO THIS
DO NOT, and I repeat, DO NOT select the entire column to sort.
This is an example of WHAT NOT TO DO.
If you select the entire column, as shown above, then sort, here’s what happens.
Notice that the green rows are now split apart – in other words they no longer form a row from left to right. That means that ONLY the data in the Matching cM column was sorted, but not rest of the data which is still in the same location on the spreadsheet as it was before the sort. Therefore, Alice’s green row Matching cM value of 1300 is no longer with Alice, since only the data in the Matching SNPs column was sorted. Now Alice’s 1300 cMs connected to Stacy’s red row on chromosome 4. Alice now has 500 SNPs instead, which as you can see, clearly isn’t accurate.
This is what I meant by selecting the entire column instead of just the header will forever ruin your data. If you do this, there is no recovery, unless you JUST did it,
realize the error, and can selecte the blue backarrow on the top of the toolbar on the left to “undo” your action. If you’re beyond that, the only recovery is to download your data again, or move to a backup if you have one.
What’s even worse if you do this and don’t realize it, so you’re working with incorrect data trying to find overlapping segments. Of course, everything will be wrong. I periodically do a sanity check and look at a couple people in the chromosome browser just to make sure that everything is as it should be on my spreadsheet and I haven’t done something like this.
To Sort Correctly – DO This
To use this spreadsheet effectively for genetic genealogy, we need the spreadsheet to be sorted in this viewing order:
- Chromosome number
- Start location
- End location
In other words, we need the spreadsheet to look like this with all of the green cells remaining in their row with their match:
You’ll notice that all matches on each chromosome are grouped together, with the smallest start location first, as illustrated by the red groupings of chromsomes 1 and 6. I do realize these are small segments, but the process is the same for large or small segments, so for our sorting example, just ignore any genealogical relevance associated with segment size.
You will be looking for overlapping segments. Notice that you have to be cognizatnt of the end location. In the case of chromsome 1, above, there are no overlapping segments for the two chromsome one matches, so they can’t match each other on this segment.
However, on chromsome 6, we have a different situation. Stacy’s segment match with me is quite long, 104cM. Stacy’s segment overlaps with everyone else’s on chromsone 6 that matches to me, either fully or part way. She matches Alice on all of the segments fully except for the last one. Stacy’s match to me ends at 108,000,000. Alice’s last segment matches to me from 107,779,220 which is included in Stacy’s match, but Alice’s match extends beyond Stacys, to 110,175,307.
Keep in mind that we don’t know at this point whether or not Stacy and Alice are from my mother or father’s side, based on matching. In other words, to draw any conclusions, we also have to know if Stacy and Alice match each other on this segment which we can’t tell from this spreadsheet.
Because I have access to Stacy’s account, I can indeed tell you that Stacy and Alice do not match each other on this segment, so they would be from different sides of my family tree. Stacy is a known relative from my father’s side and Alice does match my mother as well, so we now know that Stacy and Alice don’t match each other.
If you don’t have access to the accounts to see if your matches match each other, two tools at Family Tree DNA are partial substitutes.
- The ICW tool tells you if two of your matches match each other, just not on which segments.
- The maternal/paternal Family Matching tool, if you have connected the DNA of relatives who have tested, tell you which side your matches are from, maternal or paternal.
You can read about how to use those tools here.
If there are multiple matches with the smallest start location then they will be in order by the smallest end location first, shown in the yellow cells.
The sort order is exactly the opposite of the viewing order. If you want to SEE the data in this order:
Then you must sort in this order:
The last column you sort will be the primary viewing order.
Let’s look at our spreadsheet utilizing these three steps, in order.
Step 1 – First Sort
Selecting End Location to sort:
After sorting by end location, below.
You will notice that all of the data is now in order by the values in the End Location column – smallest at the top, largest at the bottom.
The data in the other columns is not in any particular order at all.
Step 2 – Second Sort
Now selecting Start Location to sort that column in order, shown below.
Having sorted by Start Location, below:
You will notice that now all of the data is sorted by start location. In the case where there is a common start location between two rows, highlighted in red, the end row with the lower end location will show first, noted in yellow, because you sorted first by end location in smallest to largest order.
Step 3 – Third Sort
Last, you’ll select the Chromosome column header to sort in chromosome order.
Below, the result of sorting the third time in chromsome order. After sorting, I bordered all segments on the same chromosome.
You can see that the entire spreadsheet is grouped by chromsome, and within chromsome number, the Start Location is grouped smallest to largest. If there are multiple people with the same start location, then the End Location comes into play, with the smallest end location listed first, as shown in the red and yellow rows.
If you want to sort your spreadsheet in another order for some reason, you can do so using the same methodology. Once you understand about sorting spreadsheets, you understand about sorting all spreadsheets.
Now, you’re ready to look for your overlapping segments.
What is an Overlap?
An overlap is two segments of your matches that are partially or completely overlapping each other. When you have overlapping segments, assuming they are of decent size, that indicates that the two people who match you on your spreadsheet potentially match each other too. Remember, there are three matching possibilities:
- Your matches will either match each other, in addition to you, because you and both of them share a common ancestor or…
- They both match you, but they won’t match each other because one is from your mother’s side and one is from your father’s side or…
- One or both are identical by chance. In you need a refresher on what identical by chance, descent and population mean, click here.
In this first example, above, there is no overlap between these two people on chromosome 17. One begins at 31,000,000 and ends at 36,000,000 while the second person’s match with you doesn’t begin until 40,000,000, which is clearly beyond the end of 36,000,000, so there is no possibility of overlaps between these two individuals. In other words, they cannot match each other on these segments. However, clearly they both match you because they are both on your matching spreadsheet.
In the example above, the overlapping portion of the segment is from 38,000,000 – 40,000,000. The second person’s match with you extends to 53,000,000, but the area between 40,000,000 and 53,000,000 does not overlap.
In the example above, the start number is lower for the top row than the second row, so the overlapping area is still from 38,000,000 – 40,000,000, because the matches don’t match from 36,000,000 to 38,000,000.
Occasionally, you have an overlap that is fairly miniscule, which I generally ignore unless they are in a group that has a larger overlap that overlaps or covers both smaller matches, as in the example above. You can see that our red and yellow rows have a very small overlap from 39,500,000 – 40,000,000. However, the top row includes the entire areas of both red and yellow rows, reaching from 33,000,000 to 55,000,000 which begins before either red/yellow row and ends after both red/yellow rows. So either all 3 individuals will match each other, indicating a common ancestor, or the top row will match one of the red/yellow rows and not the other.
Combining Spreadsheets From Different Sources
The good news is that you can download your matches into a spreadsheet format from 23andMe, Family Tree DNA and GedMatch, but you do need to understand something about the basics of sorting and how to stay out of spreadsheet trouble. I am careful about combining spreadsheets sources for a couple of reasons.
- First, the formatting is not exactly the same, so you may need to move columns to be in the correct order for your spreadsheet before actually combining them.
- Second, there may be overlapping people between 23andMe, Family Tree DNA and GedMatch. You’ll need to figure out how you want to deal with that, especially on an ongoing basis when you need to add to or update your spreadsheet without overwriting or eliminating your matching work and notes relative to common ancestors and ancestral lines in the columns you’ll be adding.
I always make a backup file with a date name in the file name before doing combinations, and sometimes before sorting as well.
If you want to learn more about how to use Excel, here are some additional resources to utilize.
I found some training videos for Excel including “Twenty with Tessa, Tips and Suggestions for Spreadsheets” which is focused on using spreadsheets with one name studies and genetic genealogy, but the principles are the same. https://www.youtube.com/watch?v=Ll_cfhOZTl0&feature=youtu.be
When discussing this online, one person mentioned that they joined www.lynda.com and took the basic Excel class which she found very useful.
Kitty Cooper has instructions on her blog for how to make a matches spreadsheet as well.
www.DNAadoption.com has some good courses. Their DNA for beginners covers using spreadsheets and is not just for adoptees!