Concepts – Sorting Spreadsheets for Autosomal DNA

This article covers both sorting in Excel and how to identify an overlapping segment, and what that means to you as a genetic genealogist.

I swore I wasn’t going to teach Excel, but there have been so many questions about sorting Excel spreadsheets that I am going to a very basic “how to sort and not hurt yourself” article. This does NOT replace actually understanding how to use Excel, but it will at least get you through the knothole of sorting for genetic genealogy.

I wrote more about sorting and filtering in the concepts article about assigning parental sides.

There are some advanced ways to accomplish the same thing, and I’m not discussing those. If you already know how to use Excel those are fine, but this article provides the basics for those who don’t.

Sorting

I am going to use, as an example, my matches to only a few people which gives us enough information to sort, but isn’t overwhelming.

When you download your results from Family Tree DNA, your spreadsheet will be in match name order, like the spreadsheet below.

SS Raw

I want you to notice that while the primary order is by match, there is a secondary order too (chromosome), and a third (start location) and fourth (end location) as well.

Within each match, the order is by chromosome, and then by start and end location.

What this means that you can look at Alice and see that chromosome 1 is first, and that the lowest value start location is shown first within chromosome order.

That’s not the order you’ll likely be working with all the time, so let’s take a look at how to sort the spreadsheet in a different way.

The row highlighted in red contain column headers.

SS column headers

When you sort an individual column you will select the header for that column, shown below, if you’re going to sort the Matching SNPs column.

SS Column select

The cell on your spreadsheet won’t be red, but I’ve colored it red here so you can see that I’m selecting this column header and only this column header.

When you select a column header, you put the cursor on that cell and click once.

SS column select 2

The cell you’ve selected will be bordered in black.  A screen shot of my spreadsheet is shown above.

I want you to watch what happens to these two rows colored green when I sort in Matching SNP order.

SS rows green

At this point, you will click on the sort and filter button on the upper right hand side of the toolbar.

SS sort dropdown

Here’s a closeup.

SS sort dropdown closeup

Selecting the “Sort A to Z” option sorts the contents of the entire spreadsheet in Matching SNP order, smallest to largest, because that’s the column header and sort option combination you selected. I use lowest to highest (A-Z) but you can also sort in reverse order, highest to lowest (Z-A) but that isn’t terribly useful for what we will be doing.

SS SNP column sorted

Notice that all of the rows are sorted into smallest to larger order by the Matching SNP column. So while the two green rows were originally together, now the rows all appear in order by the Matching SNPs column values.

The first green row match to Alice on chromosome 3 with 1300 cMs falls between the SNP value of 850 and 1458.  The second green row with a value of 2000 falls between 1638 and 2355.  This is exactly as it should be.  The contents of the entire spreadsheet are sorted by the values in the Matching SNPs column.

The statement “sorts the contents of the entire spreadsheet” is very important, because if you perform this task incorrectly, you will bollux up your entire spreadsheet, as in irrecoverably and forever.  What follows is an example of what NOT TO DO.

DO NOT DO THIS

DO NOT, and I repeat, DO NOT select the entire column to sort.

SS - Do Not Sort

This is an example of WHAT NOT TO DO.

If you select the entire column, as shown above, then sort, here’s what happens.

SS example bad sort

Notice that the green rows are now split apart – in other words they no longer form a row from left to right. That means that ONLY the data in the Matching cM column was sorted, but not rest of the data which is still in the same location on the spreadsheet as it was before the sort. Therefore, Alice’s green row Matching cM value of 1300 is no longer with Alice, since only the data in the Matching SNPs column was sorted. Now Alice’s 1300 cMs connected to Stacy’s red row on chromosome 4. Alice now has 500 SNPs instead, which as you can see, clearly isn’t accurate.

This is what I meant by selecting the entire column instead of just the header will forever ruin your data. If you do this, there is no recovery, unless you JUST did it, SS undo
realize the error, and can selecte the blue backarrow on the top of the toolbar on the left to “undo” your action. If you’re beyond that, the only recovery is to download your data again, or move to a backup if you have one.

What’s even worse if you do this and don’t realize it, so you’re working with incorrect data trying to find overlapping segments.  Of course, everything will be wrong.  I periodically do a sanity check and look at a couple people in the chromosome browser just to make sure that everything is as it should be on my spreadsheet and I haven’t done something like this.

To Sort Correctly – DO This

To use this spreadsheet effectively for genetic genealogy, we need the spreadsheet to be sorted in this viewing order:

  • Chromosome number
  • Start location
  • End location

In other words, we need the spreadsheet to look like this with all of the green cells remaining in their row with their match:

SS example good sort

You’ll notice that all matches on each chromosome are grouped together, with the smallest start location first, as illustrated by the red groupings of chromsomes 1 and 6. I do realize these are small segments, but the process is the same for large or small segments, so for our sorting example, just ignore any genealogical relevance associated with segment size.

You will be looking for overlapping segments. Notice that you have to be cognizatnt of the end location. In the case of chromsome 1, above, there are no overlapping segments for the two chromsome one matches, so they can’t match each other on this segment.

However, on chromsome 6, we have a different situation. Stacy’s segment match with me is quite long, 104cM. Stacy’s segment overlaps with everyone else’s on chromsone 6 that matches to me, either fully or part way. She matches Alice on all of the segments fully except for the last one. Stacy’s match to me ends at 108,000,000. Alice’s last segment matches to me from 107,779,220 which is included in Stacy’s match, but Alice’s match extends beyond Stacys, to 110,175,307.

Keep in mind that we don’t know at this point whether or not Stacy and Alice are from my mother or father’s side, based on matching. In other words, to draw any conclusions, we also have to know if Stacy and Alice match each other on this segment which we can’t tell from this spreadsheet.

Because I have access to Stacy’s account, I can indeed tell you that Stacy and Alice do not match each other on this segment, so they would be from different sides of my family tree. Stacy is a known relative from my father’s side and Alice does match my mother as well, so we now know that Stacy and Alice don’t match each other.

If you don’t have access to the accounts to see if your matches match each other, two tools at Family Tree DNA are partial substitutes.

  • The ICW tool tells you if two of your matches match each other, just not on which segments.
  • The maternal/paternal Family Matching tool, if you have connected the DNA of relatives who have tested, tell you which side your matches are from, maternal or paternal.

You can read about how to use those tools here.

If there are multiple matches with the smallest start location then they will be in order by the smallest end location first, shown in the yellow cells.

Sort Order

The sort order is exactly the opposite of the viewing order. If you want to SEE the data in this order:

  • Chromosome
  • Start
  • End

Then you must sort in this order:

  • End
  • Start
  • Chromosome

The last column you sort will be the primary viewing order.

Let’s look at our spreadsheet utilizing these three steps, in order.

Step 1 – First Sort

Selecting End Location to sort:

SS sort end location

After sorting by end location, below.

SS end location sorted

You will notice that all of the data is now in order by the values in the End Location column – smallest at the top, largest at the bottom.

The data in the other columns is not in any particular order at all.

Step 2 – Second Sort

Now selecting Start Location to sort that column in order, shown below.

SS sort by start location

Having sorted by Start Location, below:

SS sorted by start location

You will notice that now all of the data is sorted by start location. In the case where there is a common start location between two rows, highlighted in red, the end row with the lower end location will show first, noted in yellow, because you sorted first by end location in smallest to largest order.

Step 3 – Third Sort

Last, you’ll select the Chromosome column header to sort in chromosome order.

Sort by chromosome

Below, the result of sorting the third time in chromsome order.  After sorting, I bordered all segments on the same chromosome.

Sorted by chromosome

You can see that the entire spreadsheet is grouped by chromsome, and within chromsome number, the Start Location is grouped smallest to largest. If there are multiple people with the same start location, then the End Location comes into play, with the smallest end location listed first, as shown in the red and yellow rows.

If you want to sort your spreadsheet in another order for some reason, you can do so using the same methodology. Once you understand about sorting spreadsheets, you understand about sorting all spreadsheets.

Now, you’re ready to look for your overlapping segments.

What is an Overlap?

An overlap is two segments of your matches that are partially or completely overlapping each other.  When you have overlapping segments, assuming they are of decent size, that indicates that the two people who match you on your spreadsheet potentially match each other too.  Remember, there are three matching possibilities:

  • Your matches will either match each other, in addition to you, because you and both of them share a common ancestor or…
  • They both match you, but they won’t match each other because one is from your mother’s side and one is from your father’s side or…
  • One or both are identical by chance.  In you need a refresher on what identical by chance, descent and population mean, click here.

Ss no overlap

In this first example, above, there is no overlap between these two people on chromosome 17.  One begins at 31,000,000 and ends at 36,000,000 while the second person’s match with you doesn’t begin until 40,000,000, which is clearly beyond the end of 36,000,000, so there is no possibility of overlaps between these two individuals.  In other words, they cannot match each other on these segments.  However, clearly they both match you because they are both on your matching spreadsheet.

SS overlap 1

In the example above, the overlapping portion of the segment is from 38,000,000 – 40,000,000.  The second person’s match with you extends to 53,000,000, but the area between 40,000,000 and 53,000,000 does not overlap.

SS overlap 2

In the example above, the start number is lower for the top row than the second row, so the overlapping area is still from 38,000,000 – 40,000,000, because the matches don’t match from 36,000,000 to 38,000,000.

SS overlap 3

Occasionally, you have an overlap that is fairly miniscule, which I generally ignore unless they are in a group that has a larger overlap that overlaps or covers both smaller matches, as in the example above. You can see that our red and yellow rows have a very small overlap from 39,500,000 – 40,000,000. However, the top row includes the entire areas of both red and yellow rows, reaching from 33,000,000 to 55,000,000 which begins before either red/yellow row and ends after both red/yellow rows.  So either all 3 individuals will match each other, indicating a common ancestor, or the top row will match one of the red/yellow rows and not the other.

Combining Spreadsheets From Different Sources

The good news is that you can download your matches into a spreadsheet format from  23andMe, Family Tree DNA and GedMatch, but you do need to understand something about the basics of sorting and how to stay out of spreadsheet trouble. I am careful about combining spreadsheets sources for a couple of reasons.

  • First, the formatting is not exactly the same, so you may need to move columns to be in the correct order for your spreadsheet before actually combining them.
  • Second, there may be overlapping people between 23andMe, Family Tree DNA and GedMatch. You’ll need to figure out how you want to deal with that, especially on an ongoing basis when you need to add to or update your spreadsheet without overwriting or eliminating your matching work and notes relative to common ancestors and ancestral lines in the columns you’ll be adding.

I always make a backup file with a date name in the file name before doing combinations, and sometimes before sorting as well.

Learning Excel

If you want to learn more about how to use Excel, here are some additional resources to utilize.

I found some training videos for Excel including “Twenty with Tessa, Tips and Suggestions for Spreadsheets” which is focused on using spreadsheets with one name studies and genetic genealogy, but the principles are the same.  https://www.youtube.com/watch?v=Ll_cfhOZTl0&feature=youtu.be

When discussing this online, one person mentioned that they joined www.lynda.com and took the basic Excel class which she found very useful.

Kitty Cooper has instructions on her blog for how to make a matches spreadsheet as well.

www.DNAadoption.com has some good courses.  Their DNA for beginners covers using spreadsheets and is not just for adoptees!

10 thoughts on “Concepts – Sorting Spreadsheets for Autosomal DNA

  1. I used to have to teach these things also. What I used to tell people was to make sure that when you hit the wrong button and it makes you say, “Oh, crap!” don’t start clicking around wildly to fix it. SIT ON YOUR HANDS! Once you calm down, you can hit the “undo” button, but if you panic, you’re SOL.

  2. I always wondered why you sorted in reverse order. Now I know why. However, it is a lot easier to select the sort and filter menu, then select custom sort. In the first sort item use “sort by” “chromosome”. Then select “add level” and in the “then by” select “Start Location.” The final step is to select “add level” and in the new “then by” block select “End Location.” The other columns should read to sort on values and “smallest to largest.” This process does the same as your process, but does it in a single sort process. You can also filter to hide all lines with a cM value below 7 or some other value. I often use 10 cM to thin out the list of folks to look at. (The filter does not delete the other records, but simply hides them.
    After sorting, I would suggest saving to an XLSX file and not back to the CSV file that you likely downloaded. You can also use a table graphic to display the overlaps for each chromosome, but it is a bit of a pain to do in Excel and not something I am great at. You have to make some changes to the display and such for it to be useful. This is why I wrote a program to do the sort, filter and display. The end result is a file in TXT format (can be imported to Excel) with the names and such. The display gives a choice of various formats (I usually used PDF and then combine the graphics for all the chromosomes in Acrobat to get a single graphic file. Yes, I go back to the TXT file and ICW (as well as family trees) to identify paternal/maternal and potential MRCA. However, the graphic is nice to visualize the overlaps.

  3. Pingback: Concepts – Managing Autosomal DNA Matches – Step 2 – Updating Match Spreadsheets, Bucketed Family Finder Matches and Pileups | DNAeXplained – Genetic Genealogy

  4. There’s many different ways to sort. I go to the “Data” tab on the top line menu in Excel and chose “Sort” to get to the custom sort dialog box. I agree with wadavis that it is potentially easier and clearer to sort using the dialog box then the other method described.

  5. Pingback: 2016 Genetic Genealogy Retrospective | DNAeXplained – Genetic Genealogy

  6. Pingback: The Concepts Series | DNAeXplained – Genetic Genealogy

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s