Determining Ethnicity Percentages

Recently, as a comment to one of my blog postings, someone asked how the testing companies can reach so far back in time and tell you about your ancestors.  Great question.

The tests that reliably reach the furthest back, of course, are the direct line Y-Line and mitochondrial DNA tests, but the commenter was really asking about the ethnicity predictions.  Those tests are known as BGA, or biogeographical ancestry tests, but most people just think of them or refer to them as the ethnicity tests.

Currently, Family Tree DNA, 23andMe and Ancestry.com all provide this function as a part of their autosomal product along with the Genographic 2.0 test.  In addition, third party tools available at www.gedmatch.com don’t provide testing, but allow you to expand what you can learn with their admixture tools if you upload your raw data files to their site.  I wrote about how to use these ethnicity tools in “The Autosomal Me” series.  I’ve also written about how accurate ethnicity predictions from testing companies are, or aren’t, here, here and here.

But today, I’d like to just briefly review the 3 steps in ethnicity prediction, and how those steps are accomplished.  It’s simple, really, in concept, but like everything else, the devil is in the details.

There are three fundamental steps.

  • Creation of the underlying population data base.
  • Individual DNA extraction.
  • Comparison to the underlying population data base.

Step 1:  Creation of the underlying population data base.

Don’t we wish this was as simple as it sounds.  It isn’t.  In fact, this step is the underpinnings of the accuracy of the ethnicity predictions.  The old GIGO (garbage in, garbage out) concept applies here.

How do researchers today obtain samples of what ancestral populations looked like, genetically?  Of course, the evident answer is through burials, but burials are not only few and far between, the DNA often does not amplify, or isn’t obtainable at all, and when it is, we really don’t have any way to know if we have a representative sample of the indigenous population (at that point in time) or a group of travelers passing through.  So, by and large, with few exceptions, ancient DNA isn’t a readily available option.

The second way to obtain this type of information is to sample current populations, preferably ones in isolated regions, not prone to in-movement, like small villages in mountain valleys, for example, that have been stable “forever.”  This is the approach the National Geographic Society takes and a good part of what the Genographic Geno 2.0 project funding does.  Indigenous populations are in most cases our most reliable link to the past.  These resources, combined with what we know about population movement and history, are very telling.  In fact, National Geographic included over 75,000 AIMs (Ancestrally Informative Markers) on the Geno 2.0 chip when it was released.

The third way to obtain this type of information is by inference.  Both Ancestry.com and 23andMe do some of this.  Ancestry released its V2 ethnicity updates this week, and as a part of that update, they included a white paper available to DNA participants.  In that paper, Ancestry discusses their process for utilizing contributed pedigree charts and states that, aside from immigrant locations, such as the United States and Canada, a common location for 4 grandparents is sufficient information to include that individual’s DNA as “native” to that location.  Ancestry used 3000 samples in their new ethnicity predictions to cover 26 geographic locations.  That’s only 115 samples, on average, per location to represent all of that population.  That’s pretty slim pickins.  Their most highly represented area is Eastern Europe with 432 samples and the least represented is Mali with 16.  The regions they cover are shown below.

ancestry v2 8

Survey Monkey, a widely utilized web survey company, in their FAQ about Survey Size For Accuracy provides guidelines for obtaining a representative sample.  Take a look.  No matter which calculations you use relative to acceptable Margin of Error and Confidence Level, Ancestry’s sample size is extremely light.
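To put rough numbers on that, here is a minimal sketch using the textbook margin-of-error formula for a proportion estimated from a simple random sample of a large population.  The 95% confidence level (z = 1.96) and the worst-case proportion of 0.5 are my assumptions for illustration, and whether survey-style sampling math applies cleanly to genetic reference panels is itself an assumption.  The sample sizes are the ones quoted above from Ancestry’s white paper.

```python
import math

def margin_of_error(sample_size, z=1.96, proportion=0.5):
    """Approximate margin of error for a proportion from a simple random sample.

    z=1.96 corresponds to a 95% confidence level; proportion=0.5 is the
    worst case (widest margin). Both defaults are illustrative assumptions.
    """
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

# Sample sizes quoted in the text: average per region, largest and smallest region.
for label, n in [("average region", 115), ("Eastern Europe", 432), ("Mali", 16)]:
    print(f"{label}: n={n}, margin of error about {margin_of_error(n):.1%}")
```

Under those assumptions, the smaller the reference sample for a region, the wider the uncertainty around any frequency estimated from it.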

23andMe states in their FAQ that their ethnicity prediction, called Ancestry Composition covers 22 reference populations and that they utilize public reference datasets in addition to their clients’ with known ancestry.

23andMe asks geographic ancestry questions of their customers in the “where are you from” survey, then incorporates the results of individuals with all 4 grandparents from a particular country.  One of the ways they utilize this data is to show you where on your chromosomes you match people whose 4 grandparents are from the same country.  In their tutorial, they do caution that just because a grandparent was born in a particular location doesn’t necessarily mean that they were originally from that location.  This is particularly true in the past few generations, since the industrial revolution.  However, it may still be a useful tool, when taken with the requisite grain of salt.

23andme 4 grandparents

A fourth way of creating the underlying population data base is to utilize academically published information or information otherwise available.  For example, the Human Genome Diversity Project (HGDP) information which represents 1050 individuals from 52 world populations is available for scrutiny.  Ancestry, in their paper, states that they utilized the HGDP data in addition to their own customer database as well as the Sorenson data, which they recently purchased.

Academically published articles are available as well.  Family Tree DNA utilizes 52 different populations in their reference data base.  They utilize published academic papers and the specific list is provided in their FAQ.

As you can see, there are different approaches and tools.  Depending on which of these tools are utilized, the underlying data base may look dramatically different, and the information held in the underlying data base will assuredly affect the results.

Step 2:  Your Individual DNA Extraction

This is actually the easy part – where you send your swab or spit off to the lab and have it processed.  All three of the main players utilize chip technology today.  For example, 23andMe focuses on and therefore utilizes medical SNPs, where Family Tree DNA actively avoids anything that reports medical information, and does not utilize those SNPs.

In Ancestry’s white paper, they provide an excellent graphic of how, at the molecular level, your DNA begins to provide information about the geographic location of your ancestors.  At each DNA location, or address, you have two alleles, one from each parent.  These alleles can have one of 4 values, or nucleotides, at each location, represented by the abbreviations T, A, C and G, short for Thymine, Adenine, Cytosine and Guanine.  Based on their values, and how frequently those values are found in comparison populations, we begin to find correlations in geography, which takes us to the next step.

ancestry allele snps

Step 3:  Comparison to Underlying Population Data Base

Now that we have the two individual components in our recipe for ethnicity, a population reference set and your DNA results, we need to combine them.

After DNA extraction, your individual results are compared to the underlying data base.  Of course, the accuracy will depend on the quality, diversity, coverage and quantity of the underlying data base, and it will also depend on how many markers are being utilized or compared.

For example, Family Tree DNA utilizes about 295,000 out of 710,000 autosomal SNPs tested for ethnicity prediction.  Ancestry’s V1 product utilized about 30,000, but that has increased now to about 300,000 in the 2.0 version.

When comparing your alleles to the underlying data set one by one, patterns emerge, and it’s the patterns that are important.  To begin with, T, A, C and G are not absent entirely in any population, so looking at the results, it then becomes a statistics game.  This means that, as Ancestry’s graphic, above, shows, it becomes a matter of relativity (pardon the pun), and a matter of percentages.

For example, if the A allele above is shown in high frequencies in Eastern Europe, but in lower frequencies elsewhere, that’s good data, but may not by itself be relevant.  However, if an entire segment of locations, like a street of DNA addresses, are found in high percentages in Eastern Europe, then that begins to be a pattern.  If you have several streets in the city of You that are from Eastern Europe, then that suggests strongly that some of your ancestors were from that region.
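The “statistics game” can be sketched with a toy example.  This is not any vendor’s actual method, and every frequency below is made up for illustration.  It simply asks which region’s allele frequencies make one person’s results at a short street of DNA addresses most probable, treating the two alleles at each address as independent draws.

```python
import math

# Hypothetical frequencies of the "A" allele at three neighboring DNA
# addresses, per region. These numbers are invented for illustration only.
ALLELE_A_FREQUENCY = {
    "Eastern Europe": [0.80, 0.75, 0.70],
    "West Africa": [0.30, 0.40, 0.35],
    "East Asia": [0.50, 0.45, 0.55],
}

def log_likelihood(a_counts, frequencies):
    """Log-probability of seeing the given number of A alleles (0, 1 or 2) at each address."""
    total = 0.0
    for count, p in zip(a_counts, frequencies):
        # Two alleles per address, one from each parent (binomial with n=2).
        total += math.log(math.comb(2, count) * p**count * (1 - p) ** (2 - count))
    return total

def best_matching_region(a_counts):
    """Return the region with the highest log-likelihood, plus all the scores."""
    scores = {
        region: log_likelihood(a_counts, freqs)
        for region, freqs in ALLELE_A_FREQUENCY.items()
    }
    return max(scores, key=scores.get), scores

person = [2, 2, 1]  # number of A alleles this person carries at each address
region, scores = best_matching_region(person)
print(region)
for name, score in scores.items():
    print(f"  {name}: {score:.2f}")
```

No single address settles anything, since every allele is present in every region.  It is the combined score across the whole street that starts to point somewhere.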

To show this in more detailed format, I’m shifting to the third party tool, GedMatch and one of their admixture tools.  I utilized this when writing the series, “The Autosomal Me” and in Part 2, “The Ancestor’s Speak,” I showed this example segment of DNA.

On the graph below, which is my chromosome painting of a small part of one of my chromosomes on the top, and my mother’s showing the exact same segment on the bottom, the various types of ethnicity are colored, or painted.

The grid shows location, or address, 120 on the chromosome and each tick mark is another number, so 121, 122, etc.   It’s numbered so we can keep track of where we are on the chromosome.

You can readily see that both of us have a primary ethnicity of North European, shown by the teal.  This means that for this entire segment, the results are that our alleles are found in the highest frequencies in that region.

Gedmatch me mom

However, notice the South Asian, East Asian, Caucasus, and North Amerindian. The important part to notice here, other than that I didn’t inherit much of that segment at 123-127 from her, except for a small part of East Asian, is that these minority ethnicities tend to nest together.  Of course, this makes sense if you think about it.  Native Americans would carry Asian DNA, because that is where their ancestors lived.  By the same token, so would Germans and Polish people, given the history of invasion by the Mongols. Well, now, that’s kind of a monkey-wrench isn’t it???

This illustrates why the results may sometimes be confusing as well as how difficult it is to “identify” an ethnicity.  Furthermore, small segments such as this are often “not reported” by the testing companies because they fall under the “noise” threshold of between about 5 and 7cM, depending on the company, unless there are a lot of them and together they add up to be substantial.

In Summary

In an ideal world, we would have one resource that combines all of these tools.  Of course, these companies are “for profit,” except for National Geographic, and they are not going to be sharing their resources anytime soon.

I think it’s clear that the underlying data bases need to be expanded substantially.  The reliability of utilizing contributed pedigrees as representative of a population indigenous to an area is also questionable, especially pedigrees that only reach back two generations.

All of these tools are still in their infancy.  Both Ancestry and Family Tree DNA’s ethnicity tools are labeled as Beta.  There is useful information to be gleaned, but don’t take the results too seriously.  Look at them more as establishing a pattern.  If you want to take a deeper dive by utilizing your raw data and downloading it to GedMatch, you can certainly do so. The Autosomal Me series shows you how.

Just keep in mind that with ethnicity predictions, with all of the vendors, as is particularly evident when comparing results from multiple vendors, “your mileage may vary.”  Now you know why!

Autosomal DNA, Ancient Ancestors, Ethnicity and the Dandelion

 dandelion 1

Understanding our own ancient DNA is a little different than contemporary DNA that we use for genealogy, but it’s a continuum between the two with a very long umbilical cord between them, then, and now.  And just when you think you’re about to understand autosomal DNA transmission and how it works, the subject of ancient DNA comes up.  This is particularly perplexing when all you wanted in the first place was a simple answer to the question, “who am I and who were my ancestors?”  Well, as you’ve probably figured out by now, there is no simple answer.

Inheritance

In a nutshell – we know that every generation gets divided by 50% when we’re talking about autosomal DNA transmission.

So you inherit 50% of the DNA of each of your parents.  They inherited 50% of the DNA of each of their parents, so you inherit ABOUT 25% of the DNA of each of your grandparents.

Did you see that word, about?  It’s important, because while you do inherit exactly 50% of the DNA of each parent, you don’t inherit exactly 25% of the DNA of each grandparent.  You can inherit a little less or a little more from either grandparent, because the 50% you receive from each parent is a mix of what that parent received from his or her own two parents, and it all goes in the mixer.

This is also true for the 12.5% of each of your great-grandparents, and the 6.25% of each of your great-great-grandparents, and so forth, on up the line.

The chart below shows the percentages that you share from each generation.

Relationship to You                      Approximate % Of Their DNA You Share
Parents                                  Exactly 50%
Grandparents                             25%
Great-grandparents                       12.5%
Great-great-grandparents                 6.25%
Great-great-great-grandparents           3.125%
Great-great-great-great-grandparents     1.5625%
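The percentages in the chart are just repeated halving.  A minimal sketch, with function and variable names of my own choosing:

```python
def expected_share_percent(generations_back):
    """Average percentage of DNA inherited from one ancestor this many generations back."""
    return 100 / 2**generations_back

RELATIONSHIPS = [
    "Parents",
    "Grandparents",
    "Great-grandparents",
    "Great-great-grandparents",
    "Great-great-great-grandparents",
    "Great-great-great-great-grandparents",
]

for generation, relationship in enumerate(RELATIONSHIPS, start=1):
    print(f"{relationship}: {expected_share_percent(generation)}%")
```

Remember that only the parental 50% is exact.  Everything further back is an average.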

Ethnicity

So, here’s the question posed by people trying to understand their ethnicity.

If I have 3% Melanesian (or Middle Eastern, Indo-Tibetan or fill-in-the-blank ethnicity), doesn’t that mean that one of my great-great-great-grandparents was Melanesian?

There are really two answers to this question.  (I can hear you groaning!!!)

If the amount is substantial, say 25%, rather than very small, then the answer would be yes, a recent ancestor from that region is very likely what this is telling you.  At 25%, that would be one grandparent.  Or maybe it’s telling you that you have two different great-grandparents who contributed 12.5% each, but either way those relatives are fairly close in time, given the amount of DNA that came from that region.  See, that was easy.

However, the answer changes when we’re down in the very small percentages, below 5%, often in the 1 and 2% range.  This answer isn’t nearly as straightforward.

The Dandelion – Your Ancestor

The answer is the dandelion.

dandelion 2

The dandelion is one of your ancestors who lived in the Middle East, let’s say, 20,000 years ago, maybe 30,000 years ago.  In case you’re counting generations, that is 800 to 1200 generations ago.  The percentage of DNA you would carry from a single ancestor who lived 20,000 years ago, assuming you only descended from that ancestor 1 time, is infinitesimally small.  There are more zeroes following that decimal point than I have patience to type.  Let’s call that ancestor Xenia and let’s say she is a female.

However, you did inherit DNA from many of your ancestors who lived 20,000 years ago, thousands of them, because all of them, through their descendants, make up the DNA you carry today.  So infinitesimally small or not, you do carry some of the DNA of some of those ancestors.  It’s just broken into extremely small pieces today and their individual contributions to you may be extremely small.  You don’t carry any DNA from some of them, actually, probably most of them, due to the recombination event, dividing their DNA in half, happening 800 times, give or take.

Now, given that your ancestors’ DNA is divided in every generation by approximately half, and we know there are about 3 billion base pairs on all of your chromosomes combined, this means that by generation 32 or 33, on average, you carry about 1 base pair from this ancestor.  By generation 45, you carry, on average, about .00017 base pairs of this ancestor’s DNA.  And for those math aficionados among us, this is the mathematical notation for how much of our ancestor’s DNA we carry after 800 generations: 4.4991E-232.
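For anyone who wants to check the arithmetic, here is a minimal sketch that halves 3 billion base pairs a given number of times.  The names are mine.  Note that the answer depends on whether you count yourself as generation 0 or generation 1, which shifts every figure by a factor of two, so the code takes the number of halvings directly.

```python
GENOME_BASE_PAIRS = 3_000_000_000  # approximate figure quoted in the text

def expected_base_pairs(halvings):
    """Average number of base pairs left from one ancestor after this many halvings."""
    return GENOME_BASE_PAIRS / 2**halvings

for halvings in (31, 32, 44, 800):
    print(f"{halvings} halvings: {expected_base_pairs(halvings):.4e} base pairs")
```

The average drops below a single base pair after 32 halvings, and after 800 halvings it is the vanishingly small number quoted above.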

But, we also know that this dividing in half, on the average, doesn’t always work exactly that way in reality, because some of those ancestors from 20,000 years ago did in fact pass their DNA to you, despite the infinitesimal odds against that happening.  Some of their DNA was passed intact generation after generation, to you, and you carry it today.  The DNA contributed by any one ancestor from 800 generations ago is probably limited to one or two locations, or bases, but still, it’s there, and it’s the combined DNA of those ancient ancestors that make us who we are today.

The autosomal DNA of any specific ancestor from long ago is probably too small and fragmented to recognize as “theirs” and attribute to them.  Of course, the beauty of Y DNA and mitochondrial is that it is passed intact for all of those generations.  But for autosomal DNA and genealogy, we need hundreds of thousands of DNA pieces in a row from a particular ancestor to be recognizable as “theirs.”  When we measure DNA for genealogy, what we are measuring is both centiMorgans, a measure of distance between chromosome positions (length) and the number of contiguous SNP (Single Nucleotide Polymorphism) base locations that match (quantity).  The values from these calculations tell us how closely we are related to people, because remember, DNA is divided in each generation so there is a mathematically predictable amount we will share with specific relatives.

Here is an example from a Family Finder comparison table showing both centiMorgans and matching SNPs with a second cousin.

family finder table

The matching threshold for genealogical significance is either 5 or 7 cM depending on which of the major companies you are using.  At Family Tree DNA, if you match above the threshold, then you can view down to 1cM, which is the case above.  Another match criterion is the number of SNPs, or locations, matching contiguously.  Anything below about 500-800 is considered to be a population match, not a genealogical match, unless you also have a significant number of genealogical matches at higher cMs and segments with this person.

OK, where is all of this going?

Dispersion

Think of your ancestor 20,000 years ago as the dandelion.  Now, blow.

dandelion 3

Xenia lived in the Middle East.  Where might her descendants land, over time, with every new generation?  In Europe?  In Asia?  In India?  In America via the Native Americans through Asia?  In North Africa?  Where?

So let’s say that groups of descendants settle across the globe.  Let’s say that her mitochondrial haplogroup is X.  Yes, haplogroup X is found both in Europe and in Asia and in the Native Americans, so this is actually a good example.  So Xenia carried mitochondrial haplogroup X and we know for sure via mitochondrial DNA testing that indeed, Xenia’s seeds were scattered to all of the winds.  The only places we haven’t found Xenia’s children are sub-Saharan Africa and the Australian archipelago, at least not yet.

Ok, so now that we know where her children and their children went, let’s go back to ancient DNA.

Predictive DNA

The way ethnicity is determined is by studying the frequency with which a specific allele or group of alleles is found in any particular population.  Two “pure” examples come to mind.

The first example is the Duffy Null allele that is only found in sub-Saharan African populations.  Currently this marker is found in about 68% of American blacks and in 88-100% of African blacks.  If you have the Duffy Null allele, you have African heritage.  Of course, you don’t know which line or which ancestor it came from, or how far back in time, but it assures you that you do in fact have African heritage.  It could have been from an ancestor long ago.  It could have been very recent.  This is one of the factors considered when determining percentage of ethnicity.

A second example is the STR marker known as D9S919 which is present in about 30% of the Native American people.  The value of 9 at this marker is not known to be present in any other ethnic group, so this mutation occurred after the Native people migrated across Beringia into the Americas, but long enough ago to be present in many descendants.  There is also no other known marker that is found only among Native Americans, although I expect as we move into full genome sequencing we will discover more.  You can test this marker individually at Family Tree DNA, which is the only lab that offers this test.  If you have the value of 9 at this marker, it confirms Native heritage, but if you don’t carry 9, it does NOT disprove Native heritage.  After all, many Native people don’t carry it.  Again, you don’t know how long ago this marker was introduced into your ancestry.

These two examples are unusual because the markers are found only in certain groups.  Generally, the rest of the DNA values are found in different amounts, or frequencies, in different parts of the world and ethnic groups.

So, if you’re trying to determine the ethnicity of an individual, you’re going to compile a huge data base of how frequently the DNA values of Ancestrally Informative Markers (AIMs) are found in different parts of the world.

So, you would compare the participant’s values against your data base and you will come up with those regions or ethnicities that are present most often in your comparison.  This is exactly what the products and services that provide you with your ethnicity percentages do – and how accurate the results are depends heavily on the data base itself, the amount of data, and the quality of data.  Dare I mention Ancestry’s issue that they’ve had since they first began offering their autosomal product over a year ago where everyone seems to have Scandinavian ancestry?  Ancestry doesn’t share with us their sources, so as a community we have no idea how they have come up with these numbers.

You can easily compare your autosomal results in nauseating detail at both 23andMe and Family Tree DNA by testing with both companies, or by testing with either 23andMe or Ancestry and transferring your autosomal results to Family Tree DNA.  All 3 of these companies will give you a somewhat different result, but they should be in the same ballpark.  You can also then download your raw data file from any of those vendors and upload it to www.gedmatch.com where you can then do ethnicity comparisons using a variety of tools.  These tools, an example shown below, will have much more variance and detail than the vendor’s tools or results.  And because of that, they tend to be more confusing as well.

gedmatch example

Many people with small amounts of minority admixture are disappointed with the results through the vendors, especially if their Native American admixture doesn’t show.  I wrote extensively about this in my series, The Autosomal Me, so I won’t rehash it here, but using the GedMatch tools is very enlightening, as you can see above with my results.  And do I really have Indo-Tibetan and Indo-Iranian ancestors?

Where’s Xenia?

Back to Xenia and her descendants.  Let’s say that Xenia’s descendants settled in four primary locations.  One is in the Middle East – they never left home.  One is in Asia, and from there one group moved on to the Americas to become the Native Americans, and lastly, one is in Europe.  Now let’s say there is a pocket of them in the Altai region of Asia and a pocket in France.  The Altai is the ancestral home of the Native Americans and could explain the Indo-Tibet result, above.  We’ll call that Central Asia.  And France is where my Acadian ancestors were from.  Hmmm….this is getting confusing.  To make matters even more confusing, I might well descend from both groups, who originally descended from Xenia.

Let’s say that I do in fact carry small segments of Xenia’s DNA.  Now let’s say that this same DNA is found in a group of people in Central Asia, maybe in Tibet, it’s published in an obscure journal someplace, and it finds its way into a data base.  Voila – there you go – I now have a match in Central Asia in a place called Indo-Tibet.  But do I really?

Does this mean that my ancestor was from Central Asia?  Not necessarily.  And if so, maybe not recently, but the people from that location for some reason share some of the DNA that I carry.  The question of course is why, how and when?

What this really means to you is a matter of degrees.  If you have a few matches from obscure regions, along with very small percentages, it is likely a result of the dandelion’s dispersion.  If you have a lot of matches, meaning a high percentage hit rate, from a particular region, pay attention, it probably has some genealogical significance.

It’s no wonder people are confused by this!  Now, just think how many dandelions you have.  In 15 generations, you have 32,768 ancestors.  In fact, this is how we know for sure that we all descend from the same ancestor multiple times.  Our number of ancestors quickly exceeds the world population.  In 30 generations (at 25 years each), in about the year 1263, we reach about 1 billion ancestors.  In 1750, there were 791 million people on Earth, in 1600, 580 million, in 1500, 458 million and in 1000, 310 million.
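The doubling is easy to reproduce.  This is a minimal sketch in which the starting year of 2013 is my assumption, chosen because it makes 30 generations of 25 years land on 1263.  The population figures are the ones quoted above, and the names are mine.

```python
GENERATION_YEARS = 25
BASE_YEAR = 2013  # assumption: the year from which generations are counted back

# World population estimates quoted in the text, keyed by year.
WORLD_POPULATION = {
    1750: 791_000_000,
    1600: 580_000_000,
    1500: 458_000_000,
    1000: 310_000_000,
}

def ancestor_slots(generations_back):
    """Number of positions in your pedigree at this generation (not distinct people)."""
    return 2**generations_back

def approximate_year(generations_back):
    """Rough calendar year for a generation, at 25 years per generation."""
    return BASE_YEAR - generations_back * GENERATION_YEARS

for generations in (15, 30, 40):
    print(
        f"{generations} generations (about {approximate_year(generations)}): "
        f"{ancestor_slots(generations):,} ancestor slots"
    )
```

Once the number of slots in the pedigree is larger than the number of people alive at the time, the same individuals must be filling many of those slots.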

Ancestors - Years

We know that we very likely descend several times from a much smaller group of ancestors from isolated local populations.  However, just looking at the 32,000+ ancestors in 15 generations, it’s still an entire dandelion field!!!


Triangulation for Y DNA

Based on the number of questions I receive about triangulation, it’s time to write an article.

There are two kinds of triangulation that we use in genetic genealogy.  One type is for the Y chromosome and it’s to determine the original values of the DNA of the common ancestor.  The second type of triangulation is for autosomal DNA and it’s to determine if you share a common ancestor with someone and what the DNA of that ancestor looked like.

This article is about the first type, for Y DNA.

Why would you want to use triangulation?

Sometimes in order to know if a particular line has descended from an ancestor, you need to know what that ancestor’s Y DNA marker values were.

For example, if you have an ancestor born in the 1600s, and he had two sons whose descendants tested today, each line could have accumulated 4 mutations, or even 6, which could put the pair over the matching software’s threshold – meaning they might not be reported as matches.  We have this situation in one of the Estes lines that seems to be particularly prone to mutate.

Family Tree DNA has set up match thresholds.  For someone to be listed as your match, they need to have no more than the following total number of mutations difference from your results.

Markers in Panel Tested     Maximum Number of Mutations Allowed
12                          0 unless in a common project, then 1
25                          2
37                          4
67                          7
111                         10

So you can see that if you have a high number of mutations in the first panel or two, you might not show as a match.
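The thresholds above amount to a simple lookup.  Here is a minimal sketch of that rule as I’ve described it, not Family Tree DNA’s actual software, with function and parameter names of my own.

```python
# Maximum mutations allowed per panel, taken from the table in the text.
MAX_MUTATIONS = {12: 0, 25: 2, 37: 4, 67: 7, 111: 10}

def is_listed_as_match(markers_tested, mutations, share_project=False):
    """Return True if two testers would be listed as matches at this panel size."""
    allowed = MAX_MUTATIONS[markers_tested]
    # At 12 markers, one mutation is tolerated only within a common project.
    if markers_tested == 12 and share_project:
        allowed = 1
    return mutations <= allowed

# Two lines that each picked up 4 mutations at different markers
# differ from each other by 8, which is over the 67 marker limit.
print(is_listed_as_match(67, 8))
# Either line compared to the ancestor directly differs by only 4.
print(is_listed_as_match(67, 4))
```

That second comparison is the whole argument for comparing everyone to the reconstructed ancestor rather than to each other.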

But if you know what the original ancestor’s Y-line DNA looks like, then it’s easy to tell that they really are matches and that both lines have simply had several mutations.

It’s much more accurate to compare everyone to the original ancestor instead of trying to compare them to each other.

Let’s take a look at the Estes project by way of example.

Abraham Estes, the progenitor of the Southern Estes line was born in 1647 in Nonington, Kent, England.  He immigrated to Virginia in 1683 and began begetting shortly thereafter.  His wife was Barbara, and although the internet is full of family trees that say her last name is Brock, there is not one shred of evidence to support that.  In any case, Abraham and Barbara had a total of 8 sons who lived and the sons had about 42 sons, so we have a good number of Estes families throughout the US today, mostly descending from Abraham.  There is also a northern line founded by Abraham’s cousin, Richard Estes although they don’t have nearly as many descendants.

triangulation Y dna

This chart shows the results of DNA testing through 7 different Estes lines, 6 of which descend from Abraham’s sons and one of which descends from the Northern line.

The green row at the top is Abraham’s reconstructed DNA, and now, everyone in the project gets compared to Abraham on my spreadsheet.

It’s easy to see how this is done.  For each marker, beginning with 393, we determine what the normal value is for the family.  For marker 393, all lines carry a value of 13.  One line, John through Elisha, shows a mutation to a value of 14 which would signal a line marker mutation for this particular line.  This is quite useful, because when we see someone who carries a value of 14 at this location, especially in conjunction with any other line marker mutations that might exist in that line, like a value of 11 at marker 391, we know where to look genealogically to find the tester’s place in the family.  Line marker mutations are great guideposts.

So, marker by marker, I’ve reconstructed Abraham, shown at the top in green.
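The marker by marker process can be sketched as taking the most common value across the lines at each marker.  This is my reading of the process, simplified, and the data below is hypothetical.  Only the values 13 and 14 at marker 393, 11 and 12 at 391 and 25 at 390 come from the discussion here.  The line labels “Line 1” through “Line 3” and the value 24 are invented for illustration.

```python
from collections import Counter

# Hypothetical marker values, loosely modeled on the description above.
LINE_RESULTS = {
    "Line 1": {"393": 13, "390": 25, "391": 12},
    "Line 2": {"393": 13, "390": 25, "391": 12},
    "Line 3": {"393": 13, "390": 24, "391": 12},
    "John through Elisha": {"393": 14, "390": 25, "391": 11},
}

def reconstruct_ancestor(results):
    """Take the most common value at each marker as the ancestral value."""
    markers = next(iter(results.values())).keys()
    return {
        marker: Counter(line[marker] for line in results.values()).most_common(1)[0][0]
        for marker in markers
    }

def line_marker_mutations(results, ancestor):
    """For each line, list the markers where it differs from the reconstructed ancestor."""
    return {
        name: {m: v for m, v in values.items() if v != ancestor[m]}
        for name, values in results.items()
    }

ancestor = reconstruct_ancestor(LINE_RESULTS)
mutations = line_marker_mutations(LINE_RESULTS, ancestor)
print(ancestor)
print(mutations["John through Elisha"])
```

A simple majority vote like this breaks down when lines are tied or when only two lines have tested, which is where the genealogy has to help decide.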

Marker Frequency

You might wonder why the value of 25 at 390 is red and underscored and 12 at 391 is bolded, red and underscored.

One of the things I do for each of my family lines, and for clients who order Personalized DNA Reports, is to determine which of their markers carry rare values.  In this case, the value of 25 at 390 is found in only 16% of haplogroup R1b1a2.  The value of 12 at 391 is found in only 4% of the haplogroup R1b1a2 population.  My threshold for rare markers is less than 25% and for very rare, 6% or less.  Bold red indicates very rare, red indicates rare and the underscore is present so that people printing in black and white can see the difference.
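Those thresholds translate directly into a small rule.  A minimal sketch follows; the label “common” for everything at 25% or above is my own shorthand.

```python
def classify_rarity(percent_of_haplogroup):
    """Classify a marker value by how often it occurs within the haplogroup."""
    if percent_of_haplogroup <= 6:
        return "very rare"  # shown bold, red and underscored
    if percent_of_haplogroup < 25:
        return "rare"  # shown red and underscored
    return "common"  # hypothetical label for values at or above 25%

# The two Estes examples from the text.
print("25 at 390:", classify_rarity(16))
print("12 at 391:", classify_rarity(4))
```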

Why and how does this make a difference?  In a situation where you’re trying to decide if someone really does match the Estes line, this information can be a big help.

The last kit on the chart does carry the Estes surname, but does not match the Estes line genetically.  This is obvious by looking at all the yellow squares, which are mismatches to Abraham, but let’s say that this person tested at 12 markers and he matched the Estes DNA on all of our rare markers, but mismatches a couple on the more common markers.  This is more likely a true Estes match than if they mismatch us on all of our rare markers.  The Estes rare markers combined create a type of family genetic fingerprint.  This is particularly important for adoptees.

And yes, to answer the next question, a Marker Frequency Table can be purchased separately for those who want their marker frequencies through 111 markers, but don’t want a Personalized DNA Report, by purchasing a Quick Consult.  A marker frequency table looks like this but extended, of course, through all of your markers:

Frequency table

Now, we know what the original Abraham Estes’s DNA looked like.  We also know which of our markers are unique.  This can also help us when comparing to other surnames we may be related to before the advent of surnames.  There is family history to be gleaned from those matches as well.

And lastly, because we also have cousin Richard’s DNA signature, we can use that information to reconstruct the common ancestor of Abraham Estes and Richard Estes, which is the grandfather of both men, Robert Estes, born 1555 in Ringwould, Kent, England.  Not bad for genetic technology, reaching back more than 450 years in time and telling us what our ancestor’s DNA looked like, and all without even reaching for a shovel.