Native American Maternal Haplogroup A2a and B2a Dispersion

A good overview of a couple of recently written genetic papers dealing with Native American ancestry was published recently.  I particularly like this overview because it’s written in plain English for the non-scientific reader.

In a nutshell, there has been an ongoing, unresolved debate about whether there was one migration into the Americas or more than one.  These papers use these terms a little differently.  They talk not only about entry into the Americas but also about dispersion within the Americas, which really is a secondary topic and happened, obviously, after the initial entry event(s).

The primary graphic in this article, shown below, from the PNAS article, shows the distribution within the Americas of Native American haplogroups A2a and B2a.

a2a, b2a

Schematic phylogeny of complete mtDNA sequences belonging to haplogroups A2a and B2a. A maximum-likelihood (ML) time scale is shown. (Inset) A list of exact age values for each clade. Credit: Copyright © PNAS, doi:10.1073/pnas.0905753107

As you can see, the locations of these haplogroups are quite different and the various distribution models set forth in the papers account for this difference in geography.

One of the aspects of this article, and the two academic papers on which it is based, that I find particularly encouraging is that the researchers are utilizing full sequence mitochondrial DNA, not just the HVR1 or HVR1+HVR2 regions, as has all too often been done in the past.  In all fairness, until rather recently, the expense of running the full sequence was quite high and there were few (if any) other results in the academic data bases for comparison.  Now the cost is quite reasonable, thanks in part to genetic genealogy and new technologies, and so the academic testing standards are changing.  Note that Alessandro Achilli, one of the authors of these papers and of others about Native Americans, also comments towards the end that full genome testing will be utilized soon.  I look forward to this new era of research, not only for Native Americans but for all of us searching for our roots.

Read the paper at:

The original academic papers are found here and here.  I encourage anyone with a serious interest in this topic to read these as well.

Ancestor of Native Americans in Asia was 30% “Western Eurasian”

The complete genome has recently been sequenced from a 4-year-old Russian boy who died 24,000 years ago near Lake Baikal in a location called Mal’ta, the area in Asia believed to be the origin of the Native Americans based on Y DNA and mitochondrial DNA similarities.  The map below, from Science News, shows the location.

malta boy map

This represents the oldest complete genome ever sequenced, except for the Neanderthal (38,000 years old) and Denisovan (41,000 years old).

This child’s genome shows that he is related closely to Native Americans, and, surprisingly, to western Asians/eastern Europeans, but not to eastern Asians, to whom Native Americans are closely related.  This implies that this child was a member of a “tribe” that had not yet merged or intermarried with the Eastern Asians (Japan, China, etc.) that then became the original Native Americans who migrated across the Beringian land bridge between about 15,000 and 20,000 years ago.

One of the most surprising results is that about 30% of this child’s genome is Eurasian, meaning from Europe and western Asia, including his Y haplogroup which was R and his mitochondrial haplogroup which was U, both today considered European.

This does not imply that R and U are Native American haplogroups or that they are found among Native American tribes before European admixture in the past several hundred years.  There is still absolutely no evidence in the Americas, in burials, for any haplogroups other than subgroups of Q and C for males and A, B, C, D, X and M (1 instance) for females.  However, that doesn’t mean that additional evidence won’t be found in the future.

While this is certainly new information, it’s not unprecedented.  Last year, in the journal Genetics, an article titled “Ancient Admixture in Human History” reported something similar, albeit gene flow in a different direction.  This paper indicated gene flow from the Lake Baikal area to Europe.  It certainly could have been bidirectional, and this new paper certainly suggests that it was.

So in essence, maybe there is a little bit of Native American in Europeans and a little bit of European in Native Americans that occurred in their deep ancestry, not in the past 500-1000 years.

What’s next?  Work continues.  The team is now attempting to sequence genomes from other skeletons from west of Mal’ta, East Asia and from the Americas as well.

You can read the article in Science Magazine.  An academic article presenting their findings in detail will be published shortly in Nature.

A podcast with Michael Balter discussing the recent discovery can be heard here.

Human Genetics Revolution Tells Us That Men and Women Are Not the Same

Stop laughing.  I know, my initial reaction too was, “really – it took genetics to tell us that?”  But this is serious….really.

Males are 99.9% the same when compared to other males, and females are as well when compared to other females, but males and females are only 98.5% equal to each other – outside of the X and Y chromosomes.  The genetic difference between men and women is 15 times greater than between two men or two women.  In fact, it’s equal to that of men and male chimpanzees.  So men really are from….never mind.  It’s OK to laugh now…

men-women 1

We’ve been taught that other than X and Y, males and females are genetically exactly the same.  They aren’t.

men-women 2

Does this matter?  Dr. David Page, Director of the Whitehead Institute and MacArthur Genius Grant winner, says it absolutely does.  He has discovered that both the X and Y chromosomes function throughout the entire body, not just within the reproductive tract.

In his words, “Human Genome, we have a problem.”  Medicine and research fail to take into account this most fundamental difference.  We aren’t unisex, and our bodies know this – every cell knows it at the molecular level, according to Dr. Page.

For example, some non-reproductive tract diseases appear in vastly different percentages in men and women.  Autism is found in 5 times as many males as females, Lupus in 6 times as many women as men and Rheumatoid Arthritis in 5 times as many women as men.  In other diseases, men and women either react differently to disease treatment, react differently to the disease itself, or both.  Dr. Page explains more and suggests a way forward in this short but very informative video.

About Dr. David Page:

David Page, Director of the Whitehead Institute and professor of biology at MIT, has shaped modern genomics and mapped the Y chromosome.  His renowned studies of the sex chromosomes have shaped modern understandings of reproductive health, fertility and sex disorders.

Why Are My Predicted Cousin Relationships Wrong?

The answer is, because inherited DNA segments do not always follow the 50% rule.  I guess maybe no one told them???

Many times, when we receive our autosomal DNA results, we wonder why predicted relationships, particularly distant ones, aren’t accurate.  Sometimes people estimated to be 3rd cousins, or maybe 2nd to 4th cousins, turn out to be 6th cousins, for example.  This happens because genetic predictions must use math models and averages, but our actual DNA doesn’t follow those rules.

Dr. Steve Mount is an Associate Professor of Cell Biology and Molecular Genetics at the University of Maryland.  In February 2011, he wrote an article about his experience submitting his DNA to 23andMe and his experiences matching his cousins.  More specifically, he became interested in one particular segment of DNA trackable to a specific ancestor.

He shares these insights.

  • Distant relatives (4th cousins and beyond) often share no genetic material at all.
  • It is possible to share a segment with very distant relatives.
  • Sometimes, more distant relationships are more likely.
  • Most of your relatives may be descended from a small fraction of your ancestors.

In genetic genealogy, people who deal with autosomal DNA spend a lot of time trying to figure out which segments are IBD vs IBS – Identical by Descent versus Identical by State.  In layman’s terms, identical by descent means that you do in fact share a common ancestor in a timeframe in which you might be able to identify them.  Identical by state really implies, technically, that you just happen to have the same DNA due to spontaneous mutations, not because you share a common ancestor.  In reality, it’s taken to mean that you descend from a common population –  in other words, you do share a common ancestor but the segment is so small that it implies that the ancestor is so far back in time that you can’t possibly identify them.  Some people call these matches “false positives,” which really isn’t accurate.

Far from being useless, these small segments are very useful in identifying different ethnic populations found in your ancestral tree and can, often in conjunction with larger segments, also be useful in identifying ancestral lines.  Discounting small segments, especially if you share a common ancestor, is akin to throwing away pennies because they aren’t as useful and are more difficult to manage than quarters or dollars.  Furthermore, small segments may be our only way of identifying ancestors that are many generations back in our tree.  After all, we inherited all of our DNA from some ancestor, no matter how small the segments are today.

Because we have no better rule of thumb (or statistical model), we utilize the theory that one inherits about 50% of the DNA of each ancestor in each generation.  We know this is absolutely true for what you receive from Mom and Dad, 50% from each, but you don’t receive exactly 25% of each of your grandparents’ DNA.  However, the amount of each grandparent’s DNA that you do inherit is approximately 25%, and which pieces you receive appears to be random, like a card shuffle.  If it’s not random, we don’t know what the rules of inheritance are.
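To make the rule of thumb concrete, here is a minimal sketch of the arithmetic in Python.  It assumes a strict halving in every generation and, for cousins, full cousins who descend from a single ancestral couple.  The function names are made up for illustration, and real inheritance wanders around these averages.

```python
def expected_ancestor_share(generations_back: int) -> float:
    """Average fraction of your DNA from one ancestor (1 = parent, 2 = grandparent, ...)."""
    return 0.5 ** generations_back


def expected_cousin_share(degree: int) -> float:
    """Average fraction of DNA shared by full cousins (1 = first cousins, 2 = second, ...)."""
    return 0.5 ** (2 * degree + 1)


# Averages only: the rule of thumb says nothing about how far real results stray.
for generations in range(1, 6):
    share = expected_ancestor_share(generations)
    print(f"Ancestor {generations} generation(s) back: {share:.2%}")

for degree in range(1, 5):
    share = expected_cousin_share(degree)
    print(f"Cousins of degree {degree}: {share:.2%}")
```

Under these assumptions a grandparent averages 25%, first cousins average 12.5%, and fourth cousins average well under 1%.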

In the past few years, as we’ve come to work more closely with autosomal results, we have learned that while the rules of thumb about how much DNA you inherit from specific ancestors are useful, they are not absolute.  In other words, it’s certainly possible to inherit a very large chunk of DNA from a very specific distant ancestor when the rules of probability and the rule of thumb of 50% would indicate that you should not.

This is shown clearly in the Vannoy project where 5 cousins who descend from Elijah Vannoy born in 1786 (5 generations removed) share a very significant portion of chromosome 15.  These people are all 5 or more generations removed from the common ancestor (approximately 4th cousins) and should share less than 1% of their DNA in total, and certainly no large, unbroken segments.   As you can see, below, that’s not the case.  We don’t know why or how some DNA clumps together like this and is transmitted in complete (or nearly complete) segments, but it obviously is.  We often call these “sticky segments” for lack of a better term.

cousin 1

I downloaded this information into a spreadsheet where I can sort it by chromosome.  Below you can see the segments on chromosome 15 where these cousins match me.  Note that Buster is also a cousin from a second ancestor.

cousin 2

Given these incidental discoveries and the very large amount of DNA I share with these cousins on chromosome 15, I was quite interested in Dr. Mount’s following commentary:

“The probability that fourth cousins share at least one IBD [identical by descent] segment is 77%, and the expected length of this segment is 10 cM.” Now consider the next step. There is a 50% chance that that one shared segment will not be transmitted at all, but a 90% chance that if it is transmitted it will be just as big as it was (the same 10 cM.). What this means for genealogy on 23andMe is that for two people sharing one segment identical by descent there is no way to reliably estimate how far back the common ancestor was. Furthermore, no improvement in software can possibly change that, because the limitation is imposed by the genetics itself.”

Well, there goes the 50% rule – flying right out the window.  The 50% rule of thumb says that in any given transmission, there is a 50% chance that the segment will be transmitted (so far so good) and that if it is transmitted, roughly half of it would be transmitted, or approximately 5 cM.  That’s obviously not what is happening.
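For what it’s worth, a figure close to 90% falls out of a standard textbook approximation: if crossovers land randomly along a chromosome, the chance that a segment passes through one transmission with no crossover inside it is roughly e raised to the power of minus its length in Morgans (100 cM = 1 Morgan).  The sketch below is only an illustration of that approximation, not Dr. Mount’s own calculation, and the function names are made up.

```python
import math


def prob_no_crossover(length_cm: float) -> float:
    """Chance a segment of this length is not broken by a crossover in one transmission.

    Assumption: crossovers follow a simple Poisson model along the chromosome.
    """
    return math.exp(-length_cm / 100.0)


def prob_passed_intact(length_cm: float, transmissions: int) -> float:
    """Chance the segment is both passed on (50% each time) and unbroken every time."""
    return (0.5 * prob_no_crossover(length_cm)) ** transmissions


print(f"10 cM segment unbroken in one transmission: {prob_no_crossover(10):.1%}")
for steps in range(1, 5):
    print(f"Passed on intact through {steps} transmission(s): {prob_passed_intact(10, steps):.2%}")
```

The point of the sketch is that a 10 cM segment is far more likely to be handed down whole, or not at all, than to be neatly cut in half.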

Dr. Mount goes on to say that, “No matter how far back you go, every nucleotide of one’s genome is derived from some ancestor, and even going back 20 generations, the chance that the bit which has been inherited is part of a block 5 cM. or greater is still appreciable. In fact, even for 19th cousins, there is a real chance (13%) that any segment of DNA they have inherited in common will be 5 cM. or greater. Of course, as mentioned above, there is very little chance that two 19th cousins will share any IBD segments at all, but this is offset if one has many 19th cousins, which is often the case.”

5cM is the line-in-the-sand cutoff number many genetic genealogists use to determine whether DNA segments are IBD or IBS.

What this really means is that the more distant, or 19th, cousins that you have, the greater the chance that one or more of them will test and will indeed share a piece of DNA large enough to be identified by the testing companies as relevant.  The software companies will then apply their relationship estimating software to the size of the match and number of SNPs.  The results are often inaccurate, as Dr. Mount says.  Not inaccurate in that the match is incorrect, but the estimated relationship is incorrect because the DNA did not divide in half as the mathematical model says it should.  The “problem” is not in the software, but in the DNA itself.

“23andMe reports a “predicted relationship” (e.g. “4th cousin”) and a “relationship range” (e.g. “3rd to 7th cousin”). However, these ranges are likely to be wildly inaccurate, because the likely distance to a common ancestor, given only the information that two people share a single IBD segment, can vary enormously, based largely on how many relatives one has.”

And I will add, it will also vary by how and how much the DNA has or has not divided in every generation.

Dr. Mount goes on to provide the math and probability formulas for these various calculations, and explains what they mean, in English, then he summarizes by saying:

“Thus, if you have many more distant cousins, as would be expected if your ancestors had large families, then someone who shares a single IBD segment is more likely to be a distant cousin, because you have so many more distant cousins. The point where the increase in the number of cousins outweighs the loss of shared segments is five children per family. This is not extremely uncommon.”

This actually makes a lot of sense when I look at my results.  One of my ancestors, Abraham Estes (1647-1720), had at least 12 children, of whom 11 reproduced and had very large families.  This line was extremely prolific.  Many of my autosomal matches include Estes descendants.  Some of my other lines where my ancestor was one of just a few children have far fewer matches, likely because there are far fewer people out there descended from them.

Dr. Mount confirms this by saying that, “If one family among [your] 32 [great-great-great-grandparents] had five children and their descendants did as well, while others in the family reproduced at replacement rates (two children per family), then your more prolific ancestors (the parents of just one of your 31 great-great-grandparents) would account for over 3/4 of your fourth cousins.”
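A back-of-the-envelope count shows how quickly one prolific line can swamp the rest.  The sketch below is a simplified model, not Dr. Mount’s calculation: it assumes every family in a line has the same number of children, that everyone reproduces, and that no one appears in your tree twice.  The function name and the 16-couple layout (32 great-great-great-grandparents paired off) are illustrative assumptions.

```python
def cousins_via_couple(children_per_family: int, degree: int = 4) -> int:
    """Count cousins of the given degree descending from one ancestral couple.

    Your own ancestor has (children - 1) siblings, and each sibling's line
    multiplies by the family size once per generation down to your generation.
    """
    siblings_of_ancestor = children_per_family - 1
    return siblings_of_ancestor * children_per_family ** degree


COUPLES = 16  # 32 great-great-great-grandparents form 16 couples

prolific = cousins_via_couple(5)   # one line with five children per family
ordinary = cousins_via_couple(2)   # each of the other lines at two children per family
total = prolific + (COUPLES - 1) * ordinary
share = prolific / total

print(f"Fourth cousins from the prolific line: {prolific}")
print(f"Fourth cousins from each other line:   {ordinary}")
print(f"Share from the prolific line:          {share:.0%}")
```

Even in this crude model, the single prolific couple accounts for the great majority of fourth cousins, which is in line with Dr. Mount’s “over 3/4.”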

So what is the take away message to us from all of this?

  • The autosomal testing companies are doing the best they can predicting your cousin-level relationships with what they have to work with.
  • Real life genetic transmission does not follow the 50% rule of thumb beyond the first generation (parent-child).
  • The predictions get more uncertain and therefore unreliable the more distant they are.
  • Based on the unmeasurable randomness of the genetic transmission involved, there is no way for the testing companies to improve their predictions.
  • Expect more matches to your more prolific lines, and fewer to lines who had fewer children.
  • Beyond about the first or second cousin level, understand that predictions are only suggestions based on math.  Given that you understand why and how reality can vary, you can then utilize this information when analyzing your matches.
  • Drawing an arbitrary cM line for IBS vs IBD and utilizing only the segments above that threshold may eliminate the small segments you need to identify ancestors many generations removed.
  • Endogamous populations throw a monkey wrench into estimates and calculations, because population members are likely related many times over in unknown ways.  This makes the estimate of relatedness of two people appear closer than it is genealogically.  At least one of the testing companies, Family Tree DNA, attempts to correct for this mathematically when they are aware of the situation, such as in Jewish families.

You can read Dr. Mount’s article including his mathematical proofs, here.

Determining Ethnicity Percentages

Recently, as a comment to one of my blog postings, someone asked how the testing companies can reach so far back in time and tell you about your ancestors.  Great question.

The tests that reliably reach the furthest back, of course, are the direct line Y-Line and mitochondrial DNA tests, but the commenter was really asking about the ethnicity predictions.  Those tests are known as BGA, or biogeographical ancestry tests, but most people just think of them or refer to them as the ethnicity tests.

Currently, Family Tree DNA, 23andMe and Ancestry all provide this function as a part of their autosomal product, as does the Genographic 2.0 test.  In addition, third party tools available at GedMatch don’t provide testing, but allow you to expand what you can learn with their admixture tools if you upload your raw data files to their site.  I wrote about how to use these ethnicity tools in “The Autosomal Me” series.  I’ve also written about how accurate ethnicity predictions from testing companies are, or aren’t, here, here and here.

But today, I’d like to just briefly review the 3 steps in ethnicity prediction, and how those steps are accomplished.  It’s simple, really, in concept, but like everything else, the devil is in the details.

There are three fundamental steps.

  • Creation of the underlying population data base.
  • Individual DNA extraction.
  • Comparison to the underlying population data base.

Step 1:  Creation of the underlying population data base.

Don’t we wish this was as simple as it sounds.  It isn’t.  In fact, this step underpins the accuracy of the ethnicity predictions.  The old GIGO (garbage in, garbage out) concept applies here.

How do researchers today obtain samples of what ancestral populations looked like, genetically?  Of course, the evident answer is through burials, but burials are not only few and far between, the DNA often does not amplify, or isn’t obtainable at all, and when it is, we really don’t have any way to know if we have a representative sample of the indigenous population (at that point in time) or a group of travelers passing through.  So, by and large, with few exceptions, ancient DNA isn’t a readily available option.

The second way to obtain this type of information is to sample current populations, preferably ones in isolated regions, not prone to in-movement, like small villages in mountain valleys, for example, that have been stable “forever.”  This is the approach the National Geographic Society takes and a good part of what the Genographic Geno 2.0 project funding does.  Indigenous populations are in most cases our most reliable link to the past.  These resources, combined with what we know about population movement and history, are very telling.  In fact, National Geographic included over 75,000 AIMs (Ancestrally Informative Markers) on the Geno 2.0 chip when it was released.

The third way to obtain this type of information is by inference.  Both Ancestry and 23andMe do some of this.  Ancestry released its V2 ethnicity updates this week, and as a part of that update, they included a white paper available to DNA participants.  In that paper, Ancestry discusses their process for utilizing contributed pedigree charts and states that, aside from immigrant locations, such as the United States and Canada, a common location for 4 grandparents is sufficient information to include that individual’s DNA as “native” to that location.  Ancestry used 3000 samples in their new ethnicity predictions to cover 26 geographic locations.  That’s only 115 samples, on average, per location to represent all of that population.  That’s pretty slim pickins.  Their most highly represented area is Eastern Europe with 432 samples and the least represented is Mali with 16.  The regions they cover are shown below.

ancestry v2 8

Survey Monkey, a widely utilized web survey company, in their FAQ about Survey Size For Accuracy provides guidelines for obtaining a representative sample.  Take a look.  No matter which calculations you use relative to acceptable Margin of Error and Confidence Level, Ancestry’s sample size is extremely light.
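To put rough numbers on “extremely light,” here is a small sketch using the standard margin-of-error formula for a surveyed proportion, at a 95% confidence level and the worst-case 50/50 split.  Treating a genetic reference panel like a simple opinion survey is only an analogy, and the function name is made up, but it gives a feel for the sample sizes quoted above.

```python
import math


def margin_of_error(sample_size: int, z: float = 1.96, p: float = 0.5) -> float:
    """Margin of error for a proportion from a simple random sample.

    z = 1.96 corresponds to a 95% confidence level; p = 0.5 is the worst case.
    """
    return z * math.sqrt(p * (1 - p) / sample_size)


panels = [
    ("Average region (3000 samples / 26 regions)", 3000 // 26),
    ("Eastern Europe", 432),
    ("Mali", 16),
]

for label, n in panels:
    print(f"{label}: n = {n}, margin of error about {margin_of_error(n):.1%}")
```

By this yardstick the average region comes out at roughly plus or minus 9 points, and the smallest panel at about plus or minus 25.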

23andMe states in their FAQ that their ethnicity prediction, called Ancestry Composition, covers 22 reference populations and that they utilize public reference datasets in addition to data from their clients with known ancestry.

23andMe asks geographic ancestry questions of their customers in the “where are you from” survey, then incorporates the results of individuals with all 4 grandparents from a particular country.  One of the ways they utilize this data is to show you where on your chromosomes you match people whose 4 grandparents are from the same country.  In their tutorial, they do caution that just because a grandparent was born in a particular location doesn’t necessarily mean that they were originally from that location.  This is particularly true in the past few generations, since the industrial revolution.  However, it may still be a useful tool, when taken with the requisite grain of salt.

23andme 4 grandparents

The fourth way of creating the underlying population data base is to utilize academically published information or information otherwise available.  For example, the Human Genome Diversity Project (HGDP) information which represents 1050 individuals from 52 world populations is available for scrutiny.  Ancestry, in their paper, states that they utilized the HGDP data in addition to their own customer database as well as the Sorenson data, which they recently purchased.

Academically published articles are available as well.  Family Tree DNA utilizes 52 different populations in their reference data base.  They utilize published academic papers and the specific list is provided in their FAQ.

As you can see, there are different approaches and tools.  Depending on which of these tools are utilized, the underlying data base may look dramatically different, and the information held in the underlying data base will assuredly affect the results.

Step 2:  Your Individual DNA Extraction

This is actually the easy part – where you send your swab or spit off to the lab and have it processed.  All three of the main players utilize chip technology today, although they do not all test the same SNPs.  For example, 23andMe focuses on and therefore utilizes medical SNPs, whereas Family Tree DNA actively avoids anything that reports medical information, and does not utilize those SNPs.

In Ancestry’s white paper, they provide an excellent graphic of how, at the molecular level, your DNA begins to provide information about the geographic location of your ancestors.  At each DNA location, or address, you have two alleles, one from each parent.  These alleles can have one of 4 values, or nucleotides, at each location, represented by the abbreviations T, A, C and G, short for Thymine, Adenine, Cytosine and Guanine.  Based on their values, and how frequently those values are found in comparison populations, we begin to find correlations in geography, which takes us to the next step.

ancestry allele snps

Step 3:  Comparison to Underlying Population Data Base

Now that we have the two individual components in our recipe for ethnicity, a population reference set and your DNA results, we need to combine them.

After DNA extraction, your individual results are compared to the underlying data base.  Of course, the accuracy will depend on the quality, diversity, coverage and quantity of the underlying data base, and it will also depend on how many markers are being utilized or compared.

For example, Family Tree DNA utilizes about 295,000 out of 710,000 autosomal SNPs tested for ethnicity prediction.  Ancestry’s V1 product utilized about 30,000, but that has increased now to about 300,000 in the 2.0 version.

When comparing your alleles to the underlying data set one by one, patterns emerge, and it’s the patterns that are important.  To begin with, T, A, C and G are not absent entirely in any population, so looking at the results, it then becomes a statistics game.  This means that, as Ancestry’s graphic, above, shows, it becomes a matter of relativity (pardon the pun), and a matter of percentages.

For example, if the A allele above is shown in high frequencies in Eastern Europe, but in lower frequencies elsewhere, that’s good data, but may not by itself be relevant.  However, if an entire segment of locations, like a street of DNA addresses, is found in high percentages in Eastern Europe, then that begins to be a pattern.  If you have several streets in the city of You that are from Eastern Europe, then that suggests strongly that some of your ancestors were from that region.
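The “street of addresses” idea can be sketched in a few lines of Python.  This is a toy illustration only, not the method any testing company actually uses: the allele frequencies are invented, the SNPs are treated as independent, and the function names are made up.  It simply asks which reference population makes a person’s run of alleles most likely.

```python
import math

# Hypothetical frequencies of one allele at five neighbouring SNPs (invented numbers).
FREQUENCIES = {
    "Eastern Europe": [0.80, 0.75, 0.70, 0.85, 0.65],
    "Western Europe": [0.40, 0.45, 0.50, 0.35, 0.55],
    "East Asia":      [0.10, 0.20, 0.15, 0.05, 0.25],
}


def log_likelihood(copies, frequencies):
    """Log-probability of carrying copies[i] (0, 1 or 2) of the allele at each SNP."""
    total = 0.0
    for count, freq in zip(copies, frequencies):
        # Two alleles per address, one from each parent: a binomial with n = 2.
        probability = math.comb(2, count) * freq ** count * (1 - freq) ** (2 - count)
        total += math.log(probability)
    return total


def best_match(copies):
    """Return the reference population under which this run of alleles is most likely."""
    return max(FREQUENCIES, key=lambda pop: log_likelihood(copies, FREQUENCIES[pop]))


person = [2, 2, 1, 2, 1]  # copies of the allele carried at each of the five addresses
print(best_match(person))
```

No single address settles anything, since every population carries every allele at some frequency; it is the run of addresses taken together that tips the balance.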

To show this in more detailed format, I’m shifting to the third party tool, GedMatch and one of their admixture tools.  I utilized this when writing the series, “The Autosomal Me” and in Part 2, “The Ancestor’s Speak,” I showed this example segment of DNA.

On the graph below, which is my chromosome painting of a small part of one of my chromosomes on the top, and my mother’s showing the exact same segment on the bottom, the various types of ethnicity are colored, or painted.

The grid shows location, or address, 120 on the chromosome and each tick mark is another number, so 121, 122, etc.   It’s numbered so we can keep track of where we are on the chromosome.

You can readily see that both of us have a primary ethnicity of North European, shown by the teal.  This means that for this entire segment, the results are that our alleles are found in the highest frequencies in that region.

Gedmatch me mom

However, notice the South Asian, East Asian, Caucus, and North Amerindian. The important part to notice here, other than I didn’t inherit much of that segment at 123-127 from her, except for a small part of East Asian, is that these minority ethnicities tend to nest together.  Of course, this makes sense if you think about it.  Native Americans would carry Asian DNA, because that is where their ancestors lived.  By the same token, so would Germans and Polish people, given the history of invasion by the Mongols. Well, now, that’s kind of a monkey-wrench isn’t it???

This illustrates why the results may sometimes be confusing as well as how difficult it is to “identify” an ethnicity.  Furthermore, small segments such as this are often “not reported” by the testing companies because they fall under the “noise” threshold of between about 5 and 7cM, depending on the company, unless there are a lot of them and together they add up to be substantial.

In Summary

In an ideal world, we would have one resource that combines all of these tools.  Of course, these companies are “for profit,” except for National Geographic, and they are not going to be sharing their resources anytime soon.

I think it’s clear that the underlying data bases need to be expanded substantially.  The reliability of utilizing contributed pedigrees as representative of a population indigenous to an area is also questionable, especially pedigrees that only reach back two generations.

All of these tools are still in their infancy.  Both Ancestry and Family Tree DNA’s ethnicity tools are labeled as Beta.  There is useful information to be gleaned, but don’t take the results too seriously.  Look at them more as establishing a pattern.  If you want to take a deeper dive by utilizing your raw data and uploading it to GedMatch, you can certainly do so. The Autosomal Me series shows you how.

Just keep in mind that with ethnicity predictions, with all of the vendors, as is particularly evident when comparing results from multiple vendors, “your mileage may vary.”  Now you know why!

Ancestry’s Updated V2 Ethnicity Summary

Today when I signed on to Ancestry, I was greeted with a message that my new Ethnicity Estimate Preview was ready for viewing.  Yippee!

Ancestry v2 1

Ancestry announced some time back that they were updating this function.  Release 1 was so poor that it should never have been released.  However, V2 is somewhat improved.  In any case, it’s different. Let’s take a look.

The graphic below shows my initial V1 results, which bore very little resemblance to my ancestry.  They are still shown on my page at Ancestry.  I was pleased to see that, so I have a reference for comparison.

ancestry v2 2

Some years back, I did a pedigree analysis of my genealogy in an attempt to make sense of autosomal results from other companies.

The paper, “Revealing American Indian and Minority Heritage using Y-line, Mitochondrial, Autosomal and X Chromosomal Testing Data Combined with Pedigree Analysis” was published in the Fall 2010 issue of JoGG, Vol. 6 issue 1.

The pedigree analysis portion of this document begins about page 8.  My ancestral breakdown is as follows:

Geography | Percent
Germany | 23.8041
British Isles | 22.6104
Holland | 14.5511
European by DNA | 6.8362
France | 6.6113
Switzerland | .7813
Native American | .2933
Turkish | .0031

This leaves about 25% unknown.  However, this looks nothing like the 80% British Isles and the 12% Scandinavian in Ancestry’s V1 product.
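The “about 25%” is simply what remains after adding up the table.  A few lines of Python confirm the arithmetic; the values are copied from the table above, and the variable names are only for illustration.

```python
# Pedigree percentages copied from the table above.
pedigree = {
    "Germany": 23.8041,
    "British Isles": 22.6104,
    "Holland": 14.5511,
    "European by DNA": 6.8362,
    "France": 6.6113,
    "Switzerland": 0.7813,
    "Native American": 0.2933,
    "Turkish": 0.0031,
}

known = sum(pedigree.values())
unknown = 100 - known

print(f"Known:   {known:.4f}%")
print(f"Unknown: {unknown:.4f}%")
```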

In an article titled, “Ethnicity Results, True or Not” I compared my pedigree information with the results from all the testing vendors, including Ancestry’s V1 information.  Needless to say, they didn’t fare very well.

The next screen you see talks about what’s new, but being very anxious to see the results, I bypassed that for the moment to see my new results shown below.

ancestry v2 3

My initial reaction was that I was very excited to see both my Native and African admixture shown.  I thought maybe Ancestry had actually hit a home run.  Then I looked down and saw the rest.  Uh, no home run I’m afraid.  Shucks.  Clicking on the little plus signs provides this view.

ancestry v2 4

I noticed the little box at the bottom that says “show all regions,” so I clicked there.  The only difference between that display and the one above is that the regions with zero displayed as well.

My updated V2 results show primarily Western European and Scandinavian.  I certainly won’t argue with the western European, although the percentage seems quite high, but there is absolutely NO indication that I have any Scandinavian heritage, let alone 10%, and my British Isles is dramatically reduced.

Here are the two results side by side, in percentages, with my commentary.

Location | Ancestry V1 | Ancestry V2 | My Pedigree | Comments
British Isles | 80 | Great Britain 4, Ireland 2 | 22 | Great Britain includes Scotland
Scandinavia | 12 | 10 | 0 |
Italy/Greece | 0 | 2 | Turkish <1 |
North Africa | 0 | <1 | 0 |
Native American | 0 | <1 | <1 |
East Asian | 0 | <1 | 0 | Probably Native American
Western Europe | 0 | 79 | 51 |
Uncertain | 8 | 0 | 25 |

I am not going to take issue with any of the small percentages.  I fully understand how difficult trace ethnicity is to decipher.  My concern here is with the “big chunks,” because if the big chunks aren’t correct, there is also no confidence in the small ones.

I’m left wondering about the following:

  • I went from 80% British Isles in V1, which we knew was incorrect, to 6% in V2, which is also incorrect.  I have at least 22% British Isles.
  • I went from being 0% Western European in V1 to 79% in V2, which is also incorrect.  Now granted, I do have 25% uncertain in my own pedigree, and given that I’m a cultural mixture, some of that certainly could be Western European.  But all of it?  Given where my ancestors were found in colonial America, and when, it’s much more likely that the majority of the 25% that is uncertain in my pedigree chart would be British Isles.
  • Would you look at the V1 results and the V2 results, side by side, and believe for one minute they were describing the same person?  This is not a minor revision and there is very little consistency between the two – only 16%.  That means that 84% changed between the two versions.  And in that 16% is that pesky, unexplained Scandinavian, not found, by the way, by any other testing company.  Yes, I know about the Vikings, but still, 10 or 12%?  That’s equivalent to a great-grandparent, not trace amounts from centuries ago.
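The 16% consistency figure can be reproduced by taking, for each region, the smaller of the V1 and V2 percentages and adding those up. The sketch below does that in Python. It makes two assumptions that are not Ancestry's: the V2 "Great Britain 4, Ireland 2" is combined into a single British Isles value of 6, and the "<1" trace regions are left out, since they have no V1 counterpart anyway. The helper `expected_share` is likewise only an illustration of the average autosomal contribution of one ancestor a given number of generations back.

```python
# Percentages from the side-by-side comparison above.
# Assumption: V2 Great Britain (4) + Ireland (2) are combined as British Isles.
# Assumption: "<1" trace regions are omitted; they do not exist in V1.
v1 = {"British Isles": 80, "Scandinavia": 12, "Uncertain": 8}
v2 = {"British Isles": 6, "Scandinavia": 10, "Italy/Greece": 2, "Western Europe": 79}

# For each region, the part both versions agree on is the smaller value.
regions = set(v1) | set(v2)
overlap = sum(min(v1.get(r, 0), v2.get(r, 0)) for r in regions)
changed = 100 - overlap

print(f"Consistent between V1 and V2: {overlap}%")
print(f"Changed between V1 and V2:    {changed}%")


def expected_share(generations_back):
    """Average autosomal percentage inherited from one ancestor.

    1 = parent, 2 = grandparent, 3 = great-grandparent, and so on.
    This is an average expectation, not what any one person inherits.
    """
    return 100 * 0.5 ** generations_back


print(f"Great-grandparent: {expected_share(3)}%")
```

British Isles contributes 6 and Scandinavia 10 to the overlap, which gives the 16%, and a single great-grandparent averages 12.5%, which is why 10 or 12% Scandinavian is so hard to explain away as Viking-era trace ancestry.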

So V2 seems to be somewhat better, I think, but still nowhere close to what is known to be correct.  Based on the V2 results, which seem to have very little resemblance to the V1 results, I can’t help but wonder why Ancestry would have published such highly incorrect results for V1, and then adamantly defended those results, publishing videos, etc.  Doesn’t a corporation have some responsibility to its customers to provide correct information, and if it can’t, to be smart enough to know that and to not publish anything?  And if it’s the same technical team behind the scenes, how do we know that V2 isn’t equally flawed, given that the results still don’t seem to jibe with my known (and for the most part, DNA proven) pedigree chart?

One thing Ancestry has done that is an improvement is to provide additional information about their process for determining admixture and what has changed in the V2 version.  I went back and looked at the “What’s New” information that I skipped in my excitement to see my new results.  In that information, they provide the following bullets:

  • They increased the number of markers used for comparison from 30,000 to 300,000.
  • They increased the analysis passes from 1 to 40.  This is further explained in their white paper.
  • They broke Europe into 4 regions.
  • They broke West Africa into 6 regions.

ancestry v2 6

  • They updated the regions covered.  The V2 reference panel contains 3,000 samples that represent 26 distinct overlapping global regions (Table 3.1, below, from their white paper).  V1 covered 22 regions.


Region # Samples

Great Britain 111
Ireland 138
Europe East 432
Iberian Peninsula 81
European Jewish 189
Europe North 232
Europe South 171
Europe West 166
Finnish/Northern Russian 59
Africa Southeastern Bantu 18
Africa North 26
Africa Southcentral Hunter Gatherers 35
Benin/Togo 60
Cameroon/Congo 115
Ivory/Ghana 99
Mali 16
Nigeria 67
Senegal 28
Native American 131
Asia Central 26
Asia East 394
Asia South 161
Melanesia 28
Polynesia 18
Caucasus 58
Near East 141
  • Ancestry provided a white paper on their methods which explains how these ethnicity estimates are created.  This is very important and I applaud them for their transparency.  Unfortunately, you can’t see the white paper unless you are a subscriber and have taken their autosomal DNA test.  If you have, to see the white paper, click on the little question mark in the upper right hand corner of the ethnicity results page, then on the “whitepaper” icon.
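The sample counts in Table 3.1 above can be checked against the "3,000 samples" and "26 regions" that Ancestry quotes. The following Python sketch simply tallies the table; the variable names are illustrative, and the counts are copied as they appear above.

```python
# Sample counts per region, copied from Table 3.1 above.
panel = {
    "Great Britain": 111,
    "Ireland": 138,
    "Europe East": 432,
    "Iberian Peninsula": 81,
    "European Jewish": 189,
    "Europe North": 232,
    "Europe South": 171,
    "Europe West": 166,
    "Finnish/Northern Russian": 59,
    "Africa Southeastern Bantu": 18,
    "Africa North": 26,
    "Africa Southcentral Hunter Gatherers": 35,
    "Benin/Togo": 60,
    "Cameroon/Congo": 115,
    "Ivory/Ghana": 99,
    "Mali": 16,
    "Nigeria": 67,
    "Senegal": 28,
    "Native American": 131,
    "Asia Central": 26,
    "Asia East": 394,
    "Asia South": 161,
    "Melanesia": 28,
    "Polynesia": 18,
    "Caucasus": 58,
    "Near East": 141,
}

total_samples = sum(panel.values())
print(f"Regions: {len(panel)}")
print(f"Samples: {total_samples}")

# List the regions from most to fewest samples to see how uneven the panel is.
for region, count in sorted(panel.items(), key=lambda item: item[1], reverse=True):
    print(f"{region:40s} {count:4d} {100 * count / total_samples:5.1f}%")
```

The table does add up to 3,000 samples across 26 regions.  Sorting it also makes it easy to see that the regions are very unevenly represented, from 432 samples for Europe East down to 16 for Mali.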

ancestry v2 7

How Are Ethnicity Percentages Created?

Wanting to understand the process they are using, I moved to their educational material and Ethnicity Estimate white paper, which, unfortunately, I can’t link to.  You must be a subscriber to see this document.

The first thing I discovered is that they utilized 3,000 DNA samples as a reference data base, including the Human Genome Diversity Project data utilized by all researchers in this field.

ancestry v2 8

From their white paper:

“In developing the AncestryDNA ethnicity estimation V2 reference panel, we begin with a candidate set of 4,245 individuals. First, we examine over 800 samples from 52 worldwide populations from a public project called the Human Genome Diversity Project (HGDP) (Cann et al. 2002; Cavalli-Sforza 2005). Second, we examine samples from a proprietary AncestryDNA reference collection as well as AncestryDNA samples from customers consenting to participate in research. To obtain candidate reference panel candidates from these two sets, family trees are first consulted, and a sample is included in the candidate set if all lineages trace back to the same geographic region. Although this was not possible for HGDP samples, this dataset was explicitly designed to sample a large set of populations representing a global picture of human genetic variation.

In total, our reference panel candidates include over 800 HGDP samples, over 1,500 samples from the proprietary AncestryDNA reference collection, and over 1,800 AncestryDNA customers who have explicitly consented to be included in the reference panel.”

I’m assuming that the proprietary reference collection they mention is the Sorenson data they purchased in July 2012.  The Sorenson data base was compiled from individual donors who contributed the DNA samples and pedigree charts but without any supporting documentation.

So in addition to the publicly available data, Ancestry has utilized both the Sorenson and their own data bases.  That makes sense.  It may also be the root of the problem.

There’s another quote from their paper:

“Fortunately, knowing where your grandparents are born is often a sufficient proxy for much deeper ancestry. In the recent past, it was much more difficult and thus less common for people to migrate large distances. Because of this, it is frequently the case that the birthplace of your grandparents represents a much more ancient ancestral origin for your DNA.”

They do say that this does not apply to people in America, for example.

However, how many of you have confidence in the Ancestry trees, or any trees submitted, for that matter, in public data bases?  Ancestry only allows you to attach “facts” found in their data base.  This means, for example, if you want to upload your Gedcom file that has pages and pages of documentation including wills, tax lists, and other primary sorts of documentation, you can’t.  Well, you can, but only if you copy it off into a Word document and attach it separately to that person one page at a time.  In other words, Ancestry isn’t interested in any documentation or research that you’ve done elsewhere.  This also means that they have few tools themselves to determine whether your tree is accurate, especially once you get beyond the census years with family enumeration, meaning 1850 in the US.  What this means is that the only reliable references they have are their own data bases, excluding Rootsweb trees.  Ancestry owns Rootsweb too and Rootsweb has always allowed uploads of limited notes attached to people.  Some are exceedingly useful.

If Ancestry is utilizing large numbers of user submitted pedigree charts by which to calibrate or measure ethnicity, that could be a problem.

Let’s run a little experiment.  I am very familiar with the original records pertaining to Abraham Estes, born in 1647 in Nonington, Kent, England and who died in 1720 in King and Queen County, Virginia.  I have been a primary records researcher on this man for 25 years.  Not only are his records documented, but so are those of several preceding generations through church records in England.  In other words, we know what we know and what we don’t know.  We do NOT know his second wife’s surname, although there is a pervasive myth as to what it was, which is entirely unsubstantiated.

I entered his name/birth year into Ancestry’s search tool and I looked at the first 20 records shown in their “Family Trees.”  I wanted to see how many displayed correct or incorrect information.  Ancestry displays these trees in order, based, apparently, on the number of sources or attached records, implying records with more sources would be better to utilize.  That would generally be quite true.  Unfortunately, sources are often the IGI or Family Data Collection, which are also “unsourced,” creating a vicious cycle of undocumented rumors cited as sources.  Let’s take a look at what we have.

Record # | Incorrect Info Listed | Correct Info Listed | Grandparents Info Present/Correct
1 | First wife’s name entirely incorrect, but linked to correct original record.  Second wife’s surname entirely undocumented.  Multiple family crests listed but family was not armorial.  Children listed multiple times.  Son Abraham’s records attached to father. | Birth year and location.  Death date and location. | No
2 | First wife entirely missing.  Second wife’s surname entirely undocumented.  Marriage date entirely undocumented.  Third, unknown spouse listed with the same children given to spouse 2 and 3. | Birth year and location.  Death date and location. | No
3 | Abraham was given fictitious middle name.  Second wife’s surname entirely undocumented.  Most children missing and the two that are on the list are given fictitious middle names.  Marriage date for second wife is entirely undocumented. | Birth year and location, first marriage, death date and location. | No
4 | First wife’s surname missing.  Second wife’s surname entirely undocumented.  Have land transaction attached to him 13 years after he died.  Incorrect children. | Birth date and location, first wife’s first name and date of marriage, death date and location. | No
5 | Shows marriage for first and second wife on same day/place.  First wife’s name entirely wrong.  Shows a second marriage date to second wife.  Second wife’s surname entirely undocumented.  No burial location known, but burial location given.  Incorrect children. | Birth year and location. | No

After these first 5 records, I became discouraged and did not type up the remaining 15 records.  Not one displayed only correct information, nor did any have the man’s parents’ and grandparents’ names and birth locations documented correctly.  So much for using family trees as sources.

If Ancestry is assuming that where your grandfather was born is representative of where your family was originally from, if you are from a non-immigrant location (i.e. not the US, not Canada, not Australia, etc.), that too might be a problem.  There has been a lot of movement in the British Isles, for example, since the industrial revolution, particularly in the 1800s.  Where Abraham’s grandfather was born in 1555 is probably relevant, but the grandfather of someone living today is much less predictive.

So, where does this leave us? 

Apparently Ancestry’s V1 was worse than we thought, given that my 80% majority ancestry turned into 6% and my 0% Western Europe turned into 79%.  Neither of these is correct.

Ancestry’s V2 seems to be somewhat better, but raises the same types of questions about the results.

Ancestry’s white paper may indeed answer some of those questions, based on their use of contributed pedigree charts.  However, having said that, you would think that they could utilize families with a deep history of ancestry in a specific area, proven by various non-contributed (such as parish or will) records, in a non-urban environment.

Ironically, Ancestry did pick up on both my Native and African minority admixture, but they are still missing the boat on the majority factors, which calls the entire concoction into question.

So the net-net of all of this….it’s still not soup yet.  I’m disappointed and beginning to wonder if it ever will be.

Correlating Historical Facts to DNA Test Results

Sometimes DNA tests hold surprising results, results that the individual didn’t expect.  That’s what happened to Jack Goins, Hawkins County, Tennessee, Archivist and founder of the Melungeon Core DNA project.  Jack, a Melungeon descendant through several ancestors, expected that his Y paternal haplogroup would be either European or Native American, based on oral family history, but it wasn’t.  It was E1b1a, African.

Jack’s family and ancestors were key members of the Melungeon families found in Hawkins and Hancock Counties in Tennessee beginning in the early 1800s.  In order to discover more about this group of people, which included but was not limited to his own ancestors, Jack founded the Melungeon DNA projects.

Over time, descendants of most of the family lines had representatives test within both a Y-line and mitochondrial DNA project.  The result was a paper, Melungeons: A Multi-Ethnic Population, published in JoGG, the Journal of Genetic Genealogy, in April 2012.

Many people expected to discover that the Melungeons were primarily Native American, but this was not the outcome of the DNA project.  In fact, many of the direct paternal male lines were African and all of the direct maternal female lines tested were European.  While there are paper records, in one case, that state that one of the ancestors of the Melungeons was Native American (Riddle), and there is DNA testing of another line that married into the Melungeon families that proves that indirect line is Native American (Sizemore), there is no direct line testing that indicates Native ancestry.

Aside from the uproar the results caused among researchers who were hopeful of a different outcome, it also raises the question of whether the documents we do have of those families support the DNA results.  What did the contemporary people who knew them during their lifetimes think about their race?  Census takers, tax men and county clerks?  Are there patterns that emerge?  Sometimes, when we receive new information, be it genetic or otherwise, we need to revisit our documentation and look with a new set of eyes.

It’s common practice in genetic genealogy circles when “undocumented adoptions” are discovered, for example, to revisit the census and look for things like a child’s birthdate being before the parents’ marriage.  Something that went unnoticed during initial data gathering or was assumed to be in error suddenly becomes extremely important, perhaps the key to unraveling what happened to those long-ago ancestors.  Like in all projects, some descendant lines we expected to match, didn’t.

Recently Jack Goins undertook such an analysis of the documentary records collected over the years in the various counties where the Melungeon families or their direct ancestors lived.  We know that today, and in the 1900s, most of these families appear physically primarily European, an observation supported by autosomal DNA testing.  So we’re looking for records that indicate minority admixture.

Do the records indicate that these people were black, Native, European, mixed or something else, like Portuguese?  Was the African admixture recent, so recent that their descendants were viewed as mixed-race, or were the African haplogroups introduced long ago, hundreds or thousands of years ago perhaps, maybe in Mediterranean Europe?  If that was the case, then the Melungeon ancestors in America would have been considered “European,” meaning they looked white.  What do the records say about these families?  Were they uniformly considered white, black, mixed or Native in all of the locations where family members moved as they dispersed out of colonial Virginia?

If these men were Native Americans, would they have likely fought against the Indians in the French and Indian War in 1754?  Melungeon ancestors did just that and they are specifically noted as fighting “against the Shawnee.”  Their families were found in census records as “free people of color” and “mulatto” countless times which indicates they were not slaves and were not white.  On one later census record, below, in 1880, Portugee was overstricken and W for white entered.

1880 census
1880 census 2

Melungeon families and their ancestors were listed on tax records and other records as mulattoes, never as mustee and only once as Indian.  Mulatto typically means mixed black and white, although it can mean Native and white, while mustee generally means mixed Indian with something else.  On one 1767 tax list, Moses Riddle, a maternal ancestor of a Melungeon family, is listed as Indian, but this is the only instance found in the hundreds of records searched.  The Riddle family paternal haplogroup reflects European ancestry, so apparently the Indian ancestor originated in a maternal line.

Court records identify Melungeon families as “colored” and “black” and “African” and “free negroes and mulattoes” as well as white.  In the 1840s, a group of Melungeon men, descendants of these individuals classified as mulattoes and free people of color were prosecuted for voting, a civil liberty forbidden to those “not white,” and probably as a political move to make examples of them.  Some of these men were found not guilty, one simply paid the fine, probably to avoid prosecution due to his advanced age, and the cases were dismissed against the rest.  Some were also prosecuted for bi-racial marriage when it was illegal for anyone of mixed heritage to marry a white person.  In earlier cases, in the 1700s in Virginia, these families were prosecuted for “concealing tithables” specifically for not listing their wives, “being mulattoes.”  In another case, the records indicate an individual being referred to as ‘yellow complected,’ a term often used for a light skinned mulatto.  And yet another case states that while the men were “mulattos,” their fathers were free and their wives were white.

There are many records, more than 1600 in total, that we indexed and cataloged when writing the paper, and more have surfaced since.  In all of those records, only one contemporaneous record, the 1767 Riddle tax list, states the person was an Indian.  None, other than the 1880 census record, state that they were Portuguese.  There are many that indicate African or mixed heritage, of some description, and there are also many that don’t indicate any admixture.  Especially in later censuses, as the families outmarried to some extent, they were nearly uniformly listed as white.  Still, this group of people looked “different” enough from their neighbors to be labeled with the derisive name of Melungeon.

While this group, based on mitochondrial DNA testing, did initially marry European women, generations of intermarriage would have caused the entire group to be darker than the nonadmixed European population in the 1700s and 1800s.  By this time, neither they nor their neighbors were sure what they were, so they claimed Portuguese and Indian.  No one claimed to have black ancestors, in fact, most denied it vehemently.  By this time, so many generations had passed that they may not have known the whole truth, and there is indeed evidence of two Indian lines within the Melungeon community.

In light of these records, the DNA results should not have been as surprising as they were.  However, this body of research had never been analyzed as a whole before.

Since the original paper was published, four additional paternal lines documented as Melungeon but without DNA representation/confirmation in the original paper have tested, and all four of them, Nichols, Perkins, Shoemake/Shumach and Bolin/Bolton carry haplogroup E1b1a.  They are not matches to each other or other Melungeon paternal lines, so it’s not a matter of undocumented adoptions within a community.

The DNA project administrators certainly welcome additional participants who descend from the Melungeon families.  Y-line DNA requires a male who descends from a patriarch via all males, given that males pass their Y chromosome only to sons.

There may indeed be Native American lines yet undiscovered within the female or ancestral lines, and we are actively seeking people descended from the wives of these Melungeon families through all women. Mitochondrial DNA, which tests the maternal line, is passed to both genders of children, but only females pass it on.  So to represent your Melungeon maternal ancestor, you must descend from her through all females, but you yourself can be either male or female.
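The two inheritance rules just described can be written down as a simple check on a line of descent. The Python sketch below is only an illustration of those rules, not a tool the project uses; the function name `can_represent` and the "M"/"F" list format are assumptions made for the example.

```python
def can_represent(line, test):
    """Check whether a tester can represent an ancestor for a given test.

    line: list of sexes ("M" or "F") along the line of descent, starting
          with the ancestor and ending with the person who will test.
    test: "Y" for Y-line DNA or "mtDNA" for mitochondrial DNA.
    """
    if len(line) < 2:
        raise ValueError("line needs at least the ancestor and the tester")
    if any(sex not in ("M", "F") for sex in line):
        raise ValueError("sexes must be 'M' or 'F'")

    if test == "Y":
        # The Y chromosome passes father to son, so every person
        # in the line, including the tester, must be male.
        return all(sex == "M" for sex in line)
    if test == "mtDNA":
        # Mitochondrial DNA passes from a mother to all her children,
        # so everyone above the tester must be female; the tester
        # can be either male or female.
        return all(sex == "F" for sex in line[:-1])
    raise ValueError("test must be 'Y' or 'mtDNA'")


# Patriarch -> son -> grandson: the grandson can take the Y-line test.
print(can_represent(["M", "M", "M"], "Y"))
# Patriarch -> daughter -> grandson: the Y line is broken at the daughter.
print(can_represent(["M", "F", "M"], "Y"))
# Wife -> daughter -> grandson: the grandson carries her mitochondrial DNA.
print(can_represent(["F", "F", "M"], "mtDNA"))
# Wife -> son -> granddaughter: the son did not pass on her mitochondrial DNA.
print(can_represent(["F", "M", "F"], "mtDNA"))
```

Only the first and third examples qualify, which is why a male descendant can stand in for a maternal ancestor but a line that passes through even one son cannot.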

While the primary focus is still to document the various direct family lines utilizing Y-line and mitochondrial DNA, the advent of autosomal testing has opened the door for other Melungeon descendants to test as well.  In fact, the project administrators have organized a separate project for all descendants who have taken the autosomal Family Finder test at Family Tree DNA called the Melungeon Families project.

The list of eligible Melungeon surnames is Bell, Bolton, Bowling, Bolin, Bowlin, Breedlove, Bunch, Collins, Denham, Gibson, Gipson, Goins, Goodman, Minor, Moore, Menley, Morning, Mullins, Nichols, Perkins, Riddle, Sizemore, Shumake, Sullivan, Trent and Williams.  For specifics about the paternal lines, patriarchs and where these families are historically located, please refer to the paper.

Furthermore, anyone with documented proof of additional Melungeon families or surnames is encouraged to provide that as well.  Surnames are only added to the list with proof that the family was referenced as Melungeon from a documented historical record or is ancestral to a documented Melungeon family.  For example, the Sizemore family was never directly referred to as Melungeon in documented sources, but Aggy Sizemore (haplogroup H/European), daughter of George Sizemore (haplogroup Q/Native) married Zachariah Minor (haplogroup E1b1a/African).  The Minor family is one of the Melungeon family names.  So while Sizemore itself is not Melungeon, it is certainly an ancestral name to the Melungeon group.

For more information, read Jack Goins’ article, Written Records Agree with Melungeon DNA Results.