Generational Inheritance

Autosomal DNA testing has opened up a brave new world for genealogists.  Along with that opportunity comes some frustration, and sometimes desperation, to wring every possible tidbit of information out of autosomal results, which can push past the envelope of what the technology and DNA can tell us.

I often have clients who want me to take a look at DNA results from people several generations removed from each other and try to determine if the ancestors are likely to be brothers, for example.  While that’s fairly feasible in the first few generations, the further back in time one goes, the less reliably we can say much of anything about how DNA is transmitted.  Hence, the less we can say, reliably, about relationships between people.

The best we can ever do is to talk in averages.  It’s like a coin flip.  Take a coin out right now and flip it 10 times.  I just did, and did not get 5 heads and 5 tails, which the average would predict.  But an average is the total of a large number of outcomes divided by the number of events.  That isn’t the same thing as saying that if you repeat the event 10 times you will get 5 heads and 5 tails, or the average.  Each of those 10 flips is entirely independent, so you could have any of 11 different outcomes:

  • 0 heads 10 tails
  • 1 head 9 tails
  • 2 heads 8 tails
  • 3 heads 7 tails
  • 4 heads 6 tails
  • 5 heads 5 tails
  • 6 heads 4 tails
  • 7 heads 3 tails
  • 8 heads 2 tails
  • 9 heads 1 tail
  • 10 heads 0 tails

What the average does say is that in the end, you are most likely to have an average of 5 heads and 5 tails – and the larger the series of events, the more likely you are to reach that average.

My 10 single event flips were 4 heads and 6 tails, clearly not the average.  But if I did 10 series of coin flips, I bet my average would be close to 5 and 5 – and at 100 flips, the split is almost assured to be close to 50-50 – because the population, or number of events, has increased to the point where the average is almost assured.

You can see above, that while the average does indeed map to 5-5, or the 50-50 rule, the results of the individual flips are no respecter of that rule and are not connected to the final average outcome.  For example, if one set of flips is entirely tails and one set of flips is entirely heads, the average is still 50/50 which is not at all reflective of the actual events.
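To put numbers on the coin example, here is a minimal Python sketch that computes the exact chance of each of the 11 outcomes for 10 flips.  It assumes a fair coin and independent flips, and the function name is my own, chosen only for illustration.

```python
from math import comb

def heads_distribution(flips: int) -> list[float]:
    """Probability of seeing k heads (k = 0..flips) with a fair coin."""
    total_sequences = 2 ** flips  # every heads/tails sequence is equally likely
    return [comb(flips, k) / total_sequences for k in range(flips + 1)]

dist = heads_distribution(10)
for heads, probability in enumerate(dist):
    print(f"{heads} heads, {10 - heads} tails: {probability:.3f}")
```

Exactly 5 heads and 5 tails is the single most likely outcome, yet it accounts for only 252 of the 1,024 equally likely sequences, or about a quarter of the time.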

And so it goes with inheritance too.

However, we have come to expect that the 50% rule applies most of the time.  We know that it does, absolutely, with parents.  We do receive 50% of our DNA from each parent, but which 50%?  From there, it can vary, meaning that we don’t necessarily get 25% of each grandparent’s DNA.  So while we receive 50% in total from each parent, we don’t necessarily receive every other segment or location, so it’s not like a riffle card shuffle where every other card is interspersed.

If one parent’s DNA sequence is:


A child cannot be presumed to receive every other allele, shown in red below.


The child could receive any portion of this particular segment, all of it, or none of it.

So, if you don’t receive every other allele from a parent, then how do you receive your DNA and how does that 50% division happen?  The bottom line is that we don’t know, but we are learning.  This article is the result of a learning experience.

Over time, genetic genealogists have come to expect that we are most likely to receive 25% of our DNA from each grandparent – which is statistically true when there are enough inheritance events.  This reflects our expectation of the standard deviation, where about 2/3rds of the results will be within the closest 25% in either direction of the center.  You can see expected standard deviation here.

This means that I would expect an inheritance frequency chart to look like this.

expected inheritance frequency

In this graph above, about half of the time, we inherit 50% of the DNA of any particular segment, and the rest of the time we inherit some different amount, with the most frequently inherited amounts being closer to the 50% mark and the outliers being increasingly rare as you approach 0% and 100% of a particular segment.

But does this predictability hold when we’re not talking about hundreds of events….when we’re not talking about population genetics….but our own family genetics, meaning one transmission event, from parent to child?  Because if that expected 50% factor doesn’t hold true, then that affects DRAMATICALLY what we can say about how related we are to someone 5 or 6 generations ago and how we can analyze individual chromosome data.

I have been uncomfortable with this situation for some time now, and the growing body of anecdotal evidence has made me increasingly uncomfortable.

There are repeated anecdotal instances of significant segments that “hold” intact for many generations.  Statistically, this should not happen.  When this does happen, we, as genetic genealogists, consider ourselves lucky to be one of the 1% at the end of the spectrum, that genetic karma has smiled upon us.  But is that true?  Are we at the lucky 1% end of the spectrum?

This phenomenon is shown clearly in the Vannoy project where 5 cousins who descend from Elijah Vannoy, born in 1786, share a very significant portion of chromosome 15.  These people are all 5 generations or more removed from the common ancestor (approximately 4th cousins) and should share less than 1% of their DNA in total, and certainly no large, unbroken segments.  As you can see below, that’s not the case.  We don’t know why or how some DNA clumps together like this and is transmitted in complete (or nearly complete) segments, but it obviously is.  We often call these “sticky segments” for lack of a better term.

cousin 1

I downloaded this chromosome 15 information into a spreadsheet where I can sort it by chromosome.  Below you can see the segments on chromosome 15 where these cousins match me.

cousin 2

Chromosome 15 is a total of 141 cM in length and has 17,269 SNPs.  Therefore, at 5 generations removed, we would expect to see these people share a total of 4.4cM and 540 SNPs, or less for those more distantly related.  This would be under the matching threshold at either Family Tree DNA or 23andMe, so they would not be shown as matches at all.  Clearly, this isn’t the case for these 5 cousins.  This DNA held together and was passed intact for a total of 25 different individual inheritance events (5 cousins times 5 events, or generations, each).  I wrote about this in the article titled “Why Are My Predicted Cousin Relationships Wrong?”
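The 4.4cM and 540 SNP figures come from halving the chromosome total once per generation.  A small Python sketch of that arithmetic follows; the function and constant names are mine, and the halving rule is the averaging assumption this article goes on to question.

```python
CHR15_CM = 141.0     # length of chromosome 15 in cM, as given above
CHR15_SNPS = 17269   # SNPs tested on chromosome 15, as given above

def expected_share(total: float, generations: int) -> float:
    """Expected amount left after halving once per generation (the 50% rule)."""
    return total / 2 ** generations

expected_cm = expected_share(CHR15_CM, 5)
expected_snps = expected_share(CHR15_SNPS, 5)
print(f"Expected after 5 generations: {expected_cm:.1f} cM, {expected_snps:.0f} SNPs")
```

Changing the second argument shows how quickly the expected amount shrinks for more distant cousins.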

Finally, I had a client who just would not accept no for an answer, wanted desperately to know the genetically projected relationship between two men who lived in the 1700s, and I felt an obligation to look into generational inheritance further.

About this same time, I had been working with my own matches at 23andMe.  Two of my children have tested there as well, a son and a daughter, so all of my matches at 23andMe obviously match me, and may or may not match my children.  This presented the perfect opportunity to study the amount of DNA transmitted in each inheritance event between me and both children.

Utilizing the reports at, I was able to download all of my matches into a spreadsheet, but then to also download all of the people on my match list that all of my matches match too.

I know, that was a tongue twister.  Maybe an example will help.

I match John Doe.  My match list looks like this and goes on for 353 lines.

match list

I only match John Doe on one chromosome at one location.  But finding who else on my match list of 353 people John Doe matches is important because it gives me clues as to who is related to whom and descends from the same ancestor.  This is especially true if you recognize some of the people that your match matches, like your first cousin, for example.  This suggests, below, that John Doe is related to me through the same ancestor as my first cousin, especially if John matches me with even more people who share that ancestor.  If my cousin and I both match John Doe on the same segment, that is strongly suggestive that this segment comes from a common ancestor, like in the previous Vannoy example.

Therefore, I methodically went through and downloaded every single one of my matches’ matches (from my match list) to see who was also on their list, and built myself a large spreadsheet.  That spreadsheet exercise is a topic for another article.  The important thing about this process is that how much DNA each of my children match with John Doe tells me exactly how much of my DNA each of my children inherited from me, versus their father, in that segment of DNA.

match comparison

In the above example, I match John Doe on Chromosome 11 from 37,000,000-63,000,000.  Looking at the expected 50% inheritance, or normal distribution, both of my children should match John Doe at half of that.  But look at what happened.  Both of my children inherited almost exactly all of the same DNA that I had to give.  Both of them inherited just slightly less in terms of genetic distance (cM) and also in terms of the number of SNPs.

It’s this type of information that has made me increasingly skeptical about the 50% bell curve standard deviation rule as applied to individual, not population, genetics.  The bell curve, of course, implies that the 50th percentile is the most likely event to occur, with the 49th being next most likely, etc.

This does not seem to be holding true.  In fact, in this one example alone, we have two examples of nearly 100% of the data being passed, not 50% in each inheritance event.  This is the type of one-off anecdotal evidence that has been making me increasingly uncomfortable.

I wanted something more than anecdotal evidence.  I copied all of the match information for myself and my children with my matches to one spreadsheet.  There are two genetic measures that can be utilized, centimorgans (cM) or total SNPs. I am using cM for these examples unless I state otherwise.

In total, there were 594 inheritance events shown as matches between me and others, and those same others and my children.

Upon further analysis of those inheritance events, 6 of them were actually not inheritance events from me.  In other words, those people matched me and my children on different chromosomes.  This means that the matches to my children were not through me, but from their father’s side or were IBS, Identical by State.

son daughter comparison

This first chart is extremely interesting.  Including all inheritance events, 55% of the time, my children received none of the DNA I had to give them.  Whoa Nellie.  That is not what I expected to see.  They “should have” received half of my DNA, but instead, half of the time, they received none.

The balance of the time, they received some of my DNA 23% of the time and all of my DNA 21% of the time.  That also is not what I expected to see.

Furthermore, there is only one inheritance event in which one of my children actually inherited exactly half of what I had to offer, so significantly less than 1% at .1%.  In other words, what we expected to see actually happened the least often and was vanishingly rare when not looking at averages but at actual inheritance events.

Let’s talk about that “none” figure for a minute.  In this case, none isn’t really accurate, but I can’t be more accurate.  None means that 23andMe showed no match.  Their threshold for matching is 7cM (genetic distance) and 700 SNPs for the first matching segment, and then 5cM and 700 SNPs for secondary matching segments.  However, if you have over 1000 matches, which I do, matches begin to “fall off,” the smallest ones first, so you can’t tell what the functional match threshold is for you or for the people you match.  We can only guess, based on their published thresholds.

So let’s look at this another way.

Of the 329 times that my children received none of my DNA, 105 of those transmissions would be expected to be under the 7cM threshold, based on a 50% calculation of how many cMs I matched with the individual.  However, not all of those expected events were actually under the threshold, and many transmissions that were not expected to be under that threshold, were.  Therefore, 224, or 68% of those “none” events were not expected if you look at how much of my DNA the child would be expected to inherit at 50%.
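The reasoning in this step can be sketched in a few lines of Python.  The 7cM figure is the published first-segment threshold quoted earlier; the function names are hypothetical and only illustrate the check, not the actual spreadsheet formulas.

```python
FIRST_SEGMENT_THRESHOLD_CM = 7.0  # published 23andMe threshold quoted earlier

def expected_child_cm(parent_cm: float) -> float:
    """What a child would share if exactly half of the parent's match passed."""
    return parent_cm * 0.5

def expected_to_drop_off(parent_cm: float) -> bool:
    """True when the expected half would fall under the match threshold."""
    return expected_child_cm(parent_cm) < FIRST_SEGMENT_THRESHOLD_CM

none_events = 329           # times a child showed no match
expected_none_events = 105  # of those, expected to fall under the threshold
unexpected = none_events - expected_none_events
print(f"{unexpected} unexpected 'none' events, {unexpected / none_events:.0%} of the total")
```

For example, a 12cM match of mine would be expected to drop off in a child (half is 6cM), while a 15.2cM match would not (half is 7.6cM).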

Another very interesting anomaly that pops right up is the number of cases where my children inherited more than I had to give them.  In the example below, you can see that I match Jane Doe with 15.2cM and 2859 SNPs, but my daughter matches Jane with 16.3cM and 2960 SNPs on the same chromosome.

spreadsheet layout

There are a few possibilities to explain this:

  • My daughter also matches this person on her father’s side at this transition point.
  • My daughter matches this person IBS at this point.
  • The 23andMe matching software is trying to compensate for misreads.
  • There are misreads or no calls in my file.

There of course may be a combination of several of these factors, but the most likely is the fact that she is IBS at this location and the matching software is trying to be generous to compensate for possible no-calls and misreads.  I suggest this because they are almost uniformly very small amounts.

Therefore when my children match me at 100% or greater, I simply counted it as an exact match.  I was surprised at how many of these instances there were.  Most were just slightly over the value of 2 in the “times expected” column.  To explain how this column functions, a value of 1 is the expected amount – or 50% of my DNA.  A value of 2 means that the child inherited all of the DNA I had to offer in that location.  Any value over 2 means that one or more of the bulleted possibilities above occurred.
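As a rough illustration of how the “times expected” column works, here is a Python sketch.  The function names and the none/some/all labels as code are my own assumptions for illustration; the spreadsheet itself may be laid out differently.

```python
def times_expected(parent_cm: float, child_cm: float) -> float:
    """Child's match as a multiple of the expected half of the parent's match."""
    return child_cm / (parent_cm * 0.5)

def classify(parent_cm: float, child_cm: float) -> str:
    """Label an event none / some / all, treating anything at or over 2 as all."""
    if child_cm == 0:
        return "none"  # the child shows no match at all
    ratio = min(times_expected(parent_cm, child_cm), 2.0)  # normalize overshoot
    return "all" if ratio >= 2.0 else "some"

# Jane Doe: I match at 15.2 cM, my daughter at 16.3 cM
print(round(times_expected(15.2, 16.3), 2), classify(15.2, 16.3))
```

The Jane Doe example comes out slightly over 2, so it is counted as an exact match.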

Between both of my children, there were a total of 75, or 60% with values greater than 2 on cMs and 96, or 80%, on SNPs, meaning that my children matched those people on more DNA at that location than I had to offer.  The range was from 2 to 2.4 with the exception of one match that was at 3.7.  That one could well be a valid transition (other parent) match.

There has been a lot of discussion recently about X chromosome inheritance.  In this case, the X would be like any other chromosome, since I have two Xs to recombine and give to my children, so I did not remove X matches from these calculations.  The X is shown as chromosome 99 here and 23 on the graphs to enable correct column sorting/graphing.

In the chart below, inheritance events are charted by chromosome.  The “Total” columns are the combined events of both my son and daughter.  The blue and pink columns are the inheritance events for each of them, which together equal the total, of course.

The “none” column reflects transmissions on that chromosome where my children received none of my DNA.  The “some” column reflects transmission events where my children received some portion of my DNA between 0 (none) and 100% (all).  The “all” column reflects events where my children received all of the DNA that I had to offer.

chromosomal comparison

I graphed these events.

total inheritance graph

The graph shows the total inheritance events between both of my children by chromosome.  Number 23 in these charts is the X chromosome.

son inheritance graph

daughter inheritance graph

These inheritance numbers cause me to wonder what is going on with chromosome 5 in the case of both my daughter and son, and also chromosome 6 with my son.  I wonder if this would be uniform across families relative to chromosome 5, or if it is simply an anomaly within my family inheritance events.  It seems odd that the same anomaly would occur with both children.

son daughter inheritance graph

What this shows is that we are not dealing with a distribution curve where the majority of the events are at the 50% level and those that are not are progressively nearer to the 50% level than either end.  In other words, the Expected Inheritance Frequency is not what was found.

expected inheritance frequency

The actual curve, based on the inheritance events observed here, is shown below, where every event that was over the value of 2, or 100%, was normalized to 2.  This graph is dramatically different than the expected frequency, above.

actual inheritance frequency

Looking at this, it becomes immediately evident that we inherit either all or nothing of our parents’ DNA segments 85% of the time, and only about 15% of the time do we inherit a portion of our parents’ DNA segments.  Very, very rarely is the portion we inherit actually 50%, one tenth of one percent of the time.

Now that we understand that individual generational inheritance is not a 50-50 bell curve event, what does this mean to us as genetic genealogists?

I asked fellow genetic genealogist Dr. David Pike, a mathematician, to look this over, and he offered the following commentary:

“As relationships get more distant, the number of blocks of DNA that are likely to be shared diminishes greatly.  Once down to one block, then really there are three outcomes for subsequent inheritance:  either the block is passed intact, no part of it is passed on, or recombination happens and a portion of it is passed on.  If we ignore this recombination effect (which should rarely affect a small block) then the block is either passed on in an “all or nothing” manner.  There’s essentially no middle ground with small blocks and even with lots of examples it doesn’t really make sense to expect an average of 50%.  As an analogy, consider the human population:  with about half of us being female and about half of us being male, the “average” person should therefore be androgynous, and yet very few people are indeed androgynous.”

In other words, even if you do have a segment that is 10 cMs in length, it’s not 10 coin flips, it’s one coin flip and it’s going to either be all, nothing or a portion thereof, and based on the 85% and 15% figures above, it’s roughly 6 times more likely to be all or nothing than to be a partial inheritance.

So how do we resolve the fact that when we are looking at the 700,000 or so locations tested at Family Tree DNA and the 600,000 locations tested at 23andMe, we can in fact use the averages to predict relationships, at least in closely related individuals, but we can’t utilize that same methodology in these types of individual situations?  There are many inheritance events being taken into consideration, 600,000 – 700,000, an amount that is mathematically high enough to overcome the individual inheritance issues.  In other words, at this level, we can utilize averages.  However, when we move past the larger population model, the individual model simply doesn’t fit anymore for individual event inheritance – in other words, looking at individual segments.

Dr. Pike was kind enough to explain this in mathematical terms, but ones that the rest of us can understand:

“I think that part of what is at stake is the distinction between continuous versus discrete events.  These are mathematical terms, so to illustrate with an example, the number line from 0 to 10 is continuous and includes *all* numbers between them, such as 2.55, pi, etc.  A discrete model, however, would involve only a finite number of elements, such as just the eleven integers from 0 to 10 inclusive.  In the discrete model there is nothing “in between” consecutive elements (such as 3 and 4), whereas in the continuous model there are infinitely many elements between them.

It’s not unlike comparing a whole spectrum against a finite handful of a few options.  In some cases the distinction is easily blurred, such as if you conduct a survey and ask people to rate a politician on a discrete scale of 0 to 10… in this case it makes intuitive sense to say that the politician’s average rating was 7.32 (for example) even though 7.32 was not one of the options within the discrete scale.

In the realm of DNA, suppose that cousins Alice and Bob share 9 blocks of DNA with each other and we ask how many blocks Alice is likely to share with Bob’s unborn son.  The answer is discrete, and with each block having a roughly 50/50 chance we expect that there will likely be 4 or 5 blocks shared by Alice and Bob Jr., although the randomness of it could result in anywhere from 0 to 9 of the blocks being shared.  Although it doesn’t make practical sense to say that “four and a half” blocks will likely be shared [well, unless we allow recombination to split a block and thereby produce a shared “half block”], there is still some intuitive comfort in saying that 4.5 is the average of what we would expect, but in reality, either 4 or 5 blocks are shared.

But when we get to the extreme situation of there being only 1 block, for which the discrete options are only 0 or 1 block shared, yes or no, our comfortable familiarity with the continuous model fails us.  There are lots of analogies here, such as what is the average of a coin toss, what is the average answer to a True/False question, what is the average gender of the population, etc.

Discrete models with lots of options can serve as good approximations of continuous situations, and vice-versa, which is probably part of what’s to blame for confusion here.

Really DNA inheritance is discrete, but with very many possible segments [such as if we divided the genome up into 10 cM segments and asked how many of Alice’s paternal segments will be inherited by one of her children], we can get away with a continuous model and essentially say that the answer is roughly 50%.  Really though, if there are 3000 of these blocks, the actual answer is one of the integers:  0, 1, 2, …, 2999, 3000.  The reality is discrete even though we like the continuous model for predicting it.

However, discrete situations with very few options simply cannot be modelled continuously.”
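Dr. Pike’s Alice and Bob example can be worked through with a short Python sketch.  It assumes, as he does, that each block has an independent 50/50 chance of being passed and that recombination is ignored; the function name is mine.

```python
from math import comb

def shared_block_probs(blocks: int, p: float = 0.5) -> list[float]:
    """Chance that exactly k of `blocks` shared blocks are passed on (k = 0..blocks)."""
    return [comb(blocks, k) * p**k * (1 - p) ** (blocks - k) for k in range(blocks + 1)]

nine = shared_block_probs(9)
average_nine = sum(k * prob for k, prob in enumerate(nine))
print(f"9 blocks: average {average_nine}, chance of 4 or 5 shared {nine[4] + nine[5]:.3f}")

one = shared_block_probs(1)
average_one = sum(k * prob for k, prob in enumerate(one))
print(f"1 block: average {average_one}, possible outcomes 0 or 1 at {one[0]} each")
```

With 9 blocks the average of 4.5 sits between the two most likely outcomes, so it is a useful summary.  With a single block the average of 0.5 describes neither of the two things that can actually happen.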

Back to our situation where we are attempting to determine a relationship of 2 men born in the 1700s whose descendants share fragments of DNA today.  When we see a particularly large fragment of DNA, we can’t make any assumptions about age or how long it has been in existence by “reverse engineering” its path to a common ancestor by doubling the amount of DNA in every generation.  In other words, based on the evidence we see above, it has most likely been passed entirely intact, not divided.  In the case of the Vannoy DNA, it looks like the ends have been shaved a few times, but the majority of the segment was passed entirely intact.  In fact, you can’t double the DNA inherited by each individual 5 times, because in at least one case, Buster, doubling his total matching cM, 100, even once would yield a number of cM greater than the size of chromosome 15 at 141 cM.

Conversely, when we see no DNA matches, for example, in people who “should be” distant cousins, we can’t draw any conclusions about that either.  If the DNA didn’t get passed in the first generation – and according to the numbers we just saw – 58% doesn’t get passed at all, and 26% gets passed in its entirety, leaving only about 15% to receive some portion of one parent’s DNA, which is uniformly NOT 50% except for one instance in almost 1000 events (.1%) – then all bets for subsequent generations are off – they can’t inherit their half if their half is already gone or wasn’t half to begin with.

Based on the mathematical model, Probability of Recombination, Dr. Pike has this to say:

If I’m reading this right, a 10 cM block has a 10% chance of being split into parts during the recombination process of a single conception. Although 10% is not completely negligible, it’s small enough that we can essentially consider “all” or “nothing” as the two dominant outcomes.
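To see how the three outcomes might divide up for a block of a given size, here is a hedged Python sketch.  It rests on an assumption that is not from this article: that crossovers fall at random along the chromosome at an average rate of one per 100 cM, with no interference.  The function name is hypothetical.

```python
from math import exp

def single_meiosis_outcomes(block_cm: float) -> dict[str, float]:
    """Rough chances that a block is passed whole, in part, or not at all.

    Assumption (not from the article): crossovers follow a Poisson process
    averaging one per 100 cM, with no interference.
    """
    p_split = 1.0 - exp(-block_cm / 100.0)  # at least one crossover inside the block
    p_unsplit = 1.0 - p_split
    # An unsplit block comes from one of the parent's two chromosome copies,
    # each equally likely, so "all" and "none" share the remainder evenly.
    return {"all": p_unsplit / 2, "none": p_unsplit / 2, "partial": p_split}

outcomes = single_meiosis_outcomes(10.0)
for name, probability in outcomes.items():
    print(f"{name}: {probability:.3f}")
```

Under that assumption a 10 cM block is split about 9.5% of the time, close to the 10% figure above, with the remainder divided evenly between all and nothing.  Real recombination rates vary by chromosome region and by sex, so treat these numbers as a rough guide only.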

This is the fundamental underlying reason why testing companies are hesitant to predict specific relationships – they typically predict ranges of relationships – 1st to 3rd cousin, for example, based on a combination of averages – of the percentages of DNA shared, the number of segments, the size of segments, the number of SNPs etc.  The testing company, of course, can have no knowledge of how our individual DNA is or was actually passed, meaning how much ancestral DNA we do or don’t receive, so they must rely on those averages, which are very reliable as a continuous population model, and apparently, much less so as discrete individual events.

I would suggest that while we certainly have a large enough sample of inheritance events between me and my two children to be statistically relevant, it’s not a large enough study to draw any broad sweeping conclusions.  It is, after all, only 3 people and we don’t know how this data might hold up compared to a much larger sample of family inheritance events.  I’d like to see 100 or 1000 of these types of studies.

I would be very interested to see how this information holds up for anyone else who would be willing to do the same type of information download of their data for parent/multiple sibling inheritance.  I will gladly make my spreadsheet with the calculations available as a template to anyone who wants to do the same type of study.

I wonder if we would see certain chromosomes that always have higher or lower generational inheritance factors, like the “none” spike we see on chromosome 5.  I wonder if we would see a consistent pattern of male or female children inheriting more or less (all or none) from their parents.  I wonder what other kinds of information would reveal itself in a larger study, and if it would enable us to “weight” match information by chromosome or chromosome/gender, further refining our ability to understand our genetic relationships and to more accurately predict relationships.

I want to thank Dr. David Pike for reviewing and assisting with this article and in particular, for being infinitely patient and making the application of the math to genetics understandable for non-mathematicians.  If you would like to see an example of Dr. Pike’s professional work, here is one of his papers.  You can find his personal web page here and his wonderful DNA analysis tools here.

Clovis People Are Native Americans, and from Asia, not Europe

In a paper published in Nature today, titled “The genome of a Late Pleistocene human from a Clovis burial site in western Montana,” by Rasmussen et al, the authors conclude that the DNA of a Clovis child is ancestral to Native Americans.  Said another way, this Clovis child was a descendant, along with Native people today, of the original migrants from Asia who crossed the Bering Strait.

This paper, over 50 pages including supplemental material, is behind a paywall but it is very worthwhile for anyone who is specifically interested in either Native American or ancient burials.  This paper is full of graphics and extremely interesting for a number of reasons.

First, it marks what I hope is perhaps a spirit of cooperation between genetic research and several Native tribes.

Second, it utilized new techniques to provide details about the individual and who in world populations today they most resemble.

Third, it utilized full genome sequencing and the analysis is extremely thorough.

Let’s talk about these findings in more detail, concentrating on information provided within the paper.

The Clovis are defined as the oldest widespread complex in North America dating from about 13,000 to 12,600 calendar years before present.  The Clovis culture is often characterized by the distinctive Clovis style projectile point.  Until this paper, the origins and genetic legacy of the Clovis people have been debated.

These remains were recovered from the only known Clovis site that is both archaeological and funerary, the Anzick site, on private land in western Montana.  Therefore, the NAGPRA Act does not apply to these remains, but the authors of the paper were very careful to work with a number of Native American tribes in the region in the process of the scientific research.  Sarah L. Anzick, a geneticist and one of the authors of the paper, is a member of the Anzick family whose land the remains were found upon.  The tribes did not object to the research but have requested to rebury the bones.

The bones found were those of a male infant child and were located directly below the Clovis materials and covered in red ochre.  They have been dated  to about 12,707-12,556 years of age and are the oldest North or South American remains to be genetically sequenced.

All 4 types of DNA were recovered from bone fragment shavings: mitochondrial, Y chromosome, autosomal and X chromosome.

Mitochondrial DNA

The mitochondrial haplogroup of the child was D4h3a, a rather rare Native American haplogroup.  Today, subgroups exist, but this D4h3a sample has none of those mutations so has been placed at the base of the D4h3a tree branch, as shown below in a graphic from the paper.  Therefore, D4h3a itself must be older than this skeleton, and they estimate the age of D4h3a to be 13,000 plus or minus 2,600 years, or older.

Clovis mtDNA

Today D4h3a is found along the Pacific coast in both North and South America (Chile, Peru, Ecuador, Bolivia, Brazil) and has been found in ancient populations.  The highest percentage of D4h3a is found at 22% of the Cayapa population in Ecuador.  An ancient sample has been found in British Columbia, along with current members of the Metlakatla First Nation Community near Prince Rupert, BC.

Much younger remains have been found in Tierra del Fuego in South America, dating from 100-400 years ago and from the Klunk Mound cemetery site in West-Central Illinois dating from 1800 years ago.

Its sister branch, D4h3b, consists of only one D4h3 lineage found in Eastern China.

Y Chromosomal DNA

The Y chromosome was determined to be haplogroup Q-L54.  Haplogroup Q and subgroup Q-L54 originated in Asia, and two Q-L54 descendants predominate in the Americas: Q-M3, which has been observed exclusively in Native Americans and Northeastern Siberians, and Q-L54.

The tree researchers constructed is shown below.

Clovis Y

They estimate the divergence between haplogroups Q-L54 and Q-M3, the two major haplogroup Q Native lines, to be about 16,900 years ago, with a range of 13,000 to 19,700 years.

The researchers shared with us the methodology they used to determine when their most recent common ancestor (MRCA) lived.

“The modern samples have accumulated an average of 48.7 transversions [basic mutations] since their MRCA lived and we observed 12 in Anzick.  We infer an average of approximately 36.7 (48.7-12) transversions to have accumulated in the past 12.6 thousand years and therefore estimate the divergence time of Q-M3 and Q-L54 to be approximately 16.8 thousand years (12.6ky x 48.7/36.7).”
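The arithmetic in that quote can be reproduced with a few lines of Python.  This is only a sketch of the quoted expression using the quoted inputs; the function and parameter names are mine, not the researchers’.

```python
def divergence_time_ky(sample_age_ky: float,
                       modern_transversions: float,
                       ancient_transversions: float) -> float:
    """Scale the sample's age by total mutations over mutations since the sample lived."""
    accumulated_since_sample = modern_transversions - ancient_transversions
    return sample_age_ky * modern_transversions / accumulated_since_sample

# Inputs as quoted: 12.6 thousand years, 48.7 modern and 12 Anzick transversions
estimate = divergence_time_ky(12.6, 48.7, 12.0)
print(f"{estimate:.1f} thousand years")
```

With the quoted inputs the expression evaluates to roughly 16.7 thousand years, inside the 13,000 to 19,700 year range given above.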


They termed their autosomal analysis “genome-wide genetic affinity.”  They compared the Anzick individual with 52 Native populations for which known European and African genetic segments have been “masked,” or excluded.  This analysis showed that the Anzick individual showed a closer affinity to all 52 Native American populations than to any extant or ancient Eurasian population using several different, and some innovative and new, analysis techniques.

Surprisingly, the Anzick infant showed less shared genetic history with 7 northern Native American tribes from Canada and the Arctic including 3 Northern Amerind-speaking groups.  Those 7 most distant groups are:  Aleutians, East Greenlanders, West Greenlanders, Chipewyan, Algonquin, Cree and Ojibwa.

The Anzick individual was closer to 44 Native populations from Central and South America, shown on the map below by the red dots.  In fact, South American populations all share a closer genetic affinity with the Anzick individual than they do with modern day North American Native American individuals.

Clovis autosomal cropped

The researchers proposed three migration models that might be plausible to support these findings, and utilized different types of analysis to eliminate two of the three.  The resulting analysis suggests that the split between the North and South American lines happened either before or at the time the Anzick individual lived, and the Anzick individual falls into the South American group, not the North American group.  In other words, the structural split pre-dates the Anzick child.  They conclude on this matter that “the North American and South American groups became isolated with little or no gene flow between the two groups following the death of the Anzick individual.”  This model also implies an early divergence between these two groups.

Clovis branch

In Eurasia, genetic affinity with the Anzick individual decreases with distance from the Bering Strait.

The researchers then utilized the genetic sequences of the 24,000-year-old MA-1 individual from Mal’ta, Siberia, the 40,000-year-old “Tianyuan” individual from China and the 4,000-year-old Saqqaq Palaeo-Eskimo from Greenland.

Again, the Anzick child showed a closer genetic affinity to all Native groups than to either MA-1 or the Saqqaq individual.  The Saqqaq individual is closest to the Greenland Inuit populations and the Siberian populations close to the Bering Strait.  Compared to MA-1, Anzick is closer to both East Asian and Native American populations, while MA-1 is closer to European populations.  This is consistent with earlier conclusions stating that “the Native American lineage absorbed gene flow from an East Asian lineage as well as a lineage related to the MA-1 individual.”  They also found that Anzick is closer to the Native population and the East Asian population than to the Tianyuan individual who seems equally related to a geographically wide range of Eurasian populations.  For additional information, you can see their charts in figure 5 in their supplementary data file.

I have constructed the table below to summarize who matches who, generally speaking.

who matches who

In addition, a French population was compared and showed an affinity only with the Mal’ta individual and, generally, with Tianyuan, who matches all Eurasians at some level.


The researchers concluded that the Clovis infant belonged to a meta-population from which many contemporary Native Americans are descended and is closely related to all indigenous American populations.  In essence, contemporary Native Americans are “effectively direct descendants of the people who made and used Clovis tools and buried this child,” covering it with red ochre.

Furthermore, the data refutes the possibility that Clovis originated via a European, Solutrean, migration to the Americas.

I would certainly be interested to see this same type of analysis performed on remains from the earliest burials in eastern Canada or along the eastern seaboard of the United States.  Pre-contact European admixture has been a hotly contested question, especially in the Hudson Bay region, for a very long time, but we have yet to see any pre-Columbian burials that produce any genetic evidence of such.

Additionally, the Ohio burial suggests that perhaps the mitochondrial DNA haplogroup is or was more widespread geographically in North America than is known today.  A wider comparison to Native American DNA would be beneficial, were it possible.  A quick look at various Native DNA and haplogroup projects at Family Tree DNA doesn’t show this haplogroup in locations outside of the ones discussed here.  Haplogroup Q, of course, is ubiquitous in the Native population.

National Geographic has an article about this revelation, including photos of where the remains were found.  They can make a tuft of grass look great!

Another article can be found at Voice of America News.

Science has a bit more.

Native American Maternal Haplogroup A2a and B2a Dispersion

Recently, in, they published a good overview of a couple of recently written genetic papers dealing with Native American ancestry.  I particularly like this overview, because it’s written in plain English for the non-scientific reader.

In a nutshell, there has been an ongoing, unresolved debate about whether there was one migration into the Americas or more than one.  These papers use these terms a little differently.  They not only talk about entry into the Americas but also dispersion within the Americas, which really is a secondary topic and happened, obviously, after the initial entry event(s).

The primary graphic in this article, shown below and taken from the PNAS article, shows the distribution within the Americas of Native American haplogroups A2a and B2a.

a2a, b2a

Schematic phylogeny of complete mtDNA sequences belonging to haplogroups A2a and B2a. A maximum-likelihood (ML) time scale is shown. (Inset) A list of exact age values for each clade. Credit: Copyright © PNAS, doi:10.1073/pnas.0905753107

As you can see, the locations of these haplogroups are quite different and the various distribution models set forth in the papers account for this difference in geography.

One of the aspects of this article, and the two academic papers on which it is based, that I find particularly encouraging is that the researchers are utilizing full sequence mitochondrial DNA, not just the HVR1 or HVR1+HVR2 regions, as has all too often been done in the past.  In all fairness, until rather recently, the expense of running the full sequence was quite high and there were few (if any) other results in the academic databases to compare the results with.  Now, the cost is quite reasonable, thanks in part to genetic genealogy and new technologies, and so the academic testing standards are changing.  If you’ll note, Alessandro Achilli, one of the authors of these papers and others about Native Americans as well, also comments towards the end that full genome testing will be utilized soon.  I look forward to this new era of research, not only for Native Americans but for all of us searching for our roots.

Read the paper at:

The original academic papers are found here and here.  I encourage anyone with a serious interest in this topic to read these as well.

Mitochondrial DNA Smartmatching – The Rest of the Story

Sometimes, a match is not a match.  I know, now I’ve gone and ruined your day…

One of the questions that everyone wants the answer to when looking at matches, regardless of what kind of DNA testing we’re talking about, is “how long ago?”  How long ago did I share a common ancestor with my match?  Seems like a pretty simple question, doesn’t it?

The answer, especially with mitochondrial DNA, is not terribly straightforward.  A perfect example of this fell into my lap this week, and I’m sharing it with you.

Mitochondrial DNA – A Short Primer

There are three regions that are tested in mitochondrial DNA testing for genealogy.  The HVR1 and HVR2 regions are tested at most testing companies, and at Family Tree DNA, the rest of the mitochondrial genome, called the coding region, is tested as well with the mega or full mitochondrial sequence test.  This is the mitochondrial equivalent of Paul Harvey’s “the rest of the story,” and of course we all know that the real story is always in “the rest of the story” or he wouldn’t be telling us about it!

Many times, the rest of the story is critically important.  In mitochondrial DNA, it’s the only way to obtain your full haplogroup designation.  If you don’t want to just be haplogroup J or A or H, you can test the coding region by taking the full sequence test and find out that you’re J1c2 or A2 or H21, and discover the story that goes with that haplogroup.  Guaranteed, it’s a lot more specific than the one that goes with simple J, A or H.  Often it’s the difference between where your ancestor was 2000 years ago and 20,000 years ago – and they probably covered a lot of territory in 18,000 years!

Let’s take a quick look at mitochondrial DNA.

To begin with, the HVR1 and HVR2 regions are called HVR for a reason – it’s short for hypervariable.  And of course, that means they vary, or mutate, a lot more rapidly, as compared to the coding region of the mitochondrial DNA.

In layman’s terms, think of a clock.  No, not a digital clock, an old-fashioned alarm clock.

alarm clock

The entire mitochondrial DNA has 16,569 locations.  The HVR1 and HVR2 regions take up the space on the clock face from 5 till the hour until 5 after the hour.  The rest is the coding region – the mitochondrial “rest of the story.”  The coding region mutates much more slowly than the two HVR regions.

Just to be sure we’re on the same page, let’s talk for just a minute about how mitochondrial haplogroup assignments work.  For a detailed discussion of haplogroup assignments and how they are done, see Bill Hurst’s discussion here.

Generally, a base haplogroup can be reasonably assigned by HVR1 region testing, but not always.  Sometimes the assignment changes with full sequence testing – so what you think you know may not be the end result.

My full haplogroup is J1c2f.  My base haplogroup is J.  I’m on the first branch of J, J1.  On branch J1, I’m on the third stick, c, J1c.  On the third stick J1c, I’m on the second twig, J1c2.  On the second twig, J1c2, I’m leaf f, or J1c2f.  Each of these branches of haplogroup J is determined by a specific mutation that happened long ago and was then passed to all of that person’s offspring, between them and me today.  The question is always, how long ago?
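The branch, stick, twig and leaf idea can be sketched in a few lines of Python.  This is only an illustration, assuming a simple haplogroup name that alternates letters and numbers, like J1c2f.  The function names are my own, and real haplogroup names can be messier than this.

```python
from itertools import groupby


def haplogroup_lineage(name):
    """Return every level from the base haplogroup down to the full name."""
    lineage = []
    prefix = ""
    # Each run of letters or run of digits adds one more level to the tree.
    for _, chars in groupby(name, key=str.isdigit):
        prefix += "".join(chars)
        lineage.append(prefix)
    return lineage


def is_same_or_downstream(haplogroup, ancestor):
    """True if `ancestor` sits on the path from the base to `haplogroup`."""
    return ancestor in haplogroup_lineage(haplogroup)


print(haplogroup_lineage("J1c2f"))
print(is_same_or_downstream("J1c2f", "J1c2"))
```

For J1c2f, the first line prints the path J, J1, J1c, J1c2, J1c2f, and the second confirms that J1c2 sits upstream of J1c2f.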

Mutation Rates – How Long Ago is Long Ago?

While we have a tip calculator at Family Tree DNA for Y-line DNA to predict how long ago 2 Y-line matches shared a most recent common ancestor, we don’t have anything similar for mitochondrial DNA, partly because of the great variation in the mutation rates for the various regions of mitochondrial DNA.  Family Tree DNA does provide guidelines for the HVR1 region, but they are so broad as to be relatively useless genealogically.  For example, at the 50th percentile, you are likely to share a common ancestor with someone whom you match exactly on the HVR1 mutations within 52 generations, or about 1300 years ago, around the year 713.  Wait, I know just who that is in my family tree!
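The arithmetic behind that date is simple enough to sketch.  The 25 years per generation and the 2013 reference year below are assumptions inferred from the numbers above (52 generations, 1300 years, the year 713), not figures published by Family Tree DNA.

```python
def generations_to_years(generations, years_per_generation=25):
    """Convert a number of generations into elapsed years."""
    return generations * years_per_generation


REFERENCE_YEAR = 2013  # assumed year of writing

years_ago = generations_to_years(52)
print(years_ago)                   # years back to the estimated common ancestor
print(REFERENCE_YEAR - years_ago)  # approximate calendar year
```

Change the years per generation to 30 and the same 52 generations stretch back 1560 years, which shows how soft these estimates are.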

These estimates do not take into account the HVR2 or coding regions.

I did some research jointly with another researcher not long ago attempting to determine the mutation rate for those regions.  We found estimates that ranged from 500 years to several thousand years per mutation occurrence, and it wasn’t always clear in the publications whether they were referring to the entire mitochondrial genome or just certain portions.  And then there are those pesky hot-spots that for some reason mutate a whole lot faster than other locations.  We’re not even going there.  Suffice it to say there is a wide divergence in opinion among academics, so we probably won’t be seeing any type of mito-tip calculator anytime soon.

Enter SmartMatching

Family Tree DNA does their best to make our matches useful to us and to eliminate matches that we know aren’t genealogically relevant.

For example, this week, I was working on a client’s DNA Report.  Let’s call him Joe.  Joe is haplogroup J1c2.  I am haplogroup J1c2f.  J1c2f has one additional haplogroup defining mutation, in the coding region, that J1c2 does not have.

Joe and I did not show as matches at Family Tree DNA, even though our HVR1 and HVR2 regions are exact matches.  Now, for a minute, that gave me a bit of a start.  In fact, I didn’t even realize that we were exact matches until I was working with his results at MitoSearch and recognized my own User ID.

I had to think for a minute about why we would not be considered matches at Family Tree DNA, and I was just about ready to submit a bug report, when I realized the answer was my extended haplogroup.  This, by the way, is the picture-perfect example of why you need full sequence testing.

Family Tree DNA knows that we both tested at the full sequence level.  They know that with a different haplogroup, we don’t share a common ancestor in hundreds to thousands of years, so it doesn’t matter if we match exactly on the HVR1 and HVR2 levels, we DON’T match on a haplogroup defining mutation, which, in this case, happens to be in the coding region, found only with full sequence testing.  Even if we have only one mismatch at the full sequence level, if it’s a haplogroup defining marker, we are not considered matches.  Said a different way, if our only difference was location 9055 and 9055 was NOT a haplogroup defining mutation, we would have been considered a match on all three levels – exact matches at the HVR1 and HVR2 levels and a 1 mutation difference at the full sequence level.  So how a mutation is identified, whether it’s haplogroup defining or not, is critical.

In our case, I carry a mutation at marker 9055 in the coding region that defines haplogroup J1c2f.  Joe doesn’t have this mutation, so he is not J1c2f, just J1c2.  So we don’t match.
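To make the logic concrete, here is a minimal Python sketch of the rule as I have described it.  This is not Family Tree DNA's actual code.  The class, the function names and the placeholder mutation labels are all hypothetical, and the only rule modeled is "exact HVR1+HVR2 match, but different haplogroups at the full sequence level means no match."

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MtResult:
    name: str
    haplogroup: str
    hvr_differences: frozenset  # HVR1+HVR2 differences (placeholder labels)
    full_sequence_tested: bool


def naive_hvr_match(a, b):
    """Match on HVR1+HVR2 alone, ignoring haplogroups."""
    return a.hvr_differences == b.hvr_differences


def smart_match(a, b):
    """Drop HVR matches when full sequence haplogroups disagree."""
    if not naive_hvr_match(a, b):
        return False
    both_full = a.full_sequence_tested and b.full_sequence_tested
    if both_full and a.haplogroup != b.haplogroup:
        return False
    return True


# Placeholder HVR results, identical for all three people.
shared_hvr = frozenset({"hvr-placeholder-1", "hvr-placeholder-2"})
me = MtResult("Me", "J1c2f", shared_hvr, True)
joe = MtResult("Joe", "J1c2", shared_hvr, True)
mom = MtResult("Mom", "J1c2f", shared_hvr, True)  # assumed tested, for illustration

print(naive_hvr_match(me, joe))  # True: exact HVR1+HVR2 match
print(smart_match(me, joe))      # False: different haplogroups
print(smart_match(me, mom))      # True: same haplogroup
```

Notice that the naive comparison, which is all you get if you only look at HVR1 and HVR2, cannot tell Joe from Mom.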

So – How Long Ago for Me and Joe?

Dr. Behar in his “Copernican Reassessment of the Mitochondrial DNA Tree,” which has become the virtual Bible of mtDNA, estimates that the J1c2f haplogroup defining mutation at location 9055 occurred about 2000 years ago, plus or minus another 3000 years, which means my ancestor who had that mutation could have lived as long ago as 5000 years.

The mutations that define haplogroup J1c2 occurred about 9800 years ago, plus or minus another 2000.  So we know that Joe and I share a common ancestor who lived about 7,800 – 11,800 years ago, and our lines diverged sometime between then and 2,000 – 5,000 years ago.  So, in round numbers, our most recent common ancestor lived between 2,000 and 9,800 years ago.  Not much chance of identifying that person!

The ability to eliminate “near-misses,” where the HVR1+HVR2 regions match but the people aren’t in the same haplogroup, which is extremely common in haplogroup H, is actually a very useful feature that Family Tree DNA nicknamed SmartMatching.  With over 1000 matches at the HVR1 level, more than 200 at the HVR1+HVR2 level and another 50+ at the full sequence level, Joe certainly didn’t need to have any “misleading” matches included that could have been eliminated by a logic process.

So while Joe and I match, technically, if you only look at the HVR1 and HVR2 levels, we don’t really match, and that’s not evident at MitoSearch or at Ancestry or anyplace else that does not take into consideration both full sequence AND haplogroup defining mutations.  Family Tree DNA is the only company that does this.  Ancestry does not test at the full sequence level, so you can’t even get a full haplogroup assignment there, which is another reason, aside from inaccurate matches, that Ancestry customers often retest at Family Tree DNA.

It’s interesting to think about the fact that 2 people can match exactly at the HVR1+HVR2 levels, but the distance of the relationship can be vastly different.  I also match my mother on the HVR1+HVR2 levels, exactly, and our common ancestor is her.  So the distance to a common ancestor with an exact HVR1+HVR2 match can be anyplace from one generation (Mom) to thousands of years (Joe), and there is no way to tell the difference without full sequence testing and in this case, SmartMatching.

And that, my friends, is the rest of the story!