Let me introduce you to the Million Mito Project team.
Left to right, Goran Runfeldt, Dr. Paul Maier, me, and Dr. Miguel Vilar. And yes, I know we look kind of like a band😊. The Merry Mito Band maybe, except, trust me, I can’t sing.
Yes, we finally, finally got to meet in person recently, and let me tell you, that was one joyful meeting. I hadn’t realized that while I knew everyone, not everyone else had met in person before.
We have been working for almost two years together via Zoom, but separately. Just 10 days after the Million Mito Project was announced, we went into Covid lockdown.
It’s difficult to work remotely on such a huge collaborative project, but we have been making inroads, albeit slower than we had initially hoped.
Complicating this was the merger of FamilyTreeDNA with myDNA in January of 2021, with Bennett Greenspan stepping down as the CEO in that process. Bennett greenlit the Million Mito project initially. (Thank you, Bennett!)
Thankfully, the new CEO, Dr. Lior Rauschberger, continued that greenlight without hesitation as soon as our team was able to inform him about this wonderful scientific project that was underway. (Thank you, Lior!)
I can’t tell you what a HUGE relief that was.
While all change is challenging, and complicated by the Covid landscape, life events, and geographic distance, that merger really was the right decision. Lior is committed to scientific research, discovery, and the genealogy marketspace. He’s looking to expand, not contract.
You’re probably wondering where we are now in the Million Mito process.
Million Mito Project Update
I’d like to provide a brief update.
- We have an academic paper in the final stages of the submission process, but this paper is not the final tree. It is, however, something extremely cool and important to the history of womankind! I can’t say more until publication, but I’ll write an article when the paper is published.
- The team hopes to work with a million samples across all sources, including FamilyTreeDNA testers, research-consented Genographic samples, GenBank, and other academic samples. Not all samples from those sources are full mitochondrial sequences, and not all of them necessarily pass our QC checks.
If you haven’t yet taken a full sequence test, you can help reach the one million goal by ordering a mitochondrial DNA test at FamilyTreeDNA, here. If you tested at a lower level some years back, please sign on to your account and upgrade so you can be a part of this scientific frontier.
- We discovered that the authors of Phylotree never documented the “recipe” for reconstructing the tree behind the scenes, so we can’t exactly use the recipe for Phylotree as the basis for constructing a future tree.
- We have been in the process of writing phylogenetic software that arrives at a similar tree to use as a baseline reference structure in order to preserve as many of the current Phylotree haplogroup names as possible.
Hand curation and placement are possible for hundreds or a few thousand samples, but not for large numbers. While phylogenetic software to do this kind of work has existed for a long time, it typically can’t handle huge trees like the one we are building.
Phylogenetic methods also struggle with highly recurrent mutations, and rapid star-burst expansions that we see on the human trees. A phylogenetic problem of this magnitude requires lots of innovations to correctly interpret lineage history from complex mutations.
Automated software to handle very large numbers of sequences must be adapted or developed.
- Furthermore, simply building upon an existing scaffold without automating the process does not provide an ongoing, sustainable way to find new dividing branches that form internally within the tree, versus at the tips. In other words, adding new branches based on common mutations is only easy when you’re simply appending a new haplogroup to an existing one.
For example, I might have a new haplogroup J1c2f1 derived from J1c2f. That’s easy. It’s another matter entirely if haplogroup J1 itself, high up in the tree, were broken into multiple new branches. Only automated software can “reconstruct” the tree regularly to discover new major branches as the results of more testers become available.
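For readers who think in code, here is a minimal Python sketch of why the "easy" case is easy. It assumes haplogroup names simply alternate letters and numbers, so the parent can be read off the name. That is a simplification for illustration only (real names have exceptions), and the function names are hypothetical, not part of any Million Mito software.

```python
import re
from typing import List


def name_tokens(haplogroup: str) -> List[str]:
    """Split a name such as 'J1c2f1' into ['J', '1', 'c', '2', 'f', '1']."""
    return re.findall(r"[A-Z]+|[a-z]+|\d+", haplogroup)


def name_lineage(haplogroup: str) -> List[str]:
    """List every ancestor implied by the name, ending with the name itself."""
    tokens = name_tokens(haplogroup)
    return ["".join(tokens[: i + 1]) for i in range(len(tokens))]


def is_simple_extension(new_name: str, existing_name: str) -> bool:
    """True when new_name just appends one token to existing_name (a new tip)."""
    return name_lineage(new_name)[:-1] == name_lineage(existing_name)


print(name_lineage("J1c2f1"))
print(is_simple_extension("J1c2f1", "J1c2f"))
```

Appending J1c2f1 under J1c2f touches nothing else in the list. Splitting J1 itself would change the meaning of every name that begins with J1, which is why that case needs the tree to be rebuilt rather than patched.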
Let me share some examples of the kinds of challenges that we’ve encountered. Not only are these interesting, but they are also educational.
These figures are from Paul Maier’s RootsTech presentation, which I strongly recommend that you view, here.
Mitochondrial DNA is both fascinating and habit-forming. The more you know, the more you want to know.
Let’s start with the basics. Haplogroups are defined by one or more mutations that everyone upstream does NOT have, and everyone downstream DOES have.
Pretty simple so far, right?
Here’s an example of a nice simple mutation that is one of the multiple mutations that define haplogroup L1, near the base of the mitochondrial tree (Mitochondrial Eve) in the center. At location 3666, the “normal” value is G, but in this branch, the G in that position has been replaced by an A.
You can see that the other haplogroups shown in the circle by black dots don’t have the G-to-A mutation at location 3666, but the red dot locations do carry that mutation. Therefore, G3666A is one of the mutations that defines haplogroup L1. Haplogroups can be defined by only one unique mutation, or multiple mutations.
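If it helps to see that as code, here is a small Python sketch that reads a label such as G3666A and checks whether a sample carries it. The sample dictionaries (position mapped to the base found there) are a made-up representation for illustration, not FamilyTreeDNA’s data format.

```python
import re
from typing import Dict, NamedTuple


class Substitution(NamedTuple):
    ancestral: str  # the "normal" value, e.g. G
    position: int   # location in the mitochondrial genome, e.g. 3666
    derived: str    # the mutated value, e.g. A


def parse_substitution(label: str) -> Substitution:
    """Parse a simple label like 'G3666A' (no back-mutation marks or indels)."""
    match = re.fullmatch(r"([ACGT])(\d+)([ACGT])", label)
    if match is None:
        raise ValueError(f"not a simple substitution: {label!r}")
    ancestral, position, derived = match.groups()
    return Substitution(ancestral, int(position), derived)


def carries(sample: Dict[int, str], mutation: Substitution) -> bool:
    """True when the sample shows the derived base at the mutation's position."""
    return sample.get(mutation.position) == mutation.derived


g3666a = parse_substitution("G3666A")
red_dot_sample = {3666: "A"}    # hypothetical sample inside haplogroup L1
black_dot_sample = {3666: "G"}  # hypothetical sample outside haplogroup L1

print(carries(red_dot_sample, g3666a), carries(black_dot_sample, g3666a))
```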
Multiple Haplogroup-Defining Mutations
Haplogroups with multiple mutations that define that specific haplogroup are candidates to be split into multiple branches forming new haplogroups at some point in the future when other people test who have:
- One or the other of those mutations, if there are only two
- A subset of the mutations, but not all of them
For example, in the view of the public mitochondrial haplotree at FamilyTreeDNA, which you can view here, you see that haplogroup L1 is defined by a total of 6 mutations. Someday, people may test who have only half (or some other portion) of those mutations, which would cause haplogroup L1 to split or branch into two separate haplogroups.
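Here is a rough Python sketch of that idea: flag any sample that carries some, but not all, of a haplogroup’s defining mutations. G3666A and A7055G come from this article; the other four labels (M3 through M6) and the sample names are placeholders, because the full list of six isn’t given here.

```python
from typing import Dict, FrozenSet, Iterable


def split_candidates(
    defining: Iterable[str], samples: Dict[str, Iterable[str]]
) -> Dict[str, FrozenSet[str]]:
    """Return samples that carry a non-empty, incomplete subset of `defining`."""
    defining_set = frozenset(defining)
    candidates = {}
    for name, mutations in samples.items():
        shared = defining_set & frozenset(mutations)
        if shared and shared != defining_set:
            candidates[name] = shared
    return candidates


# Two real labels from the article plus four placeholders.
l1_defining = ["G3666A", "A7055G", "M3", "M4", "M5", "M6"]

samples = {
    "tester_1": ["G3666A", "A7055G", "M3", "M4", "M5", "M6"],  # has all six
    "tester_2": ["G3666A", "A7055G", "M3"],                    # has only three
    "tester_3": [],                                            # has none
}

print(split_candidates(l1_defining, samples))
```

Only tester_2 is reported. A tester like that would be evidence that the six mutations did not all arise at once, so the branch could be divided.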
Some mitochondrial locations are unstable, such as 16519C, along with a few other hypervariable locations. By unstable, I mean that they have mutated back and forth in the tree many times. The historical branching patterns of such unstable mutations can be difficult to decipher (the technical term is “saturation”), suggesting perhaps that they should not be the foundation for a new haplogroup.
Do we ignore those unstable locations entirely?
After discounting those well-known unstable locations, we still find some mutations, often in the HVR (hypervariable) regions, that occur close to 100 times in the full tree.
This mutation at location 150 from C to T occurred four distinct times just in this small subset of haplogroup L. You can see the 4 locations I’ve bracketed with red boxes.
Is C150T stable enough to form a haplogroup? Multiple haplogroups? Should it be used high in the tree if this affects the complete downstream structure?
This same mutation occurs additional times further downstream in the tree, as well.
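One simple way to spot mutations like this is to count how many separate branches each mutation appears on. The Python sketch below does that for a toy tree. The branch names and the other mutation labels are invented for illustration, and this is not necessarily how the project’s software measures recurrence.

```python
from collections import Counter
from typing import Dict, Iterable


def recurrence_counts(branch_mutations: Dict[str, Iterable[str]]) -> Counter:
    """Count the number of distinct branches on which each mutation arises."""
    counts: Counter = Counter()
    for mutations in branch_mutations.values():
        counts.update(set(mutations))  # count once per branch
    return counts


# Toy tree: each key is a branch, each value lists mutations on that branch.
toy_branches = {
    "branch_a": ["C150T", "P1"],
    "branch_b": ["P2"],
    "branch_c": ["C150T"],
    "branch_d": ["C150T", "P3"],
    "branch_e": ["P4", "C150T"],
}

counts = recurrence_counts(toy_branches)
recurrent = sorted(m for m, n in counts.items() if n > 1)
print(counts["C150T"], recurrent)
```

A mutation that turns up on many unrelated branches is a weaker signal of shared ancestry than one that arose only once.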
Of course, some haplogroups are defined by reverse mutations, where the original mutation reverts back to its original state.
What about locations that have as many as 3 reverse mutations, which means that one location mutates back and forth 6 times in total? Kind of like a drunken sailor zigging and zagging along the street.
If we counted each mutation and reversal as a new haplogroup, we would have 6 new haplogroups based on this one single location in one parent haplogroup. Is that accurate, or should we ignore it altogether?
Here’s an example of one mutation and a corresponding back mutation.
In this scenario, the mutation of location 7055 from A to G occurred once in the formation of haplogroup L1. However, a back mutation took place, signified by the ! (exclamation mark) after the A, which is a defining mutation for haplogroup L1c3. All of the other L1c haplogroups still carry the A to G mutation, while L1c3 does not.
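To make the bookkeeping concrete, here is a Python sketch that walks down a path of mutations and tracks the value at one location, treating a label with a trailing ! as a back mutation. The two paths are stripped down to the 7055 events described above, so they are a simplification rather than the full list of mutations on those branches.

```python
import re
from typing import Iterable

LABEL = re.compile(r"([ACGT])(\d+)([ACGT])(!*)")


def state_at(position: int, start: str, path: Iterable[str]) -> str:
    """Follow mutations from root to tip and return the base at `position`."""
    state = start
    for label in path:
        match = LABEL.fullmatch(label)
        if match is None:
            raise ValueError(f"unrecognized label: {label!r}")
        before, pos, after, _bangs = match.groups()
        if int(pos) != position:
            continue  # a mutation at some other location
        if before != state:
            raise ValueError(f"{label} does not follow from state {state}")
        state = after
    return state


path_to_l1c = ["A7055G"]              # mutation inherited from L1
path_to_l1c3 = ["A7055G", "G7055A!"]  # same mutation, then the back mutation

print(state_at(7055, "A", path_to_l1c), state_at(7055, "A", path_to_l1c3))
```

The same function handles a location that zigs and zags several times: each extra label on the path just flips the state again.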
In some scenarios, the same location bounces back and forth. Should it still be counted as a haplogroup defining mutation, or is it simply “noise”?
How do heteroplasmies play into this scenario?
Heteroplasmies occur when more than one value is discerned in an individual’s DNA at a specific location. Heteroplasmies do not define haplogroups, but they are reported in your personal results.
To be reported as a heteroplasmy, both values need to be detected at a level of over 20%. In the above scenario, if both G and A were found more than 20% of the time, it would be counted as a heteroplasmy with a special notation.
For example, if G and A are both found more than 20% of the time, the notation would be R instead of either G or A. If the location was G7055, above, and G and A were both found above 20%, the notation would be G7055R.
However, if G was found 81% of the time or more, then it would be counted as G, which is “normal,” and if A was found 81% of the time or more, then the value would be reported as A, a mutation. If we see the normal state of G, then an A, then a G, is that a mutation and a back mutation? How many samples would need to contain that back mutation to count it as a mutation and not an aberration, an undetected borderline heteroplasmy slipping back and forth over the threshold, or simply noise?
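The threshold rule can be sketched in a few lines of Python. The two-letter codes are the standard IUPAC ambiguity codes (R means A or G). How a value sitting exactly on the 20% line is treated is my assumption, and the read fractions are made up. The example follows the G7055 illustration above.

```python
from typing import Dict, Optional

# Standard IUPAC codes for a mix of two bases.
IUPAC_TWO_BASE = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("AC"): "M",
    frozenset("GT"): "K",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
}


def call_base(fractions: Dict[str, float], threshold: float = 0.20) -> str:
    """Return one base, or an IUPAC code when two bases exceed the threshold."""
    present = frozenset(b for b, f in fractions.items() if f > threshold)
    if len(present) == 1:
        return next(iter(present))
    if len(present) == 2:
        return IUPAC_TWO_BASE[present]
    raise ValueError("this sketch only handles one or two bases over threshold")


def report(reference: str, position: int,
           fractions: Dict[str, float]) -> Optional[str]:
    """Return a label such as 'G7055R', or None when the call is the reference."""
    call = call_base(fractions)
    return None if call == reference else f"{reference}{position}{call}"


print(report("G", 7055, {"G": 0.60, "A": 0.40}))
print(report("G", 7055, {"G": 0.90, "A": 0.10}))
print(report("G", 7055, {"G": 0.10, "A": 0.90}))
```

A sample measured at 79% G in one run and 81% in another would flip between R and G, which is exactly the kind of borderline case the questions above are about.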
Transitions Versus Transversions
There are two types of mutations, transitions and transversions, that probably should be weighted differently – but how differently, and why?
Some types of mutations occur more easily than others and are therefore more common. Paul explains this very well in his RootsTech video, but in a nutshell, transitions between T/C and A/G are much more common than transversions between A/C, G/T, C/G, and A/T. Therefore, transversions are noted with a small letter, shown above as T7624a.
In phylogenetics, the rarer mutation which is chemically less likely to occur (transversion) is weighted more heavily than the likelier mutations (transitions).
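Here is a short Python sketch of the distinction. A and G are purines, C and T are pyrimidines, and a change that stays within one group is a transition. The weights in `weighted_cost` are arbitrary placeholders chosen only to show the mechanics. How much heavier a transversion should count is one of the open questions, and these numbers are not the team’s.

```python
from typing import Iterable, Tuple

PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def mutation_type(ancestral: str, derived: str) -> str:
    """Classify a single-base change as 'transition' or 'transversion'."""
    if ancestral == derived:
        raise ValueError("no change")
    pair = {ancestral, derived}
    same_group = pair <= PURINES or pair <= PYRIMIDINES
    return "transition" if same_group else "transversion"


def label(ancestral: str, position: int, derived: str) -> str:
    """Write the label, using a small letter for a transversion."""
    is_transversion = mutation_type(ancestral, derived) == "transversion"
    shown = derived.lower() if is_transversion else derived
    return f"{ancestral}{position}{shown}"


def weighted_cost(changes: Iterable[Tuple[str, str]],
                  transition_weight: float = 1.0,
                  transversion_weight: float = 2.0) -> float:
    """Sum placeholder weights over a list of (ancestral, derived) changes."""
    return sum(
        transition_weight if mutation_type(a, d) == "transition"
        else transversion_weight
        for a, d in changes
    )


print(label("T", 7624, "A"), label("G", 3666, "A"))
print(weighted_cost([("G", "A"), ("T", "A")]))
```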
Insertions are another type of challenge. Insertions happen when extra DNA is inserted at a specific location, kind of like the genetic equivalent of cutting in line.
In this graphic, we see that at location 5899, there’s an extension of .XC, written as 5899.XC. This means that at this location, you’ll find an unknown or varying number of additional Cs inserted. Paul showed several example sequences in the box at upper left. In some people who have this mutation, there are only one or two inserted Cs. In other people, there are several Cs, shown in the bottom two sequences.
You might recognize this as a phenomenon similar to Y DNA STRs which are short tandem repeats. Of course, we don’t use STRs for haplogroup identification in Y DNA. How should we handle insertions, especially multiple insertions, in building the Mitotree?
We see deletions of DNA too, indicated by a small “d” after the location. In some cases, we find large deletions.
At location 8281, there is a 9 base-pair deletion (8281 through 8289) that is one of the haplogroup defining mutations for haplogroup L0a2. We find a 9 base-pair deletion in exactly the same location again within subclades of haplogroups B and U.
Is there something about this specific location that makes it more prone to deletions, and specifically a deletion of exactly 9 base pairs?
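As a quick check on the arithmetic, locations 8281 through 8289 inclusive are 9 base pairs. The Python sketch below assumes the deletion is written as a range followed by a small d, such as 8281-8289d, and works out its length.

```python
import re
from typing import Tuple


def parse_deletion(label: str) -> Tuple[int, int]:
    """Parse '8281-8289d' or a single-base '8281d' into (start, end)."""
    match = re.fullmatch(r"(\d+)(?:-(\d+))?d", label)
    if match is None:
        raise ValueError(f"not a deletion label: {label!r}")
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) else start
    if end < start:
        raise ValueError("end of range is before start")
    return start, end


def deletion_length(label: str) -> int:
    """Number of base pairs removed, counting both ends of the range."""
    start, end = parse_deletion(label)
    return end - start + 1


print(deletion_length("8281-8289d"))
```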
Of course, we’re seeking all of these answers.
The team has been writing code to create structural trees based on various scenarios and trying to determine which ones make the most sense, all factors considered.
The current official tree, meaning the 2016 Build 17 version of Phylotree, is based on about 8,000 samples. Working with one million versus 8,000 is a challenge that ramps exponentially, necessitating substantial computing power.
Working with 125 times more data provides amazing potential, but it has also introduced challenges that never had to be addressed before. It’s evident, to us at least, why Phylotree wasn’t updated after 2016. The tools simply don’t exist.
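A little arithmetic shows why the workload grows much faster than the sample count. The Python sketch below assumes, purely for illustration, a method that compares every pair of samples once. Actual tree-building methods differ, so treat this as a feel for scale rather than a description of the project’s software.

```python
from math import comb


def pair_count(samples: int) -> int:
    """Number of distinct sample pairs, n * (n - 1) / 2."""
    return comb(samples, 2)


old_size = 8_000
new_size = 1_000_000

print(new_size // old_size)                          # growth in samples
print(pair_count(old_size), pair_count(new_size))    # pairs to compare
print(pair_count(new_size) // pair_count(old_size))  # growth in pairs
```

The samples grow 125-fold, but the number of pairs grows more than 15,000-fold.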
We fully expect hundreds if not thousands of new haplogroups to form. Today, Paul’s haplogroup is U5a2b2a which was formed about 5,000 years ago during the Bronze Age.
The haplogroup itself is useful to determine roughly where your ancestors were at that time, and often provides information about more recent population group history, but you need mitochondrial DNA matching to provide more genealogically useful information.
Paul’s test results show that he has 8 extra mutations, which means those mutations are in addition to his haplogroup-defining mutations. These extra mutations are what make genealogical matching so useful.
Paul has 16 full sequence matches that match him at a genetic distance of 3 mutations or fewer, although due to privacy restrictions at FamilyTreeDNA, we can’t see which matches share which mutations.
Given that Paul has 8 extra mutations, this means that it’s possible that one or more new haplogroups will be formed using some or all of those 8 extra mutations, and that those people who match him at a GD of 3 or less will very likely be members of a newly formed haplogroup.
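Here is a simplified Python sketch of the idea. It treats genetic distance as the number of mutations found in one person but not the other. FamilyTreeDNA’s actual matching rules aren’t described in this article and may differ. The mutation labels (x1 to x8, y1) and match names are placeholders, not Paul’s real results.

```python
from typing import Dict, Iterable, List


def genetic_distance(a: Iterable[str], b: Iterable[str]) -> int:
    """Count mutations present in exactly one of the two lists."""
    return len(set(a) ^ set(b))


def matches_within(target: Iterable[str],
                   others: Dict[str, Iterable[str]],
                   max_distance: int = 3) -> List[str]:
    """Names of people whose distance to the target is at most max_distance."""
    target_set = set(target)
    return sorted(
        name for name, mutations in others.items()
        if genetic_distance(target_set, mutations) <= max_distance
    )


extra_mutations = {f"x{i}" for i in range(1, 9)}  # 8 placeholder extras

matches = {
    "match_a": extra_mutations,                  # shares all 8
    "match_b": extra_mutations - {"x7", "x8"},   # shares 6 of 8
    "match_c": {"x1", "x2", "x3", "y1"},         # shares 3, plus one of its own
}

print(matches_within(extra_mutations, matches))
```

In this toy case, match_a and match_b fall within a distance of 3 and would be the likeliest to land in the same new haplogroup, while match_c would not.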
Here’s a comparison of Paul’s haplogroup today, at left, with the newly created U5a2b2a branch and resulting subclades in a beta version of our experimental Mitotree, at right. This moves Paul’s new haplogroup, the pink node at right, from 5,000 to 500 years ago which is clearly within a genealogically relevant timeframe.
The single haplogroup, U5a2b2a, now has been expanded to 7 subgroups. If U5a2b2a is representative of the expansion capability of the entire tree, that’s a 7-fold increase.
Of Paul’s 16 matches, those with the same new haplogroup are those where he needs to focus his genealogical research.
Where Are We?
This is not a commitment, but we expect to release a sneak preview of the new Mitotree this year.
If you have extra or missing mutations, especially in the coding region, you and your close matches may very well receive a new, expanded haplogroup.
Highly refined haplogroups will improve the ability to use mitochondrial DNA for genealogical purposes – similar to what the Big Y-700 SNP testing and the expanded haplotree have done for Y DNA.
Like with Y DNA, you’ll want to use your new haplogroup in combination with genealogical trees.
The more people that test, the more success stories emerge, and the more people that WILL test. Just think what would happen if everyone who took a Y or autosomal DNA test also took a mitochondrial DNA test. We’d be bulldozing through brick walls every day.
I don’t know about you, but I have so many women in my trees with no parents. I need more tools and can hardly wait.
The new Mitotree is fueled by the Million Mito Project which is fueled by full sequence DNA testing, so please purchase yours today.
And yes, in case you were wondering, the new Mitotree will be free and public, just like the existing Mitochondrial DNA Tree and Y DNA Tree are at FamilyTreeDNA today.
You can read more about the Million Mito project here and here.
You can watch Paul’s Million Mito RootsTech presentation, here.
Paul, Miguel and I will be co-presenting Mitochondrial DNA Academy on Saturday, April 23, during the ECCGC Conference which you can read about here and register here.