Million Mito Project Team – Introduction and Progress Update

Let me introduce you to the Million Mito Project team.

Left to right, Goran Runfeldt, Dr. Paul Maier, me, and Dr. Miguel Vilar. And yes, I know we look kind of like a band😊. The Merry Mito Band maybe, except, trust me, I can’t sing.

Yes, we finally, finally got to meet in person recently, and let me tell you, that was one joyful meeting. I hadn’t realized that while I know everyone, not everyone else had met in person before.

We have been working for almost two years together via Zoom, but separately. Just 10 days after the Million Mito Project was announced, we went into Covid lockdown.

It’s difficult to work remotely on such a huge collaborative project, but we have been making inroads, albeit slower than we had initially hoped.

Complicating this was the merger of FamilyTreeDNA with myDNA in January of 2021, with Bennett Greenspan stepping down as the CEO in that process. Bennett greenlit the Million Mito project initially. (Thank you, Bennett!)

Thankfully, the new CEO, Dr. Lior Rauschberger, continued that greenlight without hesitation as soon as our team was able to inform him about this wonderful scientific project that was underway. (Thank you, Lior!)

I can’t tell you what a HUGE relief that was.

While all change is challenging, and complicated by the Covid landscape, life events, and geographic distance, that merger really was the right decision. Lior is committed to scientific research, discovery, and the genealogy marketspace. He’s looking to expand, not contract.

You’re probably wondering where we are now in the Million Mito process.

Million Mito Project Update

I’d like to provide a brief update.

  • We have an academic paper in the final stages of the submission process, but this paper is not the final tree. It is, however, something extremely cool and important to the history of womankind! I can’t say more until publication, but I’ll write an article when the paper is published.
  • The team hopes to work with a million samples between all sources including FamilyTreeDNA testers, research-consented Genographic samples, Genbank, and other academic samples. Not all samples from those sources are full mitochondrial sequences, or necessarily pass our QC checks.

If you haven’t yet taken a full sequence test, you can help reach the one million goal by ordering a mitochondrial DNA test at FamilyTreeDNA, here. If you tested at a lower level some years back, please sign on to your account and upgrade so you can be a part of this scientific frontier.

  • We discovered that the authors of Phylotree never documented the “recipe” for reconstructing the tree behind the scenes, so we can’t exactly use the recipe for Phylotree as the basis for constructing a future tree.
  • We have been in the process of writing phylogenetic software that arrives at a similar tree to use as a baseline reference structure in order to preserve as many of the current Phylotree haplogroup names as possible.

Hand curation and placement is possible for hundreds or a few thousand samples, but it’s not possible for large numbers. While phylogenetic software to do this kind of work has existed for a long time, it typically can’t handle huge trees like what we are building.

Phylogenetic methods also struggle with highly recurrent mutations, and rapid star-burst expansions that we see on the human trees. A phylogenetic problem of this magnitude requires lots of innovations to correctly interpret lineage history from complex mutations.

Automated software to handle very large numbers of sequences must be adapted or developed.

  • Furthermore, simply building upon an existing scaffold without automating the process does not provide an ongoing, sustainable way to find new dividing branches that arise internally within the tree, versus at the tips. In other words, adding new branches based on common mutations is only easy when you’re simply appending a new haplogroup to an existing one.

For example, I might have a new haplogroup J1c2f1 derived from J1c2f. That’s easy. It’s another matter entirely if haplogroup J1 itself, high up in the tree, were broken into multiple new branches. Only automated software can “reconstruct” the tree regularly to discover new major branches as the results of more testers become available.
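To make the “appending is easy” case concrete, here’s a minimal Python sketch that reads the parent chain straight out of a haplogroup name. It assumes names strictly alternate letter groups and number groups, as J1c2f1 does. The function names are mine for illustration only, and names that break the pattern (for example, ones containing apostrophes) are not handled.

```python
import re
from typing import List, Optional

def parent_name(haplogroup: str) -> Optional[str]:
    """Drop the last letter-or-number group; None once only the base letter is left."""
    tokens = re.findall(r"[A-Za-z]+|\d+", haplogroup)
    if len(tokens) <= 1:
        return None
    return "".join(tokens[:-1])

def lineage(haplogroup: str) -> List[str]:
    """Return the haplogroup followed by each ancestor implied by its name."""
    names = [haplogroup]
    while (up := parent_name(names[-1])) is not None:
        names.append(up)
    return names

print(lineage("J1c2f1"))
# ['J1c2f1', 'J1c2f', 'J1c2', 'J1c', 'J1', 'J']
```

Adding J1c2f1 under J1c2f only extends this chain by one name. A new split inside J1 would sit above every name that starts with J1, which is why that case can’t be handled by appending.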

Challenges

Let me share some examples of the kinds of challenges that we’ve encountered. Not only are these interesting, but they are also educational.

These figures are from Paul Maier’s RootsTech presentation, which I strongly recommend that you view, here.

Mitochondrial DNA is both fascinating and habit-forming. The more you know, the more you want to know.

Let’s start with the basics. Haplogroups are defined by one or more mutations that everyone upstream does NOT have, and everyone downstream DOES have.

Pretty simple so far, right?

Haplogroup-Defining Mutations

Here’s an example of a nice simple mutation that is one of the multiple mutations that define haplogroup L1, near the base of the mitochondrial tree (Mitochondrial Eve) in the center. At location 3666, the “normal” value is G, but in this branch, the G in that position has been replaced by an A.

You can see that the other haplogroups shown in the circle by black dots don’t have the G-to-A mutation at location 3666, but the red dot locations do carry that mutation. Therefore, G3666A is one of the mutations that defines haplogroup L1. Haplogroups can be defined by only one unique mutation, or multiple mutations.
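Here’s a rough Python sketch of that rule, assuming a toy tree where each branch lists its defining mutations. G3666A and A7055G are two of the L1 mutations mentioned in this article. The child branch “toy-L1x” and its mutation C1000T are made up purely for illustration, and the sketch ignores back mutations and unstable locations.

```python
# Toy tree: each node lists its parent and the mutations that define it.
# "toy-L1x" and C1000T are invented placeholders, not real tree content.
TREE = {
    "root": {"parent": None, "defining": set()},
    "L1": {"parent": "root", "defining": {"G3666A", "A7055G"}},
    "toy-L1x": {"parent": "L1", "defining": {"C1000T"}},
}

def path_to_root(node):
    """Return the node and all of its ancestors, starting at the node."""
    path = []
    while node is not None:
        path.append(node)
        node = TREE[node]["parent"]
    return path

def required_mutations(node):
    """Everyone in a haplogroup carries its own and all upstream defining mutations."""
    required = set()
    for name in path_to_root(node):
        required |= TREE[name]["defining"]
    return required

def assign_haplogroup(sample_mutations):
    """Pick the deepest node whose full set of inherited mutations is present."""
    candidates = [n for n in TREE if required_mutations(n) <= sample_mutations]
    return max(candidates, key=lambda n: len(path_to_root(n)))

print(assign_haplogroup({"G3666A", "A7055G", "T5000C"}))  # L1
```

The extra mutation T5000C (also invented) doesn’t define anything in this toy tree, so the sample stays in L1.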

Multiple Haplogroup-Defining Mutations

Haplogroups with multiple mutations that define that specific haplogroup are candidates to be split into multiple branches, forming new haplogroups at some point in the future, when other people test who have:

  1. One or the other of those mutations, if there are only two
  2. A subset of the mutations, but not all of them


For example, in the view of the public mitochondrial haplotree at FamilyTreeDNA which you can view here, you see that haplogroup L1 is defined by a total of 6 mutations. Someday, people may test who have only half (or a portion) of those mutations, which would cause haplogroup L1 to split or branch into two separate haplogroups.
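A minimal sketch of how such a split candidate could be spotted, assuming each tester is represented as a set of mutations. Only G3666A and A7055G are taken from this article. M3 through M6 are placeholder labels standing in for the other four L1 mutations, and the sample names are invented.

```python
# Six defining mutations; M3..M6 are placeholders for the four not named here.
L1_DEFINING = {"G3666A", "A7055G", "M3", "M4", "M5", "M6"}

def split_candidates(defining, samples):
    """Group samples that carry some, but not all, of the defining mutations."""
    patterns = {}
    for name, mutations in samples.items():
        carried = frozenset(defining & mutations)
        if carried and carried != defining:  # proper, non-empty subset
            patterns.setdefault(carried, []).append(name)
    return patterns

samples = {
    "s1": set(L1_DEFINING),              # carries all six: ordinary L1
    "s2": {"G3666A", "A7055G", "M3"},    # carries only three of the six
    "s3": {"G3666A", "A7055G", "M3"},
    "s4": {"T5000C"},                    # carries none: not in this branch
}

for carried, names in split_candidates(L1_DEFINING, samples).items():
    print(sorted(carried), names)
```

In this toy data, s2 and s3 share three of the six mutations, which would suggest an upper branch defined by those three and a lower branch defined by the remaining three.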

Unstable Mutations

Some mitochondrial locations are unstable, such as 16519C, along with a few other hypervariable locations. By unstable, I mean that they have mutated back and forth in the tree many times. The historical branching patterns of such unstable mutations can be difficult to decipher (the technical term is “saturation”), suggesting perhaps that they should not be the foundation for a new haplogroup.

Do we ignore those unstable locations entirely?

After discounting those well-known unstable locations, we still find some mutations, often in the HVR (hypervariable) regions that occur close to 100 times in the full tree.

This mutation at location 150 from C to T occurred four distinct times just in this small subset of haplogroup L. You can see the 4 locations I’ve bracketed with red boxes.

Is C150T stable enough to form a haplogroup? Multiple haplogroups? Should it be used high in the tree if this affects the complete downstream structure?

This same mutation occurs additional times further downstream in the tree, as well.
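One simple way to put a number on that instability is to count how many separate branches the same mutation appears on. This is only a sketch with invented branch labels and toy data, and the threshold of 3 is an arbitrary assumption, not a rule the team has published.

```python
from collections import Counter

# Toy data: branch label -> mutations that arose on that branch (all invented).
BRANCH_MUTATIONS = {
    "branch_a": ["C150T", "G3666A"],
    "branch_b": ["C150T"],
    "branch_c": ["A7055G"],
    "branch_d": ["C150T", "T5000C"],
    "branch_e": ["C150T"],
}

def recurrence_counts(branch_mutations):
    """Count the number of distinct branches on which each mutation occurs."""
    counts = Counter()
    for mutations in branch_mutations.values():
        counts.update(set(mutations))
    return counts

def recurrent(branch_mutations, threshold=3):
    """Return mutations seen on at least `threshold` branches (threshold is an assumption)."""
    counts = recurrence_counts(branch_mutations)
    return sorted(m for m, n in counts.items() if n >= threshold)

print(recurrence_counts(BRANCH_MUTATIONS)["C150T"])  # 4
print(recurrent(BRANCH_MUTATIONS))                   # ['C150T']
```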

Reverse Mutations

Of course, some haplogroups are defined by reverse mutations, where the original mutation reverts back to its original state.

What about locations that have as many as 3 reverse mutations, which means that one location mutates back and forth 6 times in total? Kind of like a drunken sailor zigging and zagging along the street.

If we counted each mutation and reversal as a new haplogroup, we would have 6 new haplogroups based on this one single location in one parent haplogroup. Is that accurate, or should we ignore it altogether?

Here’s an example of one mutation and a corresponding back mutation.

In this scenario, the mutation of location 7055 from A to G occurred once in the formation of haplogroup L1. However, a back mutation took place, signified by the ! (exclamation mark) after the A, which is a defining mutation for haplogroup L1c3. All of the other L1c haplogroups still carry the A to G mutation, while L1c3 does not.

In some scenarios, the same location bounces back and forth. Should it still be counted as a haplogroup defining mutation, or is it simply “noise”?
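The 7055 example can be traced in a few lines of Python. This sketch assumes mutations are written as ancestral base, position, derived base, with an optional “!” marking a back mutation, and that the L1c branch itself has no change at 7055. The function name is hypothetical.

```python
import re

MUTATION = re.compile(r"^([ACGT])(\d+)([ACGT])(!*)$")

def state_at(position, reference_base, path):
    """Follow one position down a list of branches; return (final base, number of changes)."""
    base, changes = reference_base, 0
    for branch in path:
        for label in branch:
            m = MUTATION.match(label)
            if m is None or int(m.group(2)) != position:
                continue
            ancestral, derived = m.group(1), m.group(3)
            if ancestral != base:
                raise ValueError(f"{label} does not start from current base {base}")
            base, changes = derived, changes + 1
    return base, changes

# Branches from top to bottom: L1, L1c (assumed unchanged at 7055), L1c3.
L1_TO_L1C3 = [["A7055G"], [], ["G7055A!"]]

print(state_at(7055, "A", L1_TO_L1C3))      # ('A', 2)
print(state_at(7055, "A", L1_TO_L1C3[:2]))  # ('G', 1)
```

The count of changes is the useful part: a location that flips many times along one path is exactly the zigzag pattern described above.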

Heteroplasmies

How do heteroplasmies play into this scenario?

Heteroplasmies occur when more than one value is discerned in an individual’s DNA at a specific location. Heteroplasmies do not define haplogroups, but they are reported in your personal results.

To be reported as a heteroplasmy, both values need to be detected at a level of over 20%. In the above scenario, if both G and A were found greater than 20% of the time, it would be counted as a heteroplasmy with a special notation.

For example, if G and A are both found more than 20% of the time, the notation would be R instead of either G or A. If the location was G7055, above, and G and A were both found above 20%, the notation would be G7055R.

However, if G was found 81% of the time or more, then it would be counted as G, which is “normal,” and if A was found 81% of the time or more, then the value would be reported as A, a mutation. If we see the normal state of G, then an A, then a G, is that a mutation and a back mutation? How many samples would need to contain that back mutation to count it as a mutation and not an aberration, an undetected borderline heteroplasmy slipping back and forth over the threshold, or simply noise?
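Here’s a minimal sketch of that threshold logic, assuming we have read counts for each base at one location. The two-base letters are the standard IUPAC ambiguity codes (R means A or G). Returning N when more than two bases pass the threshold is my own simplification, and the function names are hypothetical.

```python
# Standard IUPAC codes for two-base ambiguity.
IUPAC_PAIRS = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("GT"): "K",
    frozenset("AC"): "M", frozenset("CG"): "S", frozenset("AT"): "W",
}

def call_base(read_counts, threshold=0.20):
    """Report one base, or an ambiguity code when two bases each exceed the threshold."""
    total = sum(read_counts.values())
    present = {base for base, n in read_counts.items() if n / total > threshold}
    if len(present) == 1:
        return next(iter(present))
    if len(present) == 2:
        return IUPAC_PAIRS[frozenset(present)]
    return "N"  # assumption: anything else is reported as undetermined

def notation(reference_base, position, read_counts):
    """Build a label such as G7055R from the reference base, position, and call."""
    return f"{reference_base}{position}{call_base(read_counts)}"

print(notation("G", 7055, {"G": 70, "A": 30}))  # G7055R
print(call_base({"G": 85, "A": 15}))            # G
print(call_base({"G": 10, "A": 90}))            # A
```

A sample sitting at 79% in one test and 81% in another would flip between R and a single base, which is the borderline case the questions above are getting at.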

Transitions Versus Transversions

There are two types of mutations, transitions and transversions, that probably should be weighted differently – but how differently, and why?

Some types of mutations occur more easily than others and are therefore more common. Paul explains this very well in his RootsTech video, but in a nutshell, transitions between T/C and A/G are much more common than transversions between A/C, G/T, C/G, and A/T. Therefore, transversions are noted with a small letter, shown above as T7624a.

In phylogenetics, the rarer mutation which is chemically less likely to occur (transversion) is weighted more heavily than the likelier mutations (transitions).
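The distinction is easy to express in code. A and G are purines, C and T are pyrimidines, and a change that stays within one group is a transition. This sketch follows the lowercase convention described above. The function names are just for illustration, and it makes no attempt to answer how much more weight a transversion should get.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify(ancestral, derived):
    """Transition if both bases are purines or both are pyrimidines, else transversion."""
    same_group = ({ancestral, derived} <= PURINES
                  or {ancestral, derived} <= PYRIMIDINES)
    return "transition" if same_group else "transversion"

def notate(ancestral, position, derived):
    """Write the derived base in lowercase when the change is a transversion."""
    if classify(ancestral, derived) == "transversion":
        derived = derived.lower()
    return f"{ancestral}{position}{derived}"

print(notate("T", 7624, "A"))  # T7624a
print(notate("G", 3666, "A"))  # G3666A
```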

Insertions

Insertions are another type of challenge. Insertions happen when extra DNA is inserted at a specific location, kind of like the genetic equivalent of cutting in line.

In this graphic, we see that at location 5899, there’s an extension of .XC, written as 5899.XC. This means that at this location, you’ll find an unknown or varying number of additional Cs inserted. Paul showed several example sequences in the box at upper left. In some people who have this mutation, there are only one or two inserted Cs. In other people, there are several Cs, shown in the bottom two sequences.

You might recognize this as a phenomenon similar to Y DNA STRs which are short tandem repeats. Of course, we don’t use STRs for haplogroup identification in Y DNA. How should we handle insertions, especially multiple insertions, in building the Mitotree?
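One possible treatment, sketched below, is to collapse every numbered insertion at a location into a single .X label so that the count of inserted Cs is ignored. This assumes individual insertions are written like 5899.1C and 5899.2C. Whether the Mitotree will actually handle them this way is exactly the open question, so treat this as an illustration only.

```python
import re

INSERTION = re.compile(r"^(\d+)\.(\d+)([ACGT])$")

def collapse_insertions(mutations):
    """Replace numbered insertions (5899.1C, 5899.2C, ...) with one 5899.XC label."""
    seen, result = set(), []
    for label in mutations:
        m = INSERTION.match(label)
        if m:
            label = f"{m.group(1)}.X{m.group(3)}"
        if label not in seen:  # keep first occurrence, preserve order
            seen.add(label)
            result.append(label)
    return result

print(collapse_insertions(["5899.1C", "5899.2C", "G3666A"]))
# ['5899.XC', 'G3666A']
```

With this collapsing, someone with one inserted C and someone with several would look identical at 5899, which avoids splitting on an STR-like count but also throws away that information.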

Deletions

We see deletions of DNA too, indicated by a small “d” after the location. In some cases, we find large deletions.

At location 8281, there is a 9 base-pair deletion (8281 through 8289) that is one of the haplogroup defining mutations for haplogroup L0a2. We find a 9 base-pair deletion in exactly the same location again within subclades of haplogroups B and U.

Is there something about this specific location that makes it more prone to deletions, and specifically a deletion of exactly 9 base pairs?
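For readers who like to see the notation unpacked, here’s a small sketch that expands a deletion label into the positions it removes. It assumes a range is written with a hyphen followed by the small “d”, such as 8281-8289d, and a single-position deletion as a number followed by “d”. The function name is hypothetical.

```python
import re

DELETION = re.compile(r"^(\d+)(?:-(\d+))?d$")

def deleted_positions(label):
    """Expand a label like '8281-8289d' into the list of deleted positions."""
    m = DELETION.match(label)
    if m is None:
        raise ValueError(f"not a deletion label: {label}")
    start = int(m.group(1))
    end = int(m.group(2) or start)  # single-position deletion has no range end
    return list(range(start, end + 1))

positions = deleted_positions("8281-8289d")
print(len(positions), positions[0], positions[-1])  # 9 8281 8289
```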

Seeking Answers

Of course, we’re seeking all of these answers.

The team has been writing code to create structural trees based on various scenarios and trying to determine which ones make the most sense, all factors considered.

The current official tree, meaning the 2016 Build 17 version of Phylotree, is based on about 8,000 samples. Working with one million versus 8,000 is a challenge that ramps exponentially, necessitating substantial computing power.

Working with 125 times more data provides amazing potential, but it has also introduced challenges that never had to be addressed before. It’s evident, to us at least, why Phylotree wasn’t updated after 2016. The tools simply don’t exist.
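To give a feel for the scale, here’s a rough sketch using a standard phylogenetics result: the number of possible rooted, strictly branching-in-two tree shapes for n labeled samples is (2n-3)!!, the product of the odd numbers up to 2n-3. This is only an illustration of how fast the search space grows. It is an assumption on my part that this is a fair proxy for difficulty, and it says nothing about how the team’s software actually works.

```python
import math

def rooted_topologies(n):
    """Exact count of rooted bifurcating trees for n >= 2 labeled tips: (2n-3)!!."""
    result = 1
    for k in range(3, 2 * n - 2, 2):
        result *= k
    return result

def log10_rooted_topologies(n):
    """log10 of (2n-3)!! via (2n-2)! / (2**(n-1) * (n-1)!), safe for very large n."""
    ln = math.lgamma(2 * n - 1) - (n - 1) * math.log(2) - math.lgamma(n)
    return ln / math.log(10)

print(1_000_000 // 8_000)        # 125 times more samples
print(rooted_topologies(5))      # 105 possible trees for just 5 samples
for n in (8_000, 1_000_000):
    print(n, "samples: roughly 10 to the power", round(log10_rooted_topologies(n)))
```

No program can try every tree at this scale, which is why shortcuts that start from a sensible baseline structure matter so much.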

Sneak Peek

We fully expect hundreds if not thousands of new haplogroups to form. Today, Paul’s haplogroup is U5a2b2a which was formed about 5,000 years ago during the Bronze Age.

The haplogroup itself is useful to determine roughly where your ancestors were at that time, and often provides information about more recent population group history, but you need mitochondrial DNA matching to provide more genealogically useful information.

Paul’s test results show that he has 8 extra mutations, which means those mutations are in addition to his haplogroup-defining mutations. These extra mutations are what make genealogical matching so useful.

Paul has 16 full sequence matches that match him at a genetic distance of 3 mutations or less, although due to privacy restrictions at FamilyTreeDNA, we can’t see which matches share which mutations.

Given that Paul has 8 extra mutations, this means that it’s possible that one or more new haplogroups will be formed using some or all of those 8 extra mutations, and that those people who match him at a GD of 3 or less will very likely be members of a newly formed haplogroup.
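Here’s a simplified sketch of that reasoning. It treats each person’s extra mutations as a set and counts genetic distance as the number of mutations found in one person but not the other. The labels extra1 through extra8 and the match names are invented placeholders, and FamilyTreeDNA’s actual distance rules are not reproduced here.

```python
def genetic_distance(a, b):
    """Simplified GD: mutations present in one sample but not the other."""
    return len(a ^ b)

def close_matches(me, others, max_gd=3):
    """Return {name: distance} for everyone within max_gd of `me`."""
    distances = {name: genetic_distance(me, muts) for name, muts in others.items()}
    return {name: gd for name, gd in distances.items() if gd <= max_gd}

# Placeholder data: eight extra mutations, as in the example above.
paul = {f"extra{i}" for i in range(1, 9)}
others = {
    "match_a": {f"extra{i}" for i in range(1, 7)},  # shares 6 of 8 -> GD 2
    "match_b": {f"extra{i}" for i in range(1, 4)},  # shares 3 of 8 -> GD 5
    "match_c": set(paul),                           # identical -> GD 0
}

print(close_matches(paul, others))  # {'match_a': 2, 'match_c': 0}
print(sorted(paul & others["match_a"]))
```

The shared set printed on the last line is what a new haplogroup could be built from: mutations carried by Paul and by his close matches, but not by the rest of the parent haplogroup.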

Here’s a comparison of Paul’s haplogroup today, at left, with the newly created U5a2b2a branch and resulting subclades in a beta version of our experimental Mitotree, at right. This moves Paul’s new haplogroup, the pink node at right, from 5,000 to 500 years ago which is clearly within a genealogically relevant timeframe.

The single haplogroup, U5a2b2a, now has been expanded to 7 subgroups. If U5a2b2a is representative of the expansion capability of the entire tree, that’s a 7-fold increase.

Of Paul’s 16 matches, those with the same new haplogroup are those where he needs to focus his genealogical research.

Where Are We?

This is not a commitment, but we expect to release a sneak preview of the new Mitotree this year.

If you have extra or missing mutations, especially in the coding region, you and your close matches may very well receive a new, expanded haplogroup.

Highly refined haplogroups will improve the ability to use mitochondrial DNA for genealogical purposes – similar to what the Big Y-700 SNP testing and the expanded haplotree have done for Y DNA.

Like with Y DNA, you’ll want to use your new haplogroup in combination with genealogical trees.

The more people that test, the more success stories emerge, and the more people that WILL test. Just think what would happen if everyone who took a Y or autosomal DNA test also took a mitochondrial DNA test. We’d be bulldozing through brick walls every day.

I don’t know about you, but I have so many women in my trees with no parents. I need more tools and can hardly wait.

Resources

The new Mitotree is fueled by the Million Mito Project which is fueled by full sequence DNA testing, so please purchase yours today.

And yes, in case you were wondering, the new Mitotree will be free and public, just like the existing Mitochondrial DNA Tree and Y DNA Tree are at FamilyTreeDNA today.

You can read more about the Million Mito project here and here.

You can watch Paul’s Million Mito RootsTech presentation, here.

Paul, Miguel and I will be co-presenting Mitochondrial DNA Academy on Saturday, April 23, during the ECCGC Conference which you can read about here and register here.


You Can Help Keep This Blog Free

I receive a small contribution when you click on some of the links to vendors in my articles. This does NOT increase the price you pay but helps me to keep the lights on and this informational blog free for everyone. Please click on the links in the articles or to the vendors below if you are purchasing products or DNA testing.

Thank you so much.


22 thoughts on “Million Mito Project Team – Introduction and Progress Update”

    • Yes, if you have taken the full sequence test at FamilyTreeDNA, your results will be used in new haplogroup formation and you’re automatically included.

    • Facebook crops photos. The photo in the article renders correctly. I swear….I know nothing……:)

  1. What a great update! So you’re having to code these systems from scratch. A huge undertaking. It also makes sense how a large dataset of samples is necessary to make sense of which of these mutations are significant (or permanent).

    If it fits into your ECCGC talks, I’d be interested in hearing about how the analysis requirements for the mt haplotree differ from those for the Y. An expert manually reviews placement on the Y tree for Big Y results at FTDNA, and I assume it must be in conjunction with its own automated system. I’m curious how that got implemented at FTDNA even while the original mt Phylotree collapsed under its own weight.

    I’m also interested in what the tipping point would be to move away from the letter/number naming for mt haplogroups, or if haplogroup naming based on a terminal mutation is even feasible now.

    I am so excited about the work your team is doing. I have tested or prompted testing/upgrades for seven new mtFull sequences over the past year and a half (four Hs, and one each of U5, L2, and C), and I know there’s still a ways to go. Genealogists discouraging mitochondrial testing in favor of autosomal is definitely a self-fulfilling prophecy, and I have high hopes that we can change the odds with time. A million thank-yous to the Merry Mito Band!!

    • You’re asking the same questions we are asking too. The great news is that you’ll have three of four of us at ECCGC. Thank you so much for being a champion for testing.

  2. Roberta, I really appreciate the efforts of the million mito team, and I have already tested. It is an area that has great promise for us genealogists. If I may make one suggestion, is it possible to use high contrast colors in charts (e.g. orange and blue rather than red and black), so that the 8% of males on the planet that are color blind can read them? DNA graphics tend to be very colorful and we have immense trouble viewing charts such as ethnicity breakdowns that use a lot of colors. I would much rather see shapes or patterns used to denote various groups, but probably 96% of people would disagree.

  3. Oh, phylogenetics is fun!! Granted, I worked mostly with chloroplast DNA. In the words of my Grad Advisor, plant mito DNA is truly a scary place. There’s only a couple of genes that have been found to work in some plants, but not Trillium.

    This is truly exciting work. Can’t wait to see the results. Do keep us posted about resulting publications. I’d do it in a heartbeat, but $$$. Hubby wants the full set!

    Really exciting work. After I finish the morphology paper, next up is the plastid paper!

  4. Thanks to the team for taking on this extremely large task. I am in awe.

    My MtDna full test results are at FTDNA and I have a one degree match tester that shares my 5th great grandmother as a MRCA.

    That said, I have failed at finding additional testers.

    One possibility for finding cousins in my matrilineal line was the connections at Rootstech. I did find one person but they never replied. ONE PROBLEM – I don’t think visitors at RootsTech even realize they have messages.

    But I think more of a problem is the lack of support in the professional genealogy community. I’ve listened to podcasts where the presenter says to “Not Bother” with MtDna as it won’t get you much, or it only connects to females, that are too difficult to research!

    I don’t know how to overcome this problem. But what you and the team are doing will go a long way to overcoming attitudes. So, thanks again to you and the team.

  5. Hi Roberta, maybe you can give me a clue as to a question I have. My wife from Chile tested her MTDNA in 2015 and has had no matches to date. Her haplogroup is C1b. There are 26 people shown as C1b including 10 from Chile but my wife has no matches. It is very curious to me. In the FTDNA mtdna haplogroup tree there are several sub groups such as C1b1, C1b2… Since she is C1b, does that mean she is not a member of any of those other groups?

    Thanks

    • Yes, that means she is not a member of those subgroups. She has enough different mutations to prevent matching. She must have rare DNA and no close matrilineal relatives have yet tested. That’s actually very exciting. What part of Chile are her ancestors from?

  6. This is a great project and I look forward to your updates. I did a full mt-DNA test years ago, and have no exact matches but many matches with a genetic distance =1. My paper tree goes back to my ggg-grandmother so I keep hoping more unknown close “cousins” will test.

  7. Perhaps I am a bit dense but I do not see anywhere instructions to contribute my Full Sequence MtDNA results to the project. I did my test some time ago now at FamilyTreeDNA.

  8. So if we are already included, having tested the full sequence years ago, where do we see results or what has been discovered? Do we have a website to go to?

  9. When I first read Professor Sykes’s book ‘7 Daughters of Eve’, I thought I would be satisfied to discover which of the ‘Daughters’ I was descended from….ha….I am still like a dog with a bone years later. I have so many questions about J2a1a1; my closest matches are between Ireland and Western England as far as I can tell. When did we get separated? Which population did we arrive with? I noticed SNP tracker moved the line from Italy to Germany in the Mesolithic. I had thought I had it all figured out because my grandad on that side is the Italian Gaul branch of R1b, their fate was written in the stars and their Ancestors lived together for thousands of years in a little village in rural Lancashire…or maybe not.

    I then made the mistake of persuading my female cousin on my father’s mother’s side to have her MtDNA tested, oh boy, X2c1 with three variants they have only seen once before in an Australian’s sample, 2 steps away. What? The amusing thing is, I have about 28 known relatives who carry that haplogroup, 4 generations of females live in the same household. It seems to have taken the Eastern route, I am sure that line must be responsible for me being related to an archaic find, a Greenland Eskimo.

    Anyway, lots of people will be hanging on hoping for more information. If we can be grouped closer it will narrow down our searches. I have somehow reabsorbed the variant between R and JT…what a mystery and things like that may perhaps be key? Thank you so much to everyone involved.
