How is a new mitochondrial DNA haplogroup defined? What are the criteria and who decides?
My cousin posed this question and it’s something I’ve wondered about myself.
When I asked this question before, I was told that the answer was three different sequences with the same mutation. But that can’t be the whole story, because when I work on the DNA Reports for people, I see this all the time and they clearly aren’t being grouped into subclades. Furthermore, if that were the case, there would be as many subclades as people – well not quite – but there would certainly be an overwhelming number.
So, what are the decision criteria for defining a new mtDNA haplogroup subgroup?
I asked Bill Hurst. Bill is a longtime project administrator and worked closely with Doron Behar on the RSRS (Reconstructed Sapiens Reference Sequence) project. I knew he would be very familiar with the inner workings of this process, and he’s not entirely covered up by other projects. Bill is in the middle of his annual cross-country trek that always winds up in Houston the first week in November. Odd coincidence, that’s when the Family Tree DNA Conference takes place :)
I want to thank Bill for taking his time to answer this, especially while on the road. Here’s what Bill had to say. The brackets and footnote are mine for clarification.
“First, we are talking about what we usually call mtDNA subclades, not haplogroups. The basic haplogroups have been set in stone for years now. Of course, it can be confusing. If U is a macrohaplogroup or superhaplogroup, then U8 is considered a haplogroup and U8b a subclade. However, it is now known that K is part of U8b; so you have a haplogroup below a subclade on the mtDNA tree. Everything below K is again a subclade. (As usual, pardon me for using K examples; it’s what I know.)
Traditionally, subclades were introduced only in peer-reviewed scientific papers. Each author made up his or her own rules. When I wanted to introduce two new subclades – K1a10 and K1a11 – in 2007, I wrote an article for the Journal of Genetic Genealogy. That method still works, but increasingly new subclades are first named on the PhyloTree at http://www.phylotree.org/ . Most mtDNA scientists support and use the PhyloTree.
The original paper introducing the PhyloTree in 2008 – http://onlinelibrary.wiley.com/doi/10.1002/humu.20921/pdf – said: “a relatively stable (set of) mutation(s) must be shared by at least three complete sequences before assigning it the haplogroup status.” (Oops! Even they use “haplogroups” here.) But then it lists exceptions. Some one-sequence subclades were “grandfathered” in. They also discussed subclades with “preliminary status,” but I don’t see that being used recently.
Most importantly, I’ve found that the PhyloTree will accept a subclade with only two sequences if the defining mutation is in the coding region and both sequences include additional coding-region mutations. The sequences to support the subclade must not be identical. Heteroplasmies [mutations in process; more about these in a future posting] are not sufficient to define or support a subclade, even if they are in the coding region. Rare or non-recurrent HVR [Hyper Variable Region] mutations may be acceptable as definers or supporters. For example, 497T in HVR2 is the sole defining mutation for subclade K1a, which includes about 60% of K. But if the HVR mutations are used as supporters, three sequences would probably be required.
Examples of even recurrent mutations being used as sole subclade definers include 16270T and 16222T for subclades K2b1a and K2b1a1. But in those cases, many examples had to be found before they were allowed to be definers. I’ve proposed 16223T as a definer for a K1a1b1a”1”, but have been unsuccessful so far. That mutation is not recurrent in K, but in mtDNA in general it is.
Some very recurrent mutations are used to head unlabeled branches on the tree; 195C heads a major branch that includes several subclades under K1a.
However, I’ve seen many branches, even with good defining mutations, where a large number of individual sequences only differ on recurrent HVR mutations such as the 523 insertions and deletions, 16093C, 146C, 152C, 195C, etc.; those don’t qualify for subclade labels and don’t show up on the PhyloTree.
Subclades may in some cases be defined or supported by insertions, deletions, and back mutations. My own K1c2a is defined solely by 15944d. [The letter d after the location number means a deletion has occurred at that location.]
It is very important that the sequences – full sequences only – used to define a subclade be published, usually in the GenBank database. FTDNA customers have used direct submissions, usually Ian Logan’s program, or have agreed to have their results transmitted with a scientific paper – so far that has been the Behar et al. (2012b) RSRS paper from last April. Almost one third of the mtDNA sequences on GenBank are now from FTDNA customers.
Some recent exceptions to direct GenBank publication are sequences from the 1000 Genomes Project, but even for those the underlying complete genomes are in GenBank. A group of Chinese scientists have now published two papers (Zheng et al. 2011 and Zheng et al. 2012) extracting the mtDNA results. The PhyloTree has used the first set of Chinese and Japanese sequences and will almost certainly use the second set that has European and other examples.
The moral of the story is that everyone with mtDNA FMS results should make sure their results get to GenBank one way or another. Don’t be deterred if you have exact matches there; the number of sequences and the geographical origins are of interest to some – including me. However, please don’t submit identical sequences of siblings or mothers and children.”
Administrator, mtDNA Haplogroup K and U8 Projects
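The acceptance rules Bill describes can be written down as a rough checklist. The Python sketch below is only my reading of his summary, not PhyloTree's actual procedure, and every name in it (FullSequence, may_qualify, the sample mutations other than 15944d) is hypothetical. It assumes that "additional coding-region mutations" means coding-region mutations other than the definer itself, and it leaves out the judgment calls he mentions, such as how many examples a recurrent mutation needs before it is allowed as a definer.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class FullSequence:
    """A published full mitochondrial sequence (hypothetical structure)."""
    name: str
    coding: FrozenSet[str]          # coding-region mutations, e.g. {"15944d"}
    hvr: FrozenSet[str]             # hypervariable-region mutations
    # Heteroplasmies are stored separately so they never count as support.
    heteroplasmies: FrozenSet[str] = frozenset()


def may_qualify(definer: str, definer_in_coding: bool,
                sequences: List[FullSequence]) -> bool:
    """Rough check of whether a mutation could define a subclade."""
    # Only sequences carrying the definer as a settled mutation count.
    carriers = [s for s in sequences if definer in (s.coding | s.hvr)]
    # Identical sequences do not count as separate supporters.
    distinct = {(s.coding, s.hvr) for s in carriers}
    if len(distinct) >= 3:
        return True
    if len(distinct) == 2 and definer_in_coding:
        # Assumption: each sequence needs a coding mutation beyond the definer.
        return all(len(coding - {definer}) >= 1 for coding, _ in distinct)
    return False


a = FullSequence("A", frozenset({"15944d", "9055A"}), frozenset())
b = FullSequence("B", frozenset({"15944d", "10550G"}), frozenset({"16093C"}))
print(may_qualify("15944d", True, [a, b]))
```

With the two made-up sequences above, the check passes because the definer is in the coding region and each sequence has one more coding-region mutation. Two identical sequences, or two sequences supported only by HVR mutations, would fail and would need a third distinct sequence.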
Mitochondrial DNA includes three hypervariable regions (HVR1, HVR2 and HVR3) where, as the name implies, mutations happen much more often than in the rest of the mitochondrial genome, known as the coding region. HVR1 is tested in Family Tree DNA’s mtDNA test, HVR2 and HVR3 are tested in the mtDNAPlus test, and the coding region is tested in the FMS (Full Mitochondrial Sequence) test. Other commercial labs generally test only some combination of the HVR regions: 1, 1+2 or 1-3. If medical conditions connected with the mitochondria are present, they are normally found in the coding region, which is why coding region records connected with testers are not found in a public database.
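For readers who like to see the notation spelled out, here is a small Python sketch that reads a mutation label such as 497T or 15944d and reports which region it falls in. The position boundaries are my assumption, based on the ranges commonly quoted for Family Tree DNA results (HVR1 as 16001-16569, HVR2 and HVR3 together as 1-574, coding region as 575-16000); they are not stated in Bill's reply, and other labs may draw the lines differently. The function names are hypothetical, and the parser only handles the simple forms used in this post, not insertions.

```python
import re

# Assumed region boundaries; not taken from the text above.
HVR1 = range(16001, 16570)      # positions 16001-16569
HVR2_3 = range(1, 575)          # positions 1-574 (HVR2 and HVR3 together)
CODING = range(575, 16001)      # positions 575-16000

# A position followed by a base letter, or by "d" for a deletion.
_MUTATION = re.compile(r"^(\d+)([ACGTd])$")


def parse_mutation(label):
    """Split a label like '497T' or '15944d' into (position, change)."""
    match = _MUTATION.match(label)
    if match is None:
        raise ValueError("unsupported mutation label: " + label)
    position = int(match.group(1))
    change = match.group(2)
    return position, ("deletion" if change == "d" else change)


def region_of(position):
    """Name the region a position falls in, under the assumed boundaries."""
    if position in HVR1:
        return "HVR1"
    if position in HVR2_3:
        return "HVR2/HVR3"
    if position in CODING:
        return "coding"
    raise ValueError("position out of range: " + str(position))


for label in ["497T", "16270T", "15944d"]:
    position, change = parse_mutation(label)
    print(label, region_of(position), change)
```

Under these assumed boundaries, 497T lands in HVR2/HVR3, 16270T in HVR1, and 15944d in the coding region as a deletion, which matches how those mutations are described in the reply above.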