On April 25th, DNA Day and Arbor Day, Family Tree DNA updated and released their 2014 Y haplotree, created in partnership with the Genographic Project. This has been a massive project, expanding the tree from about 850 SNPs to over 6200, of which about 1200 are “terminal,” meaning the end of a branch, while the rest have proven to be duplicates, that is, synonymous SNPs that mark those same branches.
If you’re a newbie, this is a good point to read about what a haplogroup is and about the new Y naming convention, which replaces the well-known group names like R1b1a2 with the SNP shorthand version of the same haplogroup name, R-M269. From this time forward, the haplogroups will be known by their SNP names and the longhand version is obsolete, although you will still see it in older documents, articles and papers. In fact, this entire tree has been made possible by SNP testing by both academic organizations and consumers. To understand the difference between regular STR marker testing and SNP testing, click here.
I’ve divided this article into two parts. The first part is the “what did they do and why” part and the second is the “what does it mean to you” portion.
This tree update has been widely anticipated for some time now. We knew that Family Tree DNA was calibrating the tree in partnership with the Genographic project, but we didn’t know what else would be included until the tree was released.
What Did Family Tree DNA Do, and Why?
Janine Cloud, the liaison at Family Tree DNA for Project Administrators, has provided some information as to the big picture.
“First, we’re committed to the next iteration of the tree and it will be more comprehensive, but we’re going to be really careful about the data we use from other sources. It HAS to be from raw data, not interpreted data. Second, I’ve italicized what I think is really the mission statement for all the work that’s been done on this tree and that will be done in the future.”
Janine interviewed Elliott Greenspan of Family Tree DNA about the new tree, and here are some of the salient points from that discussion.
“This year we’re committing to launching another tree. This tree will be more comprehensive, utilizing data from external sources: known Sanger data, as well as data such as Big Y, and if we have direct access to the raw data to make the proof (from large companies, such as the Chromo2) or a publication, or something of that nature. That is our intention that it be added into the data.
We’re definitely committed to update at least once per year. Our intention is to use data from other sources, as well as any SNPs we can, but it must be well-vetted. NGS and SNP technology inherently has errors. You must curate for those errors otherwise you’re just putting slop out to customers. There are some SNPs that may bind to the X chromosome that you didn’t know. There are some low coverages that you didn’t know.
With technology such as this you’re able to overcome the urge to test only what you’re likely to be positive for, and instead use the shotgun method and test everything. This allows us to make the discovery that SNPs are not nearly as stable as we thought, and they have a larger potential use in that sense.
Not only does the raw data need to be vetted but it needs to make sense. Using Geno 2.0, I only accepted samples that had the highest call rate, not just because it was the best quality but because it was the most data. I don’t want to be looking at data where I’m missing potential information A, or I may become confused by potential information B. That is something that will bog us down. When you’re looking at large data sets, I’d much rather throw out 20% of them because they’re going to take 90% of the time than to do my best to get 1 extra SNP on the tree or 1 extra branch modified, that is not worth all of our time and effort. What is, is figuring out what the broader scope of people are, because that is how you break down origins. Figuring one single branch for one group of three people is not truly interesting until it’s 50 people, because 50 people is a population. Three people may be a family unit. You have to have enough people to determine relevance. That’s why using large datasets and using complete datasets are very, very important.
I want it to be the most accurate tree it can be, but I also want it to be interesting. That’s the key. Historical relevance is what we’re to discover. Anthropological relevance. It’s not just who has the largest tree, it’s who can make the most sense out of what you have is important.”
Thanks to both Janine and Elliott for providing this information.
What is Provided in the Update?
The genetic genealogy community was hopeful that the new 2014 tree would be comprehensive, meaning that it would include not only the Genographic SNPs, but ones from Walk the Y, perhaps some Chromo2, Full Genomes results and the Big Y. Perhaps we were being overly optimistic, especially given the huge influx of new SNPs, the SNP tsunami as we call it, over the past few months. Family Tree DNA clearly had to put a stake in the sand and draw the line someplace. So, what is actually included, how did they select the SNPs for the new tree and how does this integrate with the Genographic information? This information was provided by Family Tree DNA.
Family Tree DNA created the 2014 Y-DNA Haplotree in partnership with the National Geographic Genographic Project using the proprietary GenoChip. Launched publicly in late 2012, the chip tests approximately 10,000 Y-DNA SNPs that had not, at the time, been phylogenetically classified.
The team used the first 50,000 male samples with the highest quality results to determine SNP positions. Using only tests with the highest possible “call rate” meant more available data, since those samples had the highest percentage of SNPs that produced results, or “calls.”
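As a rough illustration of the call rate idea described above, here is a minimal Python sketch. The sample format, the 0.75 threshold and the function names are assumptions made up for this example; Family Tree DNA has not published the actual selection criteria or code.

```python
def call_rate(calls):
    """Fraction of tested SNPs that produced a result (a "call").

    `calls` is a list with one entry per SNP tested; None means no call.
    """
    if not calls:
        return 0.0
    called = sum(1 for c in calls if c is not None)
    return called / len(calls)


def select_high_quality(samples, min_call_rate, limit):
    """Return the first `limit` sample IDs, in submission order,
    whose call rate meets the (hypothetical) threshold."""
    selected = []
    for sample_id, calls in samples:
        if call_rate(calls) >= min_call_rate:
            selected.append(sample_id)
            if len(selected) == limit:
                break
    return selected


# Toy data: four kits, four SNPs each (invented for illustration).
samples = [
    ("kit-1", ["A", "G", None, "T"]),   # 3 of 4 SNPs called
    ("kit-2", ["A", "G", "C", "T"]),    # 4 of 4 SNPs called
    ("kit-3", [None, None, "C", "T"]),  # 2 of 4 SNPs called
    ("kit-4", ["A", "G", "C", "T"]),    # 4 of 4 SNPs called
]

print(select_high_quality(samples, min_call_rate=0.75, limit=2))
```

The point of the sketch is simply that a higher call rate means fewer gaps per sample, so a tree built from those samples rests on more data.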
In some cases, SNPs that were on the 2010 Y-DNA Haplotree didn’t work well on the GenoChip, so the team used Sanger sequencing on anonymous samples to test those SNPs and to confirm ambiguous locations.
For example, if it wasn’t clear if a clade was a brother (parallel) clade, or a downstream clade, they tested for it.
The scope of the project did not include going farther than SNPs currently on the GenoChip in order to base the tree on the most data available at the time, with the cutoff for inclusion being about November of 2013.
Where data were clearly missing or underrepresented, the team curated additional data from the chip where it was available in later samples. For example, there were very few Haplogroup M samples in the original dataset of 50,000, so to ensure coverage, the team went through eligible Geno 2.0 samples submitted after November, 2013, to pull additional Haplogroup M data. That additional research was not necessary on, for example, the robust Haplogroup R dataset, for which they had a significant number of samples.
Family Tree DNA, again in partnership with the Genographic Project, is committed to releasing at least one update to the tree this year. The next iteration will be more comprehensive, including data from external sources such as known Sanger data, Big Y testing, and publications. If the team gets direct access to raw data from other large companies’ tests, then that information will be included as well. We are also committed to at least one update per year in the future.
Known SNPs will not intentionally be renamed. Their original names will be used since they represent the original discoverers of the SNP. If there are two names, one will be chosen to be displayed and the additional name will be available in the additional data, but the team is taking care not to make synonymous SNPs seem as if they are two separate SNPs. Some examples of that may exist initially, but as more SNPs are vetted, and as the team learns more, those examples will be removed.
In addition, positions or markers within STRs, as they are discovered, or large insertion/deletion events inside homopolymers, potentially may also be curated from additional data because the event cannot accurately be proven. A homopolymer is a sequence of identical bases, such as AAAAAAAAA or TTTTTTTTT. In such cases it’s impossible to tell which of the bases the insertion is, or if/where one was deleted. With technology such as Next Generation Sequencing, trying to get SNPs in regions such as STRs or homopolymers doesn’t make sense because we’re discovering non-ambiguous SNPs that define the same branches, so we can use the non-ambiguous SNPs instead.
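The homopolymer problem is easy to see with a toy example. The following Python sketch, using made-up sequences and hypothetical helper names, deletes one base at each position in a region and counts how many different results come out.

```python
def delete_at(seq, i):
    """Return seq with the base at index i removed."""
    return seq[:i] + seq[i + 1:]


def distinct_deletion_results(seq, start, end):
    """Set of sequences produced by deleting one base at each
    index in the half-open range [start, end)."""
    return {delete_at(seq, i) for i in range(start, end)}


# A run of nine A's flanked by other bases (invented sequence).
homopolymer = "GC" + "A" * 9 + "TG"
print(len(distinct_deletion_results(homopolymer, 2, 11)))

# For comparison, a region where every base differs from its neighbor.
mixed = "ACGT"
print(len(distinct_deletion_results(mixed, 0, 4)))
```

Every deletion inside the run of A's yields the identical sequence, so the reads alone cannot say which base was lost. In the mixed region, each deletion leaves a different sequence and the event can be pinned to a position.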
Some SNPs from the 2010 tree have been intentionally removed. In some cases, those were SNPs for which the team never saw a positive result, so while it may be a legitimate SNP, even haplogroup defining, it was outside of the current scope of the tree. In other cases, the SNP was found in so many locations that it could cause the orientation of the tree to be drawn in more than one way. If the SNP could legitimately be positioned in more than one haplogroup, the team deemed that SNP to not be haplogroup defining, but rather a high polymorphic location.
To that end, SNPs no longer have .1, .2, or .3 designations. For example, J-L147.1 is simply J-L147, and I-L147.2 is simply I-L147. Those SNPs are positioned in the same place, but back-end programming will assign the appropriate haplogroup using other available information such as additional SNPs tested or haplogroup origins listed. If other SNPs have been tested and can unambiguously prove the location of the multi-locus SNP for the sample, then that data is used. If not, matching haplogroup origin information is used.
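The placement rule just described can be sketched in a few lines of Python. This is only an illustration of the stated logic, not Family Tree DNA's back-end code: the lookup tables, the function name and the idea of passing in a single "origin" haplogroup are all assumptions for the example. M304 and M170 are used here as the familiar defining SNPs of haplogroups J and I.

```python
# SNP L147 occurs in both J and I (the article's example).
MULTI_LOCUS = {"L147": ("J", "I")}

# Hypothetical lookup of other tested SNPs that prove a major haplogroup.
DEFINING = {"M304": "J", "M170": "I"}


def place_multi_locus(snp, positive_snps, origin_haplogroup=None):
    """Choose the haplogroup for a multi-locus SNP.

    1. If other positive SNPs unambiguously prove one candidate, use it.
    2. Otherwise fall back to the matching haplogroup origin, if any.
    3. Otherwise return None (cannot be placed).
    """
    candidates = set(MULTI_LOCUS[snp])
    proven = {DEFINING[s] for s in positive_snps if s in DEFINING} & candidates
    if len(proven) == 1:
        return proven.pop() + "-" + snp
    if origin_haplogroup in candidates:
        return origin_haplogroup + "-" + snp
    return None


print(place_multi_locus("L147", {"M304"}))           # other SNPs decide
print(place_multi_locus("L147", set(), origin_haplogroup="I"))  # fallback
```

Both calls display the same SNP name, L147, but attach it to a different haplogroup depending on the evidence available for that sample.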
We will also move to shorthand haplogroup designations exclusively. Since we’re committing to at least one iteration of the tree per year, using longhand that could change with each update would be too confusing. For example, Haplogroup O used to have three branches: O1, O2, and O3. A SNP was discovered that combined O1 and O2, so they became O1a and O1b.
There are over 1200 branches on the 2014 Y Haplogroup tree, as compared to about 400 on the 2010 tree. Those branches contain over 6200 SNPs, so we’ve chosen to display select SNPs as “active” with an adjacent “More” button to show the synonymous SNPs if you choose.
In addition to the Family Tree DNA updates, any sample tested with the Genographic Project’s Geno 2.0 DNA Ancestry Kit, then transferred to FTDNA will automatically be re-synched on the Geno side. The Genographic Project is currently integrating the new data into their system and will announce on their website when the process is complete in the coming weeks. At that time, all Geno 2.0 participants’ results will be updated accordingly and will be accessible via the Genographic Project website.
- Created in partnership with National Geographic’s Genographic Project
- Used GenoChip containing ~10,000 previously unclassified Y-SNPs
- Some of those SNPs came from Walk Through the Y and the 1000 Genome Project
- Used first 50,000 high-quality male Geno 2.0 samples
- Verified positions from 2010 YCC by Sanger sequencing additional anonymous samples
- Filled in data on rare haplogroups using later Geno 2.0 samples
- Expanded from approximately 400 to over 1200 terminal branches
- Increased from around 850 SNPs to over 6200 SNPs
- Cut-off date for inclusion for most haplogroups was November 2013
Total number of SNPs broken down by haplogroup
- Existing customers receive free update to predictions and confirmed branches based on existing SNP test results.
- Haplogroup badge updated if new terminal branch is available
- Updated haplotree design displays new SNPs and branches for your haplogroup
- Branch names now listed in shorthand using terminal SNPs
- For SNPs with more than one name, in most cases the original name for SNP was used, with synonymous SNPs listed when you click “More…”
- No longer using SNP names with .1, .2, .3 suffixes. Back-end programming will place SNP in correct haplogroup using available data.
- SNPs recommended for additional testing are pre-populated in the cart for your convenience. Just click to remove those you don’t want to test.
- SNPs recommended for additional testing are based on 37-marker haplogroup origins data where possible, 25- or 12-marker data where 37 markers weren’t available.
- Once you’ve tested additional SNPs, that information will be used to automatically recommend additional SNPs for you if they’re available.
- If you remove those prepopulated SNPs from the cart, but want to re-add them, just refresh your page or close the page and return.
- Only one SNP per branch can be ordered at one time – synonymous SNPs can possibly be ordered from the Advanced Orders section on the Upgrade Order page.
- Tests taken have moved to the bottom of the haplogroup page.
- Group Administrator Pages will have longhand removed.
- At least one update to the tree to be released this year.
- Update will include: data from Big Y, relevant publications, other companies’ tests from raw data.
- We’ll set up a system for those who have tested with other big data companies to contribute their raw data file to future versions of the tree.
- We’re committed to releasing at least one update per year.
- The Genographic Project is currently integrating the new data into their system and will announce on their website when the process is complete in the coming weeks. At that time, all Geno 2.0 participants’ results will be updated accordingly and accessible via the Genographic Project website.
What Does This Mean to You?
On your welcome page, your badges are listed. Your badge previously would have included the longhand form of the haplogroup, such as R1b1a2, but now it shows R-M269.
Please note that badges are not yet showing on all participants’ pages. If yours aren’t yet showing, click on the Haplotree and SNP page under the YDNA option on the blue options bar, where your more detailed information is shown, below.
Your Haplogroup Name
Your haplogroup is now noted only as the SNP designation, R-M269, not the older longhand names.
Haplogroup R is a huge haplogroup, so you’ll need to scroll down to see your confirmed or predicted haplogroup, shown in green below.
The redesigned haplotree page includes an option to order SNPs downstream of your confirmed or predicted haplogroup. This refines your haplogroup and helps isolate your branch on the tree. You may or may not want to do this. In some cases, this does help your genealogy, especially in cases where you’re dealing with haplogroup R. For the most part, haplogroups are more historical in nature. For example, they will help you determine whether your ancestors are Native American, African, Anglo Saxon or maybe Viking. Haplogroups help us reach back before the advent of surnames.
The new page shows which SNPs are available for you to order from the SNPs on the tree today, shown above, in blue to the right of the SNP branch.
SNPs not on the Tree
Not all known SNPs are on the tree. Like I said, a line in the sand had to be drawn. There are SNPs, many recently discovered, that are not on the tree.
To put this in perspective, the new tree incorporates 6200 SNPs (up from 850), but the Big Y “pool” of known SNPs against which Family Tree DNA is comparing those results was 36,562 when the first results were initially released at the end of February.
If you have taken advanced SNP testing, such as the Walk the Y, the Big Y, or tested individual SNPs, your terminal SNP may not be on the tree, which means that your terminal SNP shown on your page, such as R-M269 above, MAY NOT BE ACCURATE in light of that testing. Why? Because these newly discovered SNPs are not yet on the tree. This only affects people who have done advanced testing which means it does not affect most people.
You can order relevant SNPs for your haplogroup on the tree by clicking on the “Add” button beside the SNP.
You can order SNPs not on the tree by clicking on the “Advanced Order Form” link available at the bottom of the haplotree page.
If you’re not sure of what you want to do, or why, you might want to touch base with your project administrators. Depending on your testing goal, it might be much more advantageous, both scientifically and financially, for you to take either the Geno2 test or the Big Y.
At this point, in light of some of the issues with the new release, I would suggest maybe holding tight for a bit in terms of ordering new SNPs unless you’re positive that your haplogroup is correct and that the SNP selection you want to order would actually be beneficial to you.
Words of Caution
There are some bugs in this massive update. You might want to check your haplogroup assignment to be sure it is reflected accurately based on any SNP testing you have had done, with the exception, of course, of the very advanced tests mentioned above.
If you discover something that is inaccurate or questionable, please notify Family Tree DNA. This is especially relevant for project administrators who are familiar with family groups and know that people who are in the same surname group should share a common base haplogroup, although some people who have taken further SNP testing will be shown with a downstream haplogroup, further down that particular branch of the tree.
What kind of result might you find suspicious or questionable? For example, if in your surname project, your matching surname cousins are all listed at R-M269 and you were too previously, but now you’re suddenly in a different haplogroup, like E, there is clearly an error.
Any suspected or confirmed errors should be reported to Family Tree DNA.
They have made it very easy by providing a “Feedback” button on the top of the page and there is a “Y tree” option in the dropdown box.
For administrators providing reports that involve more than one participant, please send them to Groups@familytreedna.com and include the kit numbers, the participants’ names and the nature of the issue.
Family Tree DNA provides a free webinar that can be viewed about the 2014 Y Tree release. You can see all of the webinars that are archived and available for viewing at: https://www.familytreedna.com/learn/ftdna/webinars/
The Genographic Project is in the process of updating to the same tree so their results can be synchronized with the 2014 tree. A date for this has not yet been released.
Family Tree DNA has committed to at least one more update this year.
I know that this update was massive and required extensive reprogramming that affected almost every aspect of their webpage. If you think about it, nearly every page had to be updated, from the main page to the order page. The tree is the backbone of everything. I want to thank the combined Family Tree DNA and Genographic team for their efforts and Bennett Greenspan for making sure this did happen, just as he committed to do in November at the last conference.
Like everyone else, I want everything NOW, not tomorrow. We’re all passionate about this hobby – although I think it is more of a life mission for many – and surpassed hobby status long ago.
I know there are issues with the tree and they frustrate me, like everyone else. Those issues will be resolved. Family Tree DNA is actively working on reported issues and many have already been fixed.
There is some amount of disappointment in the genetic genealogy community about the SNPs not included on the tree, especially the SNPs recently discovered in advanced tests like the Big Y. Other trees, like the ISOGG tree, do in fact reflect many of these newly discovered SNPs.
There are a couple of major differences. First, ISOGG has a virtual army of volunteers who are focused on maintaining this tree. We are all very lucky that they do, and that Alice Fairhurst coordinates this effort and has done so now for many years. I would be lost without the ISOGG tree.
However, when a change is made to the ISOGG tree, and there have been thousands of changes, adds and moves over the years, nothing else is affected. No one’s personal page, no one’s personal tree, no projects, no maps, no matches and no order pages. ISOGG has no “responsibility” to anyone – in other words – it’s widely known and accepted that they are a volunteer organization without clients.
Family Tree DNA, on the other hand, has half a million (or so) paying customers. Tree changes have a huge ripple effect there, not only on their customers’ personal pages, but on their entire website, projects, support and orders. A change at Family Tree DNA is much more significant than one on the ISOGG page. On top of that, they don’t have the same army of volunteers, and they have to rely on the raw science, not interpretation, as they said in the information they provided. A tree update at Family Tree DNA is a very different animal than updating a stand-alone tree, especially considering their collaboration with various scientific organizations, including the National Geographic Society.
I commend Family Tree DNA for this update and thank them for both the update and the educational materials. I’m also glad to see that they do indeed rely only on science, not interpretation. Frustrating to the genetic genealogist in me? Sure. But in the long run, it’s worth it to be sure the results are accurate.
Could this release have been smoother and more accurate? Certainly. Hopefully this is the big speed bump and future releases will be much more graceful. It’s easy to see why there aren’t any other companies providing this type of comprehensive testing. It’s gone from an easy 12 marker “do we match” scenario to the forefront of pioneering population genetics. And all within a decade. It’s amazing that any company can keep up.