Data Mining and Screen Scraping – Right or Wrong?

Data mining, also known as screen scraping, has been occurring in the genetic genealogy community for some time now. I had hoped that peer pressure and time would take care of the issue and it would resolve itself, but it has not.

This topic has become something of a pink elephant in the middle of the living room. People are whispering. Some people have adopted the pink elephant as a pet.  Some are trying to ignore it.  A few haven’t noticed and some just kind of accept its presence since no one seems to be able to convince it to leave.  But no one has yet walked in, taken a look, and said “Hey, there’s a pink elephant in the living room.”

Well folks, there’s a pink elephant in the living room and we’re going to talk about it today.

What is Screen Scraping and Data Mining?

Screen scraping and data mining occur when robots (generally) visit certain sites online on a scheduled basis and harvest the data residing there. The harvested data may then be used privately, or may be reformatted and massaged and then displayed differently on a public site. No notification is given and no permission is requested to use the data.

Screen scraping and data mining is different than one person doing a Google search for information about their genealogy or their ancestor utilizing online resources. Screen scraping or data mining is the capturing or targeting of entire data bases. Mining implies searching for just one type of data – like maybe a certain haplogroup – and scraping implies taking everything viewable.  Best case, it’s Google spidering sites for indexing.  Worst case, they are thieves in the night. Like many things, the technology can be used for bad or good.
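For readers who have never seen a robot at work, here is a minimal sketch in Python of the distinction drawn above. It is only an illustration, not the code any particular site uses. The page contents, kit numbers and surnames are invented, and the `scrape` and `mine` function names are my own. Nothing is fetched from the internet; the "page" is just a string.

```python
from html.parser import HTMLParser

# Hypothetical project page. Kit numbers, surnames and SNP values are made up.
PAGE = """
<table>
<tr><th>Kit</th><th>Surname</th><th>SNP</th></tr>
<tr><td>K0001</td><td>Example</td><td>L294+</td></tr>
<tr><td>K0002</td><td>Sample</td><td>M242+</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collects the text of every table cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._in_cell and self._row is not None:
            self._row[-1] += data.strip()


def scrape(page):
    """Scraping: take everything viewable in the table."""
    parser = TableScraper()
    parser.feed(page)
    parser.close()
    header, *records = parser.rows
    return [dict(zip(header, record)) for record in records]


def mine(records, snp):
    """Mining: keep only the records matching one type of data."""
    return [record for record in records if record["SNP"] == snp]


records = scrape(PAGE)
print(records)
print(mine(records, "M242+"))
```

Run on a schedule against every project page, a few dozen lines like these are all it takes to copy an entire data base.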

Let me give you an example which illustrates how I initially discovered this issue.

I administer several projects at Family Tree DNA – both surname and haplogroup. One of my surname project members e-mailed me one day in March of 2013 with a jovial note about their “15 minutes of fame.” The essence of this is that they had just transferred their National Geographic results to Family Tree DNA and the next day, found their results with their new SNPs they were so proud of on a website in Russia. Because of the quality of the site and how quickly those results appeared, they presumed that this was a collaborative research effort between Family Tree DNA and/or National Geographic and the Russian site.

I took a look, and sure enough, he was right. There, big as life, were his DNA SNPs, his surname and his kit number, on an unauthorized site. I knew that the website was not collaborative, but I confirmed with Family Tree DNA just to be sure. They were aware of it but could not do anything about the screen scraping of the DNA projects.

At that point, my project member attempted to contact the Russian site owner to have the information removed and to ask how they obtained it in the first place.  There was no name on the semargl site, nor e-mail, only a form.  I also attempted to do so and even involved two intermediaries who also attempted to facilitate contact. The site in question had clearly advertised a haplogroup project so I reached out to those project admins to facilitate contact as well. The website owner never replied. However, two days later, the web site owner did remove the surname from the site, but all of the harvested information remains. You can see it for yourself today. Kit number 24162.

[Images: two screenshots of the semargl site]

In fact, this site has scraped and reconstructed almost all (if not all) of the haplogroup projects at Family Tree DNA. You can see them here.

I conducted a little experiment not long ago wherein I timed how long it took after results were posted at Family Tree DNA for them to appear on this site and it was generally between 24 and 48 hours.  I repeated that this week with my husband’s results, which were already displayed on the semargl website (without his permission), and sure enough, his Big Y results that are displayed on the haplogroup project page at Family Tree DNA were immediately updated on the semargl site with his new SNP information.

One of my haplogroup projects has SNPs “turned off” but the participants’ data and SNPs are harvested anyway, because the robots don’t just scrape haplogroup projects, but surname projects as well. And almost everyone who joins haplogroup projects joins surname projects.

Have you noticed that the response times at Family Tree DNA are sometimes slow? Well, when robots are searching every project for new results on a daily basis, it does indeed tax their systems.  We know the semargl site uses robots, but there may be more sites we aren’t aware of doing the same thing.

Remember when Ysearch was taken offline entirely and the following message was displayed?

“YSearch is currently unavailable due to an increase in abusive data mining by automated scripts. The site will be unavailable for an extended period of indeterminate duration.”

Well, the robots are at it again.

Ironically, one of the people I spoke to about this used the fact that YSearch was down to justify why the semargl site was so important – because they duplicated the YSearch info.

How Can They Do This?

The bottom of every single project page at Family Tree DNA displays copyright verbiage, as follows:

[Image: Family Tree DNA copyright notice]

This clearly includes the contents.  In the context of Russia, where the semargl website is located, this doesn’t matter, but perhaps Judy Russell will tackle the topic of project content ownership relative to the US in one of her columns.

I assure you that I have never been contacted, yet much of my projects’ content is shown on the semargl site: complete haplogroup project data, along with many participants from surname projects, specifically those with SNP tests.

If you have had any SNP testing at Family Tree DNA, your results are probably included in this data base.  If you want to see if your kit number is there, you can search by kit number, and just for yuks, try searching by surname too: http://www.semargl.me/en/dna/ydna/search/

When participants join projects, they can clearly expect their results to be shown on the associated project page at Family Tree DNA. In fact, that’s the whole point of genetic genealogy, to be able to find your paternal line, for example, or your genetic cousins. Sharing and comparing.

Do participants expect that their data will be scraped and displayed on a website in Russia, with or without their surname, and entirely without their permission or knowledge?  Many surname project administrators are probably entirely unaware of this themselves.

The answer to “how can they do that?” is that they are in Russia and they are not bound by any US copyright or any other US laws. If you have any doubt about that, think Edward Snowden and why he is in Russia. In fact, the only thing that binds them is a sense of ethics, what’s right and wrong, internet courtesy and a colloquial definition of fair use. As you might have noticed, none of these things are legally binding, especially not on people in Russia.

Ethics speaks for itself. This site obviously sees nothing wrong with taking or harvesting the data from elsewhere without notification or permission.  They also see nothing wrong with retaining, utilizing and displaying data even when the owner has asked for it to be removed.  Internet courtesy or netiquette would indicate that you would ask permission or, minimally, inform the individuals that you are using their data. And fair use would indicate that you credit the individuals for their work and that you would source your data. Given that individuals didn’t grant permission for their information to be included, one should at least have the opportunity for their data to be removed, if randomly discovered, but that isn’t the case.  This certainly explains why they were trying to remain anonymous a year ago, and refused contact.

As one participant said to me, “Just because the technology door can’t be locked to prevent this type of activity, does that make taking something that doesn’t belong to you any less of a theft?”

In discussions surrounding this topic, a highly respected project administrator said the following:

“I do not think any person today should have a reasonable expectation that anything displayed on the Internet can be expected not to be copied because it is public info – fair game to a third party as long as the fair use doctrine is observed. If I copied that particular person’s results to my website as an example of something it comes under fair use – as long as I indicate the source for the info. But when someone copies large numbers of items or fails to show the source of the info, it is no longer fair use.”

This isn’t the only situation like this, although it is by far the most blatant.

Recently, I saw a draft of a “paper” where an entire haplogroup project was “analyzed” using a third party tool without knowledge or involvement of the administrators, nor appropriate credit given for their project. Clearly, without their efforts in the project, the analysis paper could not have been written because the project would not exist. While that paper involves one person, this website involves many, is very public, and now the owner(s) have also formed and are part of a company. The website also solicits donations.

[Image: semargl site sidebar]

You’ll notice that YFull is advertised on their website, under the donate button. The ISOGG Wiki provides the following information about YFull.

“YFull.com was founded in 2013 and focuses on the interpretation of Y-chromosome sequences. The main aim of the project is to provide services for the analysis of full Y-chromosome raw data (BAM) files and convenient visualization. The data is collected and analysed and newly discovered single-nucleotide polymorphisms (SNPs) are placed on an experimental Y-tree. Haplogroup and thematic projects are offered. The YFull service is located in Moscow, Russia.”

The YFull product analysis deliverables have been covered by two bloggers here and here.

The YFull team is listed in the Wiki article as follows:

  • Vadim Urasin (aka Wertner): active participant of the DNA genealogical community since 2008, the developer of robots to collect Y-data from public sources, “Y-predictor” developer, FTDNA group administrator, developer of the Y-series SNPs (for R1a, J2b, R2a, Q, O etc).
  • Roman Sychev (aka Maximus Centurion): active participant of the DNA genealogical community since 2006, since 2007 as moderator dna-forums.org (aka Maximus), molgen.org, FTDNA group administrator, developer of the Z-series SNPs (for R1a, I1, J2b), developer of the Y-series SNPs (for R1a, I, R2a, J2b, Q, O etc).
  • Vladimir Tagankin (aka Semargl): active participant of the DNA genealogical community, the DNA database “semargl.me” developer, FTDNA group administrator and co-administrator, developer of the Z-series SNPs (for R1a, I, J2b), developer of the Y-series SNPs (for R1a, J2b, R2a, Q, O etc).

You’ll note that the team includes the person credited with developing the robots that collect Y-data from public sources and the developer of the semargl.me database.  Also please note that all 3 are listed as group administrators at Family Tree DNA, which, given the circumstances, seems to be in violation of the Project Administrator Guidelines.  I wonder if Family Tree DNA is aware of this and if project members understand what their project administrator is doing with their DNA results.

I happened to be working with the results of someone who is in the R1a1a and Subclades project.  I noticed a familiar name among the project co-administrators at the bottom of the list.

[Image: project co-administrator list]

I have not checked other projects.

This is particularly unfortunate, because the haplogroup projects have been key players in terms of encouraging SNP testing, sorting through results and defining key haplogroup subgroups.  Project participants join haplogroup projects to further science and research.  They expect the administrators to work with the results, but working with and analyzing the results is not the same as reproducing the results on another site.  Furthermore, being both a project administrator and the same person whose robots are scraping the FTDNA project sites to reproduce elsewhere without permission seems like a wolf masquerading as a shepherd to gain access to lambs.

Of course, the fully sequenced Y results are not posted to the public pages of projects, so they cannot be harvested in full by robots like the individual SNP results, including Nat Geo transfers and Walk the Y results. Enter the free analysis provided by YFull to individuals who receive their fully sequenced Y results from either the Big Y at Family Tree DNA or the Full Y from FullGenomes.

When I first looked, there were no terms and conditions, but there are terms and conditions on the YFull site today, at the bottom of the main page.

[Image: YFull terms and conditions]

4.2 We may disclose to third parties, and/or use in our Services, “Aggregated Genetic and Self-Reported Information”, which is Genetic and Self-Reported Information that has been stripped of Registration Information and combined with data from a number of other users sufficient to minimize the possibility of exposing individual-level information while still providing scientific evidence. If you have given consent for your Genetic and Self-Reported Information to be used in YFull.com Research, we may include such information in Aggregated Genetic and Self-Reported Information intended to be published in peer-reviewed scientific journals. We emphasize that Aggregated Genetic and Self-Reported Information will be stripped of names, physical addresses, email addresses, and any other Personal Information that may be used to identify you as a unique individual.

4.3 We may disclose to third parties – Yfull.com. Partners or service providers (e.g. our contracted genotyping laboratory or credit card processors) use and/or store the information in order to provide you with YFull.com’s Services.

Is Screen Scraping and Data Mining Wrong?

There are two sides to this argument.

At the time of the initial discovery, a year ago, with my project participant, based on my communications with some project administrators, it was clear that at least some of the admins knew of this activity and were supportive.

Why?

Because they perceived that the data was “public domain” and the resultant semargl website and “knowledge base,” as they phrased it, justified the means. These sentiments were expressed by multiple project administrators, separately, although now I realize that at least one of these people is a project co-administrator with the semargl owner, whose identity I didn’t know at that time. Their interpretation of public domain is incorrect, because public domain refers to works “whose intellectual property rights have expired” and this is clearly not the case. What they probably meant was that since the data has been posted publicly, from their perspective, the data at that point is freely available to use.

In some circumstances, that might at least partially be true.  But since this site is in Russia, they are not bound by any laws here and they clearly did not choose to abide by any of the generally accepted netiquette standards.

Having said that, the semargl site is wonderfully done and extremely informative, which is why genetic genealogists have embraced it.  Many probably don’t realize how the data has been obtained.  Combine that with the mindset of “there’s nothing we can do about it anyway,” since they are in Russia, and many have simply resigned themselves to the fact that the situation is what it is.  Besides that, bringing this topic up causes you to be extremely unpopular in some camps.

Semargl vs Family Tree DNA

This is probably a good time to define how the semargl site is different than the Family Tree DNA site.  Family Tree DNA is focused on genealogy, which includes surnames and oldest ancestor information.  They also support and encourage testing of markers that reveal deeper ancestry, before the advent of surnames, which falls into the anthropological timeframe.  After all, that’s still the history of our ancestors, revealed in their DNA – but before surnames.  At Family Tree DNA, people join themselves to projects and they give permission when testing for comparison of their data.  If they so choose, they can remove their data from projects, make their information entirely private or remove it entirely from the data base.  In other words, they own and control their data.

The semargl site does not focus on genealogy and is generally focused on haplogroup definitions (by both SNP and STR markers) and population movement and settlement relative to haplogroup subgroups.  In that way, it’s more of a research support endeavor.  It’s not genealogy focused although it has the potential of helping genealogists understand the genesis of their ancestors before surnames.  Having said that, they do have marker matching capabilities but without surnames displayed.

Of course, we know how they obtain their data, screen scraping the Family Tree DNA and YSearch sites, and that people whose data is displayed have not given permission and may be entirely unaware their data appears on that site.

Let’s look at an example of what semargl has done with DNA information. I’ll use haplogroup Q since it is a smaller haplogroup than others and one I’m familiar with.

They have divided haplogroup Q into 30 groupings based on SNPs. Each of these branches has its own map. The Q1b-Ashkenazi map is shown below with associated kit numbers to the right under the ad.

[Image: semargl Q1b-Ashkenazi map]

The map above is by SNP, not by STR or individual match like the project and personal maps at Family Tree DNA.

This is followed by a table of STR marker haplotypes, by kit number, which is exactly like the data at Family Tree DNA.

[Image: semargl haplogroup Q STR table]

STR table in color.

[Image: semargl haplogroup Q STR table in color]

Each haplogroup by SNP has a distribution map. This is not by subgroup, but by main haplogroup. Haplogroup Q is shown below.

[Image: semargl haplogroup Q distribution map]

You can also select any SNP to view. I’ve selected L294 at random. Notice that the results are noted as from FTDNA (with kit number) or YSearch (with user ID) and those are the only sources given, so the origin of the data is very clear.

[Image: semargl results for SNP L294]

You can also inquire by country. Albania has primarily three haplogroups found.

[Image: semargl haplogroup results for Albania]

You can query by haplogroup placing results on maps and other types of queries as well.

The owner(s) of this site have done a prodigious amount of work, and it is all very useful, and very well done. It’s actually too bad this isn’t a collaborative work, because I think it would have been very well accepted under different conditions.  Most people would have gladly given permission had they been asked.

Unfortunately, the method used to obtain the data generates a lot of unanswered and pretty ugly questions.

Begging the Questions

Some people feel that if this site were to disappear, that the genetic genealogy community as a whole would suffer. It is the only location where aggregated SNP data is processed and analyzed in this manner.

They also feel that because the individual information has been publicly posted elsewhere, in this case, in Family Tree DNA projects, that this site, and others who might be doing the same thing, have done nothing wrong, unethical or inappropriate.

Others feel that this screen scraping/data harvesting of Family Tree DNA project data is an ethics violation in the strongest terms and that if this activity had been undertaken by someone within the US or within reach of the US via copyright treaty, it would be prosecutable under copyright laws.

Originally, many felt that since these people were “just genetic genealogists” trying to understand results, focused on just a few haplogroups in which they were personally interested, and since they weren’t selling anything, that there was no conflict of interest. However, the site has clearly grown exponentially and evolved over time: robots have been created and utilized, donations are being solicited, and now a company, formed in 2013, is involved as well.  And now we discover that the site owner is a project administrator at Family Tree DNA, giving them unprecedented access to DNA results beyond what is available publicly.  One might suggest that is a conflict of interest.  In defense of Family Tree DNA, a year ago it was almost impossible to discern the name of the person behind the semargl site and I was never able to obtain an e-mail address, even though it was clear that the intermediaries were communicating with him.  People on the internet use pseudonyms and screen names regularly, as you can note in the Wiki entry about the YFull team.

Clearly, the people responsible for the robots that disrupted, and continue to disrupt, the Family Tree DNA site and that took YSearch down have to be aware of that, and they haven’t stopped their activities. Was it these robots? I don’t know for sure, but semargl has obviously been utilizing robots, screen scraping the Family Tree DNA site for more than a year based on when my participant’s data was harvested.  In fact, they are still utilizing robots, because my husband’s Big Y SNPs that were posted at Family Tree DNA (a subset of his total SNPs) one day this week were displayed on the semargl site the following day.  Furthermore, one of the YFull principals is credited with developing these robots and is also noted as being a project administrator.  Project administrators are supposed to be trusted stewards of the DNA of their participants.

Because the provider’s services were disrupted, one can’t really argue that no one has been damaged. Family Tree DNA has clearly been and continues to be impacted, their customers have been inconvenienced.  Family Tree DNA spends money on bandwidth and staff to deal with these issues.

Some would assert that the expectations and rights of those whose results have been pirated, harvested or stolen, depending on your perspective, have been violated because the results have been used without permission of the participant. Others would say that there has been no harm because the results are anonymized (currently) on the semargl site with the surname removed from the display and they were retrieved from a publicly available source.  However, the surname is still stored in the semargl system, because you can query by surname and all kit numbers with that surname are returned.  With some creative Googling, you can uncover the surname relatively easily given just the kit number on the semargl site, but I know of no way you could discover the actual identity of an individual unless that person was the only person in the world with that particular surname, or if they had themselves posted their name and kit number together on a public venue.
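To make the point about "removed from the display" concrete, here is a small Python sketch. It is purely hypothetical: I have no knowledge of how the semargl system is actually built, the records are invented, and the `display` and `search_by_surname` names are mine. It simply shows that hiding a field from view is not the same as not storing it.

```python
# Hypothetical stored records. Kit numbers and surnames are invented.
STORED = [
    {"kit": "K0001", "surname": "Example", "snp": "L294+"},
    {"kit": "K0002", "surname": "Example", "snp": "M242+"},
    {"kit": "K0003", "surname": "Sample", "snp": "L294+"},
]


def display(record):
    """What a visitor sees: the surname is left out of the view."""
    return {key: value for key, value in record.items() if key != "surname"}


def search_by_surname(surname):
    """The search still runs against the stored surname field."""
    wanted = surname.lower()
    return [display(r) for r in STORED if r["surname"].lower() == wanted]


# No surname appears in the output, yet the query only works
# because the surname was kept.
print(search_by_surname("Example"))
```

Every kit returned by a surname search is, by definition, tied to that surname, whether or not the surname is printed beside it.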

If participants refuse to join projects in the future, or withdraw from projects because they don’t want their data to be harvested by sites like this, then genetic genealogy as a whole has been damaged.  Then so have you and I as genetic genealogists.

Let me quote my husband, who never gets ruffled, this evening, when I showed him his results.  He knew nothing about any of this before I sat him down at my computer and showed him his results, first at Family Tree DNA, where he was excited to see his extended haplogroup and Big Y Novel Variants, and then on the semargl site.  I wish I had taken a picture of the shocked look on his face.  Here’s what he had to say when he saw his results on the semargl site:

“What the <bleep>?  How did they get there?”

Pause for a moment while the reality soaked in.

“Get them off there.  They have no right.”

I really can’t quote anymore of what he said and remain family friendly, but suffice it to say the word appalled was used several times, along with horrified, and when I showed him that the semargl data base owner was a co-administrator of his haplogroup project, he shifted to utterly livid and suggested that Family Tree DNA remove him and whoever added him as a co-administrator as well for complicity.  In fact, his “suggestions” went even further, to removing all of the project admins as co-conspirators, because they obviously knew what their co-admin was doing and did nothing to protect his data, as a project member.  In fact, some of them may well be involved in the exploitation of his data.

His uncomfortable questions continued, like “How can that be?” and “Does he have the rest of my data too?”  Suffice it to say my husband is utterly furious, and when I told him that I can’t have those results removed from the Russian site, and why, it got even worse.  Maybe it’s a good thing they are in Russia.

On the other hand, others argue that many benefit from the semargl site and that the people who join projects and whose results are publicly posted had no reason to expect that their results would not be harvested or utilized by someone, at some time.  Try explaining that to my husband, who, when he saw the ‘donate’ button right beside his results on the semargl site, said to me, “How is that right, they’re getting money for something they stole?  My DNA results, that I paid for.  My God, they had my results posted on their site before I even had a chance to look at them at Family Tree DNA.”

One DNA project clearly states on their main project page that once you post your information on the internet, it can never be entirely “removed.”  Of course, DNA testing for genealogy without sharing is entirely pointless.  Where is the line between sharing, when an individual intentionally joins a project, posting their own data, and theft?

The only difference between cousin Johnny discovering that you descend from the same genealogy/genetic line based on your surname project at Family Tree DNA and Russian data miners harvesting the data is the order of magnitude, intention and methodology. As someone else has pointed out, not dissimilar from the difference between consensual sex and rape.

Another perspective is that because we are here and they are in Russia, there’s nothing we can do about it, anyway, so why sweat it and just enjoy the benefits.  Right? Besides, as has been pointed out to me, we don’t want participants to become upset and withdraw from projects or not join, so we won’t discuss the elephant in the room.  What pink elephant?  I don’t see a pink elephant.  And we certainly, most certainly, do NOT want to have to answer any of those uncomfortable questions my husband asked me this evening.  After all, their DNA is already out there and there’s nothing to be done about it now, so don’t make waves.

“Doing something” now to prevent harvesting, assuming there was anything that could be done, is like closing the barn door after the cow has already left, or, in this case, the pink elephant.

This fatalism sounds a whole lot like the thought process involved in how slavery was justified along with gender and race discrimination and Hitler’s genocidal atrocities.  I’m not equating data mining to those things, but I am saying that the thought process that “we can’t do anything about it” or “everyone else is doing it,” so we accept it and even participate can be a deadly, slippery slope.  And if it’s wrong, ignoring, tolerating or accepting it certainly doesn’t make it right.

Let me share a parting thought from my husband, after he calmed down enough to speak coherently.

“I feel unclean.  I feel like I’ve been violated.  My DNA has been kidnapped and I’ve been genetically raped.  It’s wrong.  It’s just wrong, in so many ways.”

So….you tell me…

Harvested, pirated or stolen? Right or wrong? Ethical or unethical? Malicious or not? Theft? Plagiarism? Does the end justify the means? Perfectly fine?

I shared with you my husband’s reaction. He’s not involved in this field like I am.  He’s much more of the typical “end consumer.”  I’m not telling you what I think. You decide for yourself.

Note:  I thought that participants would be able to view the comments entered in the “other” field.  Since you can’t, here’s what they say:

  • Inevitable
  • Wrong, unethical, non consensual, and exploitive
  • Thank you for letting us know about this.
  • It’s criminal
  • FTDNA should learn from the semargl site, then it would be more useful and legal

57 thoughts on “Data Mining and Screen Scraping – Right or Wrong?”

  1. Somehow the Russians missed my brother’s results. I was very happy (for once!) to not find my surname in a DNA database!

  2. I’m not applicable in the Y – way, since I have no paternal info on that, BUT, I’m unhappy about it because ftdna.com has such a good reputation, that it can be affected by this. What do the folks at ftdna.com have to say with THEIR confidential info being ‘used’ without any approvals???

  3. Well! I’m insulted! My brother’s isn’t on there LOL. He’s G haplo with no matches and just tested positive for CTS9737+. I had a problem with a “gentleman” a couple of years ago lifting my and mother’s names, kit numbers, and X chromosomes from gedmatch and putting it all on his website. It was a major battle to get him to remove it. One of our FTDNA matches found it and said he didn’t know we were Jewish. Told him I didn’t know that either. What a surprise! Then, he got my male line and female line mixed up, and said I had a problem with being Jewish. So, he googles my maiden name, Kruta, and found a couple who died in the Holocaust in Germany with that same unusual surname and continued to chew me out because my Jewish relatives were murdered, and I wouldn’t own up to being Jewish. First, there are two unrelated Kruta lines. One is Jewish and J2 haplo, the other is my step-great grandfather Vaclav Kruta, from whom I received my last name, whose line is R1a1. Gedmatch stepped in and told him to remove our names, kits and X chromosomes.

    Denise E. Kruta

  4. Just what is their purpose for stealing all this information I wonder. I had just told my husband about it after I finished reading your article, and said to him, “to put a simple explanation forward, what are they trying to do, separate the Jews from the Gentiles.” Then I read the comments and especially the one from Denise E. Kruta. I find the whole thing rather sinister.

    • I’d have to say this is a rather paranoid viewpoint. If you swap out how your results were sorted with mine, your comment could be rewritten as “they’re separating the northern Europeans from the southern Europeans, how sinister.”

      To anyone who has used the site before, it’s quite clear that the data was used for further development and refinement of the Y-tree, with tools for producing TMRCA tree graphics and the ability to search for the closest related kits to my own who happened to be in a different public project from me.

      Beyond this, should we be angry at other third parties that utilize this public data? Should FTDNA send a cease and desist letter to the Swedish blogger who made a pie chart with my kit number and surname? Or are we only allowed to be mad at those nefarious ruskies with a “strange” interest in SNP discovery?

      • Filling out the paperwork for DNA testing with a major educational institution and they don’t consider it paranoid. They go out of their way to make sure I understand the potential consequences of testing including privacy issues, not only for me but for other family members who aren’t testing.

      • Joe D.,

        Then perhaps this is an issue with how FTDNA handles the public projects. It should be clear to anyone joining that by joining your DNA results will no longer be hidden behind a secure password protected site but can be seen by anyone, and could potentially be used either legally or illegally.

        Kyle

  5. It is my opinion that FGC, YFull and the Semargl website are filling the vacuum left by FTDNA’s inability and/or refusal to satisfy its customers’ wants and needs.

    FTDNA is still using the 2008 yDNA tree for heaven’s sake.
    You know, you really are not being treated as a true participant in “citizen science” when FTDNA withholds the yDNA tree that they derived from your DNA and with the help of your finances.

    Many of the FTDNA administrators have publicly expressed frustration with all the excuses FTDNA have given for not releasing the 2014 yDNA tree. I also noticed that these administrators also post on other discussion boards perhaps to avoid the overbearing censorship on the FTDNA discussion board. What does that tell you?

    It is utter nonsense to expect your DNA data to be public only on the internet in the U.S. and only on the FTDNA website.

    The bottom line is that if you don’t want your DNA data to be public everywhere on planet Earth, you should ask for it to be withdrawn from “public” FTDNA projects.

    • Word!
      I was shocked to read this post; something so prejudiced from an otherwise respected blogger …

  6. By the time I read your post and tried to access the site, the page was empty and read “Error 500.”

    • Now it says this.
      “Sorry, but some “people” want to see my site closed.
      I can assure you that the cost of maintaining the site hundreds of times greater than the hypothetical profit. Sad. :(”

      The entire point is that there is a right way to do something and a wrong way. It’s not about wanting a site closed or not – but wanting the rights of the people whose results are being utilized to be respected. If the only way that site can exist is to operate unethically, then yes, it should be closed. I don’t think that is the only option though, although it’s the option they chose.

      • Shame many who have an issue with this article did not bother to read the whole thing. You make excellent points and the smart thing would have been for them to work with Family Tree DNA and not use robots to grab the information. Sorry to see you getting so much grief over it.

  7. When I tried to access the site to see for myself, I received 2 error messages: a 400 and a 500. The third attempt got this message in red on a black background.

    “Sorry, but some “people” want to see my site closed. I can assure you that the cost of maintaining the site hundreds of times greater than the hypothetical profit. Sad. :(”

    What is “sad” is when people steal from others and make excuses to justify it.

  8. Dear Roberta,

    Success! When I went to the link you provided this is what I found, “Sorry, but some “people” want to see my site closed.”

    When I clicked on the hyperlink above, it displayed your blog with this same article. Who knew you had multiple personalities?!! Clearly you are the “some people” the person is referring to. I say good job!

    Richard Hay

    • I would like to add the following concerning the paragraph, “If participants refuse to join projects in the future, or withdraw from projects because they don’t want their data to be harvested by sites like this, then genetic genealogy as a whole has been damaged. Then so have you and I as genetic genealogists.”

      I whole-heartedly believe that this was an attempt by some incredibly talented people with quite possibly decent intentions. I absolutely agree with your assessment of the potential damage to the community’s reputation at large. You have mentioned in past articles that family members you have offered to help through their DNA discovery have flat-out refused to be swabbed. I think that was likely due to this very issue of uncertainty about how their genetic material will be used and protected. Your husband’s very recent response to this issue further supports this idea.

      So, those are the personal touches that help explain the issue to the masses so we may fully understand the true dilemma. It goes a long way to build that sense of understanding of why it should matter to each of us as individuals. But what about places like FTDNA, Ancestry.com, YSearch, or 23 and Me? Did they not provide the funding, technology and expertise to make this all possible? I would argue not only did they do just that but we also paid the premium to participate in their services. There was a reasonable expectation our information would be protected and I believe they have all done a reasonable job. However, we were still duped into trusting that other credentialed experts would respect those protections, as well. Thank you for bringing this issue to someone who had no prior knowledge.

      There is no excuse for these foreign nationals’ greed and ill-intention to circumvent the copyright laws or the ethics of their trade. After all, as you have pointed out, these foreign nationals were administrators of the sites holding the info, under terms they presumably had to agree to. Russia, Mexico and Canada all have ethics courses as part of their degree-granting institutions. To assume these three foreign nationals didn’t have this same level of education, or to assume they shouldn’t be required to use business and trade ethics equivalent to or better than their own nation’s, would be negligent on the part of each of us. They definitely knew of the FTDNA standards when dealing with our data based in the U.S., and the perpetrators acted irresponsibly and are criminally negligent in violating business and trade ethics.

      There are domestic U.S. and international agencies that do target this type of crime. Yes, I believe intellectual property theft and then turning around that same stolen intellectual and technological property to a second group of customers is theft on multiple levels. In this case it’s on an international scale. Yes, I believe the property, in the form of the data I paid to be supplied to me, was also personally stolen. I just couldn’t access the website to view it. Yes, I believe this is also theft of protected health information under GINA. These people are criminals. They and those criminally-minded like them must be fought to protect our individual rights like you are doing now.

      Then the question comes to you, Roberta, and your fellow FTDNA administrators with this first hand information… Will you, the truly ethical genetic genealogists, please turn over your first hand information supporting this theft and breach of trust to the FBI so that we can get some resolution? I would love to see these people’s credentials and ability to operate again revoked completely.

      I will support you in any way I can.

      Richard Hay

      • Isn’t it funny that FTDNA proudly advertises that they test worldwide?
        So do they still want the business of “foreign nationals?”
        It doesn’t sound like it.
        I am sure FGC, YFull and YSeq will be happy to offer an alternative.

      • @Joe on April 6, 2014 at 10:03 pm said:

        Isn’t it funny that FTDNA proudly advertises that they test worldwide?
        So do they still want the business of “foreign nationals?”
        It doesn’t sound like it.
        I am sure FGC, YFull and YSeq will be happy to offer an alternative.

        No, not at all. Our DNA comes from the various parts of the world from which it originated. Why wouldn’t FTDNA continue to test both within the USA and worldwide as it currently offers? If there was somehow a misunderstanding of my terminology of foreign nationals, I sincerely apologize for the misunderstanding. It is not my intention to vilify any non-US citizen of any nation.

        “Foreign nationals” in my prior response was used to distinguish those named individuals who are cyber-mining our DNA results and circumventing US domestic law from those foreign nationals who did not. I am accusing these specific foreign nationals of purposefully setting about the illegal capture and use of our DNA data for their own personal/professional gain, and equally of illegally funding their research by using this same stolen data without consent from either the data owner or the individual who tested their DNA to begin with.

        These named individuals used their foreign citizenships and physical presences in a foreign nation to do the said circumventing. This is as opposed to any other individual, US national or other, who had not attempted to do this. It is an important point to be made that none of the named individuals were foreign or dual citizens that also reside in the US. If there are foreign national genetic genealogists working in the US then they presumably continue to operate ethically like their US national colleagues. So far they have successfully worked together as colleagues in support of continued genetic genealogy efforts. I see no reason to suspect they would not continue to do so. I also suspect that with a significant quantity of foreign talent assisting in some capacity at FTDNA that since they have not been revealed to be stealing our data that they, in fact, continue to be operating ethically. Therefore, if we are all on the same side of this ethical debacle why would FTDNA want to change course based on something so very arbitrary? Have they indicated a desire to do so?

        I have no idea what you are alluding to when you state that other sites may offer some other alternative? What alternative are you suggesting they might offer? Improved services that FTDNA hasn’t yet made available? If so, have you attempted to contact FTDNA with your suggestion(s) of improved services? I would be interested to hear your interactions with FTDNA. It’s always good to learn more and know more. I enjoy being well-informed.

  9. Semargl clearly have a desire to take on the hackers. This is the kind of thing they warn us about, but usually expect it to be done by government agencies. They will get under someone’s skin for certain and can quite possibly expect a sustained and vicious attack, which I for one will look forward to.
    Personally, if they had asked for my information for their research I would have been happy to provide it, but I cannot accept it being taken by stealth. I can only wish the 7 plagues upon the company and all associated with it.
    What a low life pack of scum.

  10. Joe: ………overbearing censorship on the FTDNA discussion board….
    Me doth think you are exaggerating.

    Using my real name to test at all 3 companies, was a really big mistake. What was I thinking??.

  11. @caith: You wrote: “Me doth think you are exaggerating.”

    Nope it’s true.

    The first weekend of the BigY release resulted in the FTDNA discussion board being put on “full moderation,” meaning all posts had to be pre-screened before being posted.

    Many BigY customers were justifiably upset that their results were not ready as promised on Feb.28 and some of them were threatened via private message by the moderator.

    This week I posted on the FTDNA board about a possible DNA Day sale this April 25 which is traditionally held each year. *Poof* Vanished.

    It feels to me like a “Putin-like” character is running things at the FTDNA board.

    • If FTDNA had any sense, they would create a site like Semargl, or enter into an agreement with them (this can simply be done by telling those who join public projects that their information can be shared; if somebody doesn’t wish this – don’t join), because without this further analysis many people’s y-DNA results are almost useless.

  12. Great.
    All data in the cited site is grabbed from OPEN genealogical DNA projects.
    O-P-E-N!
    I’ve tested more than 30 persons with my own money. All these guys and I are waiting for more exhaustive results and interpretations.
    If FTDNA can’t do this job, we say our thanks to the amateur enthusiasts who can and who do it for free.
    If you and your husband are afraid to divulge your markers’ values – just change your privacy level. That’s all.
    Shame on you!

  13. The privacy level is user-set and changeable. Copying of data (like temperatures from a weather site, or player stats from popular sports) is not the same as copying creative works, and your political and nationalistic snarks are particularly disappointing from someone whose blog I have enjoyed so long. You have attacked a person who has out of their own time and treasure created a uniquely valuable source of analysis of R1a data in particular, and now the site has been taken down over a poorly reasoned histrionic conspiracy theory, rife with personal and political undertones. I am very disappointed.

  14. Getting the Semargl site closed is robbing the community of one of our most valuable tools for YDNA-research! In our work as project admins we make use of the Semargl site almost every day to help project members get more out of their YDNA results. Semargl is giving us far, far more useful YDNA tools than FTDNA can provide. And when it comes to YFULL, which is also criticized in the blog, their work is crucial to the community’s ability to thoroughly analyze all the BigY results and identify new SNPs. There’s no way we could do that on our own using Excel…

    • I hope Semargl re-opens his website. I would gladly give him written consent to use “my data.” Also, I would not be surprised if some of the current FTDNA administrators decide “they ain’t gonna work on Maggie’s farm no more.”

  15. Why are projects “open” anyway? Shouldn’t they just be for people who join and administer them? Is there a way to make them secure? Seems like they could make them accessible only to FTDNA customers who have logged in AND are members of the group. That doesn’t mean that a member won’t share others’ information, but maybe would cut most of this scraping out.

    • People are clever. If there is a way to circumvent a security protocol it’s only a matter of time before it gets discovered and then the next generation of cyber-mining bot is upgraded to perform the more advanced protocol. And so it continues.

      But yes, there are steps that can be taken by consulting a cyber security technology firm. That quite plainly equals $$$$.

      How much more are you willing to pay for the more secure service? And even if you can does that mean that many customers wouldn’t jump ship to a more cost competitive alternative? I know I can’t afford the $695.00 BigY offer. If they added the necessary additional cost for the added security services I would be pushed out of the opportunity for quite a while longer until I could raise the funds needed.

      • Using YDNA in your genealogical research requires that you compare your results with as many people as possible. To have “secret” projects would make YDNA research useless, and we don’t want that do we?

  16. I don’t think it’s so much the what as the how it was done. Without the knowledge or consent of the individuals whose data was being used or the company who has invested so much into making it available for use. Does the end justify the means? Other 3rd party developers rely on participants to opt in, which is a slower method, but doesn’t leave people feeling violated and surprised/shocked to find their data on a site they were unaware of.

  17. I’ve just asked my wife if she considers her pregnancy test her intellectual property.
    She was a bit confused.
    Then I asked her why she doesn’t post her pregnancy test on Facebook.
    And she was confused again.
    Honey, but if you publish it, would you mind if someone includes it in a worldwide pregnancy test database?
    - That’s why I’m not posting my test on Facebook, – was her answer.

    If someone spent his money on a DNA test, I guess he wants to find answers to his questions.
    Then he joined a project and made his results public. Why? I guess he wants to know more.
    Why then would participating in another project and getting even more knowledge hurt his feelings? It’s just another step in the same direction. And it’s not the Russia->Putin->KGB route I mean.

    FTDNA is just a for-profit company, taking our money for services provided. It doesn’t meet deadlines, doesn’t provide a satisfactory level of interpretation of results, uses an outdated Y-tree and an outdated SNP database, and sold me 205 “novel variants,” 150 of which already have names and are included in the ISOGG database.

    Semargl, YFull, Gedmatch are giving us a chance to have a “second opinion” for free. To sweeten the bitter aftertaste of FTDNA experience, if you wish.
    And killing these “competitors” is a bad idea. It will hardly increase FTDNA revenue, but it will obviously leave the company face to face with unsatisfied customers.

  18. People are worried about the security of their data on DNA sites. If this is happening now imagine what will be happening in 10-20 years plus. What could the data be used for then if in the public domain?

    Also many I speak to are worried about uploading to FTDNA in particular which doesn’t help me or others. There are no cases of DNA data being lifted from 23&Me & Ancestry keeps it ‘hidden’ – perhaps this is wise given it can be used elsewhere ?

    • This is the Y chromosome we’re discussing, not autosomal or the coding region of mitochondrial. I’m not so much worried about what can be “done” with this. I’m concerned about widescale data scraping of our data without any kind of permission of the owner. The Y STR results are visible for Ancestry as well. It’s just that they don’t offer SNP testing which is the majority of the data being harvested at Family Tree DNA.

      • No, it’s your data. I said in the article, I WISH he had done that. The site is a great tool – the issue I have is the lack of consent and the wide scale taking of the data without consent. With consent, I have no issue at all with the semargl site.

  19. May I post this to a genealogy Facebook page? The info is excellent.


  20. Roberta, what are your thoughts on Chris Morley’s Y-Tree work? I certainly gave no consent to Chris allowing him to use my Geno 2.0 results, but it’s there, kit number and all. In a previous post you endorse the Big-Y add on that submits your data to his site for comparison of discovered SNPs with this aggregate data. The disclaimer on the Y-Tree site tells third parties interested in for-profit use to contact Chris to discuss LICENSING. Does this trouble you as well?

    Thanks,
    Kyle

    • Hi Kyle,

      Maybe I’m looking in the wrong place. When I go to Chris Morley’s site, I see this link and nothing more that shows individual results. http://ytree.morleydna.com/

      I don’t use his site and I have never received complaints from anyone about their data appearing there unauthorized. Could you provide a link to what you are referencing?

      Thanks,

      • Roberta,

        Thanks for the reply, here’s the content I’m referring to:

        http://ytree.morleydna.com/ExperimentalGenoPhylogeny20140207.pdf

        Don’t take this as a complaint from me with regards to the work by Chris, I support it and find it helpful. However, I did not provide express written consent for my data to be used, and the PDF itself states that his experimental phylogeny was based on data gleaned from public FTDNA projects:

        “The Geno 2.0 data used to create this report comes from public FTDNA project Y-SNP reports, not the Geno 2.0 raw data files. Each Geno 2.0 kit featured in this report has been transferred to FTDNA and then added to a public FTDNA project.”

        So he either used his own script or used the data collected by semargl to generate an updated Y-tree superior to FTDNA’s outdated listings.

        Regards,
        Kyle

      • I don’t see kit numbers displayed or people’s data revealed. I think it’s one thing to do analysis and another to screen scrape and reproduce and display someone’s data. Now as far as how the data was gathered – if someone is running scripts against the vendor’s data base and it interferes with the performance of the site for paying customers, then that would be a problem. My issue with the semargl site was that individual people’s data was taken and reproduced on that site without their permission. I think we as a community do a lot of analysis – which is different than reproduction of data. There is probably a fine line someplace between the two, and I’m not sure where that line is or should be. But I clearly think the semargl site is over that line.

      • Joe – These things need to be discussed. Discussion does not mean under fire. Perhaps you gave consent to have your data on the semargl site. My husband didn’t, my project member didn’t and the data was not removed when asked. Neither the participant nor I received a reply when we asked how the data got to that site. Vastly different.

      • Roberta,

        The kit numbers are embedded after each terminal SNP, with a direct link to the public FTDNA project page it was pulled from. The document contains SNPs that I tested positive for including my terminal SNP, which is inherently my data. Now as you indicated in your post, semargl does analysis, and in your comments you say you feel he crossed the fine line by duplicating raw results. In your opinion, does that line get uncrossed if he obfuscates the data he possesses on individuals while only presenting derived results?

        Kyle

  21. Roberta, Is anyone violating FTDNA’s “intellectual property” rights if they voluntarily consent to having their own raw data displayed on Semargl’s site? I paid FTDNA for the testing and I can do whatever I wish with my results. Am I wrong? Does FTDNA claim any right to prohibit me from doing so?

  22. FTDNA vs Another Free Site (sorry to hide the name from vandals obsessed by the conspiracy theory)

    Known SNPs: 36,551 (FTDNA) – 52,112 (Another Free Site). I.e. about 43% more.

    Novel Variants: 318 (FTDNA) – 80 (Another Free Site). I.e. about 4 times more precisely.

    Closed because of this blog, the Semargl site allowed you to receive haplotypes at any reasonable genetic distance. You really need this, for example, if you decide to build even the most primitive phylogenetic tree (if you know what I mean).

    One neophyte here is so excited: “Somehow the Russians missed my brother’s results. I was very happy (for once!) to not find my surname in a DNA database!”
    Sure, it’s just your brother’s privacy level choice. Nothing else. It is not necessary to destroy something. Just try to understand.

    • Should we expect similar warnings about FGC soon? After all, their main tester is BGI… formerly known as the Beijing Genomics Institute and at least partially funded by the Chinese government, and if Wikipedia is to be believed, a “recognized state agency.” The previous post by Roberta on YFull appeared to strongly recommend against sending your BAM file to this site, with plenty of warnings about Russia. What about China, which is perhaps the world’s largest violator of intellectual property? While YFull would get your BAM file, BGI would have your BAM and have your DNA sample. The January 2014 New Yorker article also seemed to hint at work by BGI in embryo “selection,” which some may consider unethical.

      • Where you choose to test is your business and your personal decision. To the best of my knowledge there are no Chinese sites screen scraping project data from FTDNA. And if there are, they aren’t reproducing it on a website someplace. As long as I’m a project administrator, and they are screen scraping my projects and my participants’ DNA, including my family’s, then that’s my business. The reason I don’t recommend YFull is because it’s at least some of the same people engaged in the screen scraping activity AND they are in Russia, beyond the reach of any legal remedies.

  23. Thanks Roberta. As a writer, I have pretty strong opinions about intellectual property and plagiarism. This is not “public” data people. It can only be accessed once someone has agreed to certain terms and conditions, which are quite likely being violated.

    I hope FTDNA won’t be forced to hide their data like 23andMe does. This sort of theft only gives ammunition to Ancestry when they claim privacy issues surrounding chromosome matching tools.

    And Joe, did you give permission to this Semargl person? I certainly did not!

    • Marci, This “Semargl person” of all the people in the world just happens to be one of my closest yDNA matches.

      We have a shared interest in the development of the yDNA tree in the interest of science. I view my consent to make my data available for research as a charitable donation for the benefit of all. If you don’t want to give your consent, that is your prerogative.

      Good luck finding the answers you seek within a genealogical time frame using FTDNA alone.

  24. Data Mining and Screen Scraping – Right! This is OPEN. This is liberty.
    I went to Google and typed ftdna Vorontsov (my family) – 1000 responses with my Y-DNA. Google is stealing your intellectual property?

  25. In this discussion, people imply that Semargl is driven by ”the sake of creating his own database”, by the urge to disclose people’s data or some other evil plan. People who say that have clearly never used the tools at the Semargl site and do not have the knowledge about what in-depth genealogical YDNA-research is and what it requires.

    And you clearly don’t know Vladimir (Semargl) either. He and his fellow YDNA-specialists do this in their free time out of love for YDNA and genealogy. They know that the FTDNA match lists and projects aren’t the most effective tools to make the most out of YDNA for genealogy, so they use their skills to give the community the tools we need. They are worth all the respect in the world!

    If the Semargl site is down forever, heaven forbid, genealogical YDNA-research will be thrown years back in time and progress will slow to a trickle. That is NOT to the benefit of anyone using YDNA in their genealogical research, and I hope that everyone can look beyond the data-sharing scare and nationality issues and come together for the sake of YDNA-genealogy.

  26. I guess it’s OK that my money for my Y-result could help me and other researchers/payers to know possibly more details about all things concerning our genealogy. So, It’s fine, no problem. I hope the problem will be resolved as soon as possible.

  27. I have found semargl’s site to be a godsend and am truly saddened to see it closed. It has provided me with crucial information that the public projects on FTDNA have not or will not, and has been bookmarked on my computer and accessed many more times than FTDNA. If FTDNA has a problem with the service provided by semargl they obviously don’t understand how semargl complements their services. As for YFull, it is a wonderful service. By the way, I have been communicating with semargl for several years now. He is most professional, having gone as far as to place my personal results in a tree format, so that I could see who are my nearest matches, simply because I requested it. This is the type of personal (and free) service that one would never receive from a company such as FTDNA. After all, what good is your y-dna data if it is in isolation?

  28. Well, let’s think a bit.
    My kit # is 311318. According to FTDNA there are no matches with me at levels above 12 STRs. According to “that Russian site” THERE IS. We are at 8 steps on 67 markers. The point is that FTDNA shows matches only up to 7 steps on 67 markers. And if we don’t know that we are at 8 steps on 67 markers, we don’t buy upgrades to 111. Thus FTDNA ought to thank that guy instead of banning him.

  29. The Semargl site should be open again. It has been very useful and I am so sorry for them because of this fuss. They do excellent work!

Comments are closed.