Hackers and Your Genetic Secrets

Did that title get your attention?  Well, it was meant to, just like the title of this NBC article, “Scientists Demonstrate How Hackers Could Unlock Your Genetic Secrets.”  Or how about this one in the New York Times, “Web Hunt for DNA Sequences Leaves Privacy Compromised?”  Sensationalism sells... and so does fear.  Don’t panic, the sky is not falling.

Several people have forwarded me links to articles about this, expressing concern.  Most didn’t really understand what was going on, and since “family tree databases” were mentioned in the first paragraph, it frightened them.

This article says that the “security cracking trick relies on the availability of genetic information linked to surnames in a variety of public family-tree databases.”  Well, that’s sort of true, but not exactly.  The issue is not the family tree databases.  It’s the fact that the researchers in the 1000 Genomes Project, while keeping the names of those 1000 people “anonymous,” provided enough information that these scientific researchers, not hackers, were able to data mine the 1000 Genomes participants’ information to determine their Y-DNA marker values.  They then compared those haplotypes (marker values) against databases such as Ysearch and Sorenson, just like we do.  And yes, they likely had matches to several surnames, like most of us do.
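For the curious, the matching step can be sketched in a few lines of Python.  This is only an illustration under my own assumptions: the marker names are real Y-STR markers, but the values, the surnames, the function names and the `max_distance` cutoff are all invented, and a simple count of differing markers stands in for whatever comparison the researchers or the databases actually use.

```python
# Hypothetical illustration: compare a Y-STR haplotype against a tiny,
# made-up "surname database" by counting the markers that differ.

def marker_mismatches(a, b):
    """Count markers, among those both haplotypes report, whose values differ."""
    shared = a.keys() & b.keys()
    return sum(1 for marker in shared if a[marker] != b[marker])


def candidate_surnames(query, database, max_distance=1):
    """Return (surname, mismatches) pairs within max_distance, closest first."""
    hits = []
    for surname, haplotype in database.items():
        distance = marker_mismatches(query, haplotype)
        if distance <= max_distance:
            hits.append((surname, distance))
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))


# All surnames and marker values below are invented for the example.
DATABASE = {
    "Alder": {"DYS393": 13, "DYS390": 25, "DYS19": 14, "DYS391": 11},
    "Birch": {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS391": 11},
    "Cedar": {"DYS393": 14, "DYS390": 22, "DYS19": 15, "DYS391": 10},
}
QUERY = {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS391": 11}

matches = candidate_surnames(QUERY, DATABASE)
print(matches)  # [('Birch', 0), ('Alder', 1)]
```

Notice that even this toy example returns two possible surnames, not one, which is exactly what most of us see in our own match lists.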

Individuals in the 1000 Genomes Project signed a release indicating that they knew their data was to be used publicly, and that although their identity would not be revealed, researchers could not guarantee their privacy.  The 1000 Genomes Project, unfortunately, posted the ages of the participants, which at the time seemed innocuous enough, and it was common knowledge within the scientific community that they all lived in Utah.  With these three pieces of information (their age, their location and, from the scientists’ data mining, a possible surname), the scientists were then able, if the surname wasn’t something like Smith or Jones, to use publicly available Google and “white pages” types of searches to find people in that state, of that age, by that surname.  Then, using obituaries and such, they connected them through online family trees to their more distant families.  They did this with Craig Venter, for example.
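To show how little is involved in that narrowing step, here is a minimal sketch, again in Python.  Everything in it is hypothetical: the `PublicRecord` type, the `narrow_candidates` function, the one-year age tolerance and the four sample records are invented for illustration and do not describe the researchers’ actual tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PublicRecord:
    """One entry from a made-up 'white pages' style listing."""
    full_name: str
    surname: str
    age: int
    state: str


def narrow_candidates(records, surnames, age, state, age_tolerance=1):
    """Keep records matching a candidate surname, the state, and roughly the age."""
    wanted = {s.lower() for s in surnames}
    return [
        r for r in records
        if r.surname.lower() in wanted
        and abs(r.age - age) <= age_tolerance
        and r.state == state
    ]


# Invented sample data; none of these people exist.
RECORDS = [
    PublicRecord("Avery Birch", "Birch", 67, "UT"),
    PublicRecord("Blake Birch", "Birch", 34, "UT"),
    PublicRecord("Casey Alder", "Alder", 67, "NV"),
    PublicRecord("Devon Alder", "Alder", 68, "UT"),
]

candidates = narrow_candidates(RECORDS, ["Birch", "Alder"], age=67, state="UT")
print([r.full_name for r in candidates])  # ['Avery Birch', 'Devon Alder']
```

With a common surname the list would stay far too long to be useful, which is why Smith and Jones are effectively safe from this approach.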

This technique is nothing new to genealogists, as we’ve been finding cousins that way for years.  The difference, of course, is that we didn’t data mine the 1000 Genomes Project (in this case more aptly referred to as “scientific hacking”) to find the participants’ Y-line DNA markers and determine a possible surname for them.  That is the issue and the point of this article, and ironically, it’s scientists who did it, then published the “how-to” manual.

Any genetic genealogist knows, especially anyone dealing with adoptees, that you can only reveal a biological surname about 30% of the time.  In fact, the scientists’ success rate was lower, 12%.  But that’s actually irrelevant in the bigger context of the article.  Their point was that they succeeded at all.

This is sort of like putting personal information on the internet, except your name, and then being surprised that someone could connect the dots and put the pieces together.  No one would be surprised today if that were to happen.  In fact, I’m sure we all have received cautions and warnings about putting too much info on Facebook because burglars were robbing homes when people were vacationing.  Many people have their hometown, their high school and their birthday and year publicly available on Facebook.  Now how many “security questions” does that answer right there?  Combine that with your dog’s name and your mother’s maiden name and you’ve got almost all of the common ones.

Aside from the fear-mongering, I have three issues with these reports as a whole.

1.  Statements like “they traced those three family tree pedigrees to find other connections between relatives and sensitive genetic data.”  Whoa, stop right there.  Just because you share a surname, or even if you are a direct and immediate relative, that says nothing, absolutely nothing, about whether or not you inherited a genetic predisposition to some health issue.  Remember, children inherit half of their DNA from each parent.  So unless they are finding identical twins or parents, one cannot infer that an entire family tree of people share frightening health traits.  It’s irresponsible to suggest otherwise.

2.  “For years, experts have worried that sensitive genetic data could be used to discriminate against patients, potential employees or would-be insurance customers.  Such discrimination is illegal when it comes to employment or health insurance, but the law doesn’t cover life insurance, disability insurance or long-term care insurance.  Theoretically an insurer could search through genetic records and turn you down because you have a genetic predisposition to, say, Alzheimer’s disease.”

Discrimination is an issue, and laws have been put in place to prohibit discrimination in the workplace.  But insurers aren’t going to sift through genetic data like a private investigator.  Suggesting this is unnecessary fear-mongering.  Insurers don’t do that.  They simply tell you that a blood test is a prerequisite for obtaining insurance.  I know, because I bought life insurance and they sent a nurse to my house to verify my identity and take a blood sample.  At that time, they were looking for diabetes, AIDS and probably a whole lot more.  Today, they might be looking for genetic predispositions.  I don’t know, but I do know they have a direct method of obtaining that information, and it’s not spending untold hours sifting through someone else’s data that likely isn’t relevant to you anyway.

3.  This “research” project originated at the Whitehead Institute, an affiliate of MIT, a publicly funded institution.  When Yaniv Erlich dreamed up this new hacking technique, he said he couldn’t resist trying it.  So instead of simply discovering a potential issue and privately and quietly working with the proper people to resolve it, he decided to exploit it publicly, obtaining, I suppose, his 15 minutes of fame.  So yes, your tax dollars did indeed likely pay for some or all of this “research.”

In one of the articles, Dr. Jeffrey R. Botkin, associate vice president for research integrity at the University of Utah, which collected the genetic information of some research participants whose identities were breached, cautioned about overreacting.  “Genetic data from hundreds of thousands of people have been freely available online,” he said, “yet there has not been a single report of someone being illicitly identified.”  He added that “it is hard to imagine what would motivate anyone to undertake this sort of privacy attack in the real world.”  But he said he had serious concerns about publishing a formula to breach subjects’ privacy.  By publishing, he said, the investigators “exacerbate the very risks they are concerned about.”

Well, it’s obvious that these folks at the Whitehead Institute don’t live in the real world and clearly don’t have enough real scientific research to do.

So, what is the take-home from all of this?

  • You are not at risk of having anything exposed in this incident unless you are one of the 1000 people in the 1000 Genomes Project.  If you are part of the 1000 Genomes Project, and male, there is a 12% risk that they figured out your last name and, using other tools, possibly who you are, along with your family.  If you are related to someone in the 1000 Genomes Project, the researchers might have figured out that you are related to them.  So now the risk is that they’ll do what with that information?  Guaranteed, if you simply stop paying your bills, someone will figure out the same information much more quickly, without your DNA and without government funding.
  • If you participate in a research project, such as the 1000 Genomes Project, where your full results are made publicly available, you sign a release, and that release indicates that your privacy may not be able to be protected.  You are aware of the risks before you begin.
  • We, as a community, have been warned for years not to put information that might be medically informative on the internet, such as full sequence mitochondrial DNA information.  Anyone who does so, does it at their own risk.  The people in the 1000 Genomes Project knowingly took that risk.
  • If you stay within the confines of the genealogy and DTC mainstream testing companies, you are fairly well protected.  Having said that, reading the consent forms of any of the companies makes it clear that your identity is never entirely protected.  We’re genealogists after all.  What good is genealogical testing if you can’t contact people you match?
  • Inferred health risks are not the issue they are being portrayed to be in these articles.  Your cousin’s health risks are not necessarily yours.  Genetic inheritance is a complex and individual event.  If you want proof of that, test your family at www.23andMe.com and look at the differences in health risks for various diseases.
  • Insurers who can use health information to restrict or deny insurance are simply going to request a blood sample.  They are not going to act like a bloodhound on the scent of a rabbit and sort through tons of information for inferences.  Why would they, when they can obtain the information they seek directly and much less expensively?
  • For those researchers involved with information made publicly available, such as the 1000 Genomes Project, this is a wake-up call that perhaps less publicly available information is better.  Some information, such as ages and location, should perhaps be available only to legitimate researchers, which would still have included the Whitehead Institute people, but would have taken away much of their thunder.  I understand this change has already been implemented, but that doesn’t entirely mitigate the issue of data mining publicly available full genomic sequence information for identity.  It only makes it a little more difficult and less likely to succeed.
  • I clearly understand why hackers want my bank account information, and why identity thieves want my personal information, but why, in the real world, not at the Whitehead Institute, would anyone ever spend the time and effort to do this?  The motivation for these researchers was clearly to publish, but I can think of no reason other than that, or simply “because they could,” to spend the time doing something like this.  Who would want to and for what purpose?
  • The sky is not falling.

The scientific article that started all of this hubbub is behind a paywall, but you can access it here.