Hackers and Your Genetic Secrets

Did that title get your attention?  Well, it was meant to, just like it was meant to in this NBC article titled “Scientists Demonstrate How Hackers Could Unlock Your Genetic Secrets.”  Or how about this one in the New York Times, “Web Hunt for DNA Sequences Leaves Privacy Compromised”?  Sensationalism sells…and so does fear.  Don’t panic, the sky is not falling.

Several people have forwarded me links to articles about this, expressing concern.  Most people didn’t really understand what was going on…and since “family tree databases” were mentioned in the first paragraph, it frightened them.

This article says that the “security cracking trick relies on the availability of genetic information linked to surnames in a variety of public family-tree databases.”  Well, that’s sort of true, but not exactly true.  The issue is not the family tree databases.  It’s the fact that the researchers in the 1000 Genomes Project, while keeping the names of those 1000 people “anonymous,” provided enough information that these scientific researchers, not hackers, were able to data mine the 1000 Genomes participants’ information to determine their Y-DNA marker values, and then compare those haplotypes (marker values) just like we do in databases such as Ysearch and Sorenson.  And yes, they likely had matches to several surnames, like most of us do.
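For readers who have never compared haplotypes, here is a minimal sketch of the kind of comparison described above.  It is not the researchers’ actual method.  The surnames, the marker values and the simple “sum of differences” distance are all assumptions made up for illustration, and real databases treat some markers differently.

```python
# Hypothetical Y-STR haplotypes: marker name -> repeat count.
# Every surname and value here is invented for illustration only.
SURNAME_DB = {
    "SurnameA": {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS391": 11},
    "SurnameB": {"DYS393": 13, "DYS390": 23, "DYS19": 15, "DYS391": 10},
    "SurnameC": {"DYS393": 14, "DYS390": 22, "DYS19": 14, "DYS391": 10},
}


def genetic_distance(a, b):
    """Sum of absolute repeat-count differences over the markers both share."""
    shared = a.keys() & b.keys()
    return sum(abs(a[m] - b[m]) for m in shared)


def candidate_surnames(query, db, max_distance=1):
    """Return (surname, distance) pairs within max_distance, closest first."""
    hits = [(name, genetic_distance(query, hap)) for name, hap in db.items()]
    return sorted((h for h in hits if h[1] <= max_distance), key=lambda h: h[1])


# An "anonymous" haplotype, again invented.
anonymous = {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS391": 10}
print(candidate_surnames(anonymous, SURNAME_DB))
```

A close match only suggests a possible surname.  As noted further on, it is right only a fraction of the time.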

Individuals in the 1000 Genomes Project signed a release indicating that they knew their data would be used publicly and that, although their identity would not be revealed, researchers could not guarantee their privacy.  The 1000 Genomes Project, unfortunately, posted the ages of the participants, which at the time seemed innocuous enough, and it was common knowledge within the scientific community that they all lived in Utah.  With these three pieces of information, their age, their location, and from the scientists’ data mining, a possible surname, the scientists were then able, if the surname wasn’t something like Smith or Jones, to use publicly available Google and “white pages” types of searches to find people in that state, of that age, by that surname, and then using obituaries and such, connect them through online family trees to their more distant families.  They did this with Craig Venter, for example.

This technique is nothing new to genealogists, as we’ve been finding cousins that way for years – the difference being of course that we didn’t data mine, otherwise in this case more aptly referred to as “scientific hacking,” the 1000 Genomes Project in order to find their Y-line DNA markers to determine a possible surname for them.  That is the issue and the point of this article and ironically, it’s scientists who did it, then published the “how-to” manual.

Any genetic genealogist knows, especially anyone dealing with adoptees, that you can only reveal a biological surname about 30% of the time.  In fact, the scientists’ success rate was lower, 12%.  But that’s actually irrelevant in the bigger context of the article.  Their point was that they succeeded at all.

This is sort of like putting personal information on the internet, except your name, and then being surprised that someone could connect the dots and put the pieces together.  No one would be surprised today if that were to happen.  In fact, I’m sure we all have received cautions and warnings about putting too much info on Facebook because burglars were robbing homes when people were vacationing.  Many people have their hometown, their high school and their birthday and year publicly available on Facebook.  Now how many “security questions” does that answer right there?  Combine that with your dog’s name and your mother’s maiden name and you’ve got almost all of the common ones.

Aside from the fear-mongering, I have three issues with these reports as a whole.

1.  Statements like “they traced those three family tree pedigrees to find other connections between relatives and sensitive genetic data.”  Whoa, stop right there.  Just because you share a surname or even if you are a direct and immediate relative, that says nothing, absolutely nothing, about whether or not you inherited a genetic predisposition to some health issue.  Remember, children inherit half of their DNA from each parent.  So unless they are finding identical twins or parents, one cannot infer that an entire family tree of people share frightening health traits.  It’s irresponsible to suggest otherwise.

2.  “For years, experts have worried that sensitive genetic data could be used to discriminate against patients, potential employees or would-be insurance customers.  Such discrimination is illegal when it comes to employment or health insurance, but the law doesn’t cover life insurance, disability insurance or long-term care insurance.  Theoretically an insurer could search through genetic records and turn you down because you have a genetic predisposition to, say, Alzheimer’s disease.”

Discrimination is an issue, and laws have been put in place to prohibit discrimination in the workplace.  But insurers aren’t going to sift through genetic data like a private investigator.  Suggesting this is unnecessary fear-mongering.  Insurers don’t do that; they simply tell you that a blood test is a prerequisite of obtaining insurance.  I know, I bought life insurance and they sent a nurse to my house to verify my identity and take a blood sample.  At that time, they were looking for diabetes, AIDS and probably a whole lot more.  Today, they might be looking for genetic predispositions.  I don’t know, but I do know they have a direct method of obtaining that information and it’s not spending untold hours sifting through someone else’s data that likely isn’t relevant to you anyway.

3.  This “research” project was inspired at Whitehead Institute, an affiliate of MIT, a publicly funded institution.  When Yaniv Erlich dreamed up this new hacking technique, he said he couldn’t resist trying it, so instead of simply discovering a potential issue and privately and quietly working with the proper people to resolve the issue, he decided to exploit it publicly, obtaining, I suppose, his 15 minutes of fame.  So yes, your tax dollars did indeed likely pay for some or all of this “research.”

In one of the articles,  Dr. Jeffrey R. Botkin, associate vice president for research integrity at the University of Utah, which collected the genetic information of some research participants whose identities were breached, cautioned about overreacting. “Genetic data from hundreds of thousands of people have been freely available online,” he said, “yet there has not been a single report of someone being illicitly identified.”  He added that “it is hard to imagine what would motivate anyone to undertake this sort of privacy attack in the real world.” But he said he had serious concerns about publishing a formula to breach subjects’ privacy. By publishing, he said, the investigators “exacerbate the very risks they are concerned about.”

Well, it’s obvious that these folks at Whitehead Institute don’t live in the real world and clearly don’t have enough real scientific research to do.

So, what is the take home of all of this?

  • You are not at risk of having anything exposed in this incident unless you are one of the 1000 people in the 1000 Genomes Project.  If you are part of the 1000 Genomes Project, and male, there is a 12% risk that they figured out your last name and using other tools, possibly who you are, along with your family.  If you are related to someone in the 1000 Genomes Project, the researchers might have figured out that you are related to them.  So now the risk is that they’ll do what with that information???  Guaranteed, someone will figure out the same information and much more quickly, without your DNA and without government funding if you simply stop paying your bills.
  • If you participate in a research project, such as the 1000 Genomes Project, where your full results are made publicly available, you sign a release, and that release indicates that your privacy may not be able to be protected.  You are aware of the risks before you begin.
  • We, as a community, have been warned for years not to put information that might be medically informative on the internet, such as full sequence mitochondrial DNA information.  Anyone who does so, does it at their own risk.  The people in the 1000 Genomes Project knowingly took that risk.
  • If you stay within the confines of the genealogy and DTC mainstream testing companies, you are fairly well protected.  Having said that, reading the consent forms of any of the companies makes it clear that your identity is never entirely protected.  We’re genealogists after all.  What good is genealogical testing if you can’t contact people you match?
  • Inferred health risks are not the issue they are being portrayed to be in these articles.  Your cousin’s health risks are not necessarily yours.  Genetic inheritance is a complex and individual event.  If you want proof of that, test your family at www.23andMe.com and look at the differences in health risks for various diseases.
  • Insurers who can use health information to restrict or deny insurance are simply going to request a blood sample.  They are not going to act like a bloodhound on the scent of a rabbit and sort through tons of information for inferences.  Why would they when they can obtain the information they seek, directly and much less expensively?
  • For those researchers involved with information made publicly available, such as the 1000 Genomes Project, this is a wake-up call that perhaps less information available publicly is better.  Some information, such as ages and location, should perhaps be available only to legitimate researchers, which would still have included the Whitehead Institute people, but would have taken away much of their thunder.  I understand this change has already been implemented, but that doesn’t entirely mitigate the issue of genetic data mining publicly available full genomic sequence information for identity; it only makes it a little more difficult and less likely to succeed.
  • I clearly understand why hackers want my bank account information, and why identity thieves want my personal information, but why, in the real world, not at Whitehead Institute, would anyone ever spend the time and effort to do this?  The motivation for these researchers was clearly to publish, but I can think of no reason other than that or simply “because they could” to spend the time doing something like this.  Who would want to and for what purpose?
  • The sky is not falling.

It’s behind a paywall, but you can access the scientific article here that started all of this hubbub.

The Future of Genetic Genealogy – Dream Big

I spent many years working with clients in the technology space and when I did needs assessments for them, I used to tell them, “Dream Big, the sky is the limit.  Do not edit yourself by using the word ‘but.’  Let me do the editing.”  That freed them of all the reasons why they couldn’t and allowed them to look at everything as potentially possible.

One of our blog followers asked me what I saw as the future of genetic genealogy and what my wish list would be.  That was a few weeks ago.  I’ve been thinking.  And dreaming big.

As many of you know, I have been on a many-years (OK, multiple decades) quest to prove or disprove my Native American heritage based on tidbits and whispered secrets.  Ironically, the line where it was supposed to have existed came up quite barren, although there are still some females without surnames.  However, other lines have shown both Native and African ancestors.  So I have been duly rewarded for my years of persistence, some would say obsessiveness.

Many years ago, back in the genetic genealogy dark ages, in 2003, a company that no longer exists introduced a test that provided customers with percentages of ethnicity based on about 150 autosomal markers.  My test results were returned as 10% Native American and 15% East Asian, which was interpreted to be another flavor of Native American, for a total of 25%.

You can read about this test and others to detect minority admixture, meaning minority in the sense of not your primary ethnicity, in the paper titled Revealing American Indian and Minority Heritage Using Y-line, Mitochondrial, Autosomal and X Chromosomal Data Combined with Pedigree Analysis.  This paper was published in the Journal of Genetic Genealogy, Vol. 6 #1 in 2010.

As excited as I was about these 2003 results, I knew the percentages had to be wrong, because I had done enough genealogy that I knew that 25% equaled one grandparent, and I didn’t have that much Native ancestry.  However, it did confirm that I was not hunting for a needle in the proverbial haystack that did not exist.  And yes, I eventually found more than one needle along with a few slivers along the way.

However, obtaining that confirmation that I had Native ancestry did not satisfy me.  That would be like saying that finding a new ancestor satisfies the genealogist, and we ALL KNOW that finding a new ancestor simply whets your appetite and stokes the fires for more.  That’s why genealogy is never done.  Each discovery, each question answered, leads to at least two more.

So I began to mercilessly hound those whom I could corner and asked about using autosomal DNA for ancestor identification. I asked Bennett Greenspan about this, several times, in several different ways.  I remember him groaning and simply saying it wasn’t going to happen.  He had a million reasons why.  I didn’t care.  I knew that those were only temporary constraints.  I asked Michael Hammer, Max Blankfeld, Matt Kaplan, Bruce Walsh and I think I even asked Spencer Wells.  All of them said no, in a number of different and very innovative ways.  Well, I’m a mother, and I can say no with the best of them, and no matter how nicely or covered in techno-speak it is, no is still no.

They told me it would be too expensive, there were not enough reference models, it had never been done before, and the technology wasn’t there.  I knew they were right at that time, but logically, I knew it could be done and, I hoped, would be someday. I think it was Bruce who said “never” when I pushed him a little.  He was very gracious about eating those words a few years later and kind of chuckled, shrugged his shoulders, smiled and said, “Science is science.”  It’s so true, what couldn’t be done yesterday and was barely imaginable is now routine.  Bennett’s infamous story of how Michael Hammer finally agreed to test his Y chromosome back in 2000 (if Bennett would just go away and stop hounding him) is living proof of that.  So is Michael’s “throw away line” of “You know, someone should start a business doing that.”  Never say that to an entrepreneur.  Of course, the result is Family Tree DNA.  I love living in an age of innovation and being able to work with wonderful and innovative scientists and businessmen.

My autosomal questions that met with repeated rejection were in the 2003-2004 timeframe.  In 2007, just a mere 3 or 4 years later, 23andMe introduced their wide spectrum testing product.  This product tested hundreds of thousands of locations, not a few, and was really focused towards health.  However, they offered “cousin matching” and percentages of ethnicity. So, now we know how long “never” is in this industry – between 3 and 4 years.

Bennett groaned the next time I talked to him.  I’m amazed that the man still speaks to me at all.  Yes, we hounded Bennett and Max relentlessly, but being the savvy businessmen that they are, they realized that the future of genetics and therefore genetic genealogy was founded in more information, more data, and he (or she) who would be king of that mountain would not only offer the testing, but user friendly tools to use the data and results effectively and integrate them into a larger whole.

So here we are today, with the Geno 2.0 product having been just released – sporting new autosomal SNPs and thousands of Y-line SNPs, more than 10,000 of them – all chip based of course using newly written coding techniques to achieve accuracy never before available.  These are all innovations that we could have only dreamed about 5 years ago, before the current technology was available, or maybe we couldn’t even dream that big back then.  After all, that was before “never.”

So here is my wish list, where I think we can and should go – and why.  And yes, I know there will be people who tell me why we can’t or how difficult it will be.  But I have learned some modicum of patience and now that I know how long never is, I’m prepared to wait…

Mitochondrial DNA Data Base

As an industry, we are really missing the boat on this one.  Do you want to find out if anyone has tested who descends from your ancestor, Ann McKee born in 1805 in Washington County, Virginia?  You simply can’t do that.  Can’t be done today.

If you want to check on a male ancestor, her husband, Charles Speak, for example, or her father, Andrew McKee, you can go to the Speak or McKee projects and see if either line has been tested or you can go to Ysearch.

But you can’t do that for women.  Between Anne McKee and me are 4 surnames (generations), Speak, Claxton, Bolton and of course, Estes.  Descending through females means dealing with multiple surnames, because every female in each family married someone with a different surname and began that domino effect of surname changes.  Anne McKee had 7 sisters and between all of them, they have literally hundreds of descendants today, some of whom carry her mitochondrial DNA.  I find it hard to believe that none of them have tested their mitochondrial DNA, but there is no way to find them if they have.

We need a centralized Mitochondrial DNA Data base where you can upload a Gedcom file or you can enter the direct mitochondrial DNA line via prompts.  Why prompts?  Because I can’t tell you how many people complete the oldest mitochondrial ancestor field with some man’s name.  If you prompt them with words like “her mother” at each step of the way, we won’t wind up with the wrong ancestral line attached to the mtDNA.

Recently someone sent me a request having to do with a particular family line and whether or not their ancestor was Jewish.  If I had been able to look in any data base, anyplace, I would have perhaps been able to see if anyone from that maternal line has tested, and the results, similar to projects and Ysearch.  In Ysearch, you can search by surname and it will also show you other pedigree charts in which the name is found, but Mitosearch has no such capability.

Unfortunately, this is a vicious circle.  People tell me that there isn’t the interest in mitochondrial DNA testing that there is in Y-line.  While that’s true, it’s not an absolute and the lack of these tools and data base is decreasing the interest and fostering a sense of hopelessness.  Adding this tool and encouraging people to use it, and prompting them through the steps, would not only increase interest, but would provide a huge service to the genetic genealogy community as a whole.

How many of your mitochondrial lines have been tested but you don’t know it because you have no tools to find them???

Personal Genome Mapping Projects

Today, those on the bleeding edge of autosomal technology are mapping their chromosomes – but we have to do this the hard way today.  There are no tools.

The first step is phasing if you are fortunate enough to have parents or someone you can positively identify from either side or both sides of your family.

This nicely divides your genome in half – your Mom’s side and your Dad’s side.  This allows you to determine, when you receive a match, based on whom else they match, mother’s or father’s line, which side the match is from.  This immediately narrows the match possibilities to half of your ancestors, which is a huge benefit.
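Here is a simple sketch of that sorting step, assuming both parents have tested and you have a list of their matches.  The function and the match names are hypothetical, and a real tool would also compare the matching segments themselves, not just the names.

```python
def assign_side(match_id, mother_matches, father_matches):
    """Label one of my matches by which tested parent also matches them."""
    in_mom = match_id in mother_matches
    in_dad = match_id in father_matches
    if in_mom and in_dad:
        return "both"
    if in_mom:
        return "maternal"
    if in_dad:
        return "paternal"
    return "unassigned"


# Invented match lists for illustration.
mother_matches = {"Match 1", "Match 3"}
father_matches = {"Match 2", "Match 3"}

for match in ["Match 1", "Match 2", "Match 3", "Match 4"]:
    print(match, assign_side(match, mother_matches, father_matches))
```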

As this phasing and matching of people continues, it means that we can color in parts of our personal genetic map with certain ancestors.  For example, I know that I match 3 Vannoy cousins on chromosome 15, so the part of chromosome 15 that I received from my Dad is “Vannoy” and I can “color in” that part as confirmed Vannoy.

The first company to provide us with a tool to allow us to “color” our chromosomes by ancestral family and keep track of who is connected to which location will be a big winner overall.  Today, we do it manually on a spreadsheet.

This could be done much more easily with automated tools, and the information is available to do it.  Obviously some type of data base and Gedcom type tools would be required for this too, but perhaps some of the effort invested in the mitochondrial DNA data base could be leveraged here as well, especially if both were designed as an integral part of a large system encompassing and combining the genealogy with the genetic tools we need.
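The spreadsheet we keep by hand today is really just a table of segments, so an automated version might start out something like the sketch below.  The field names, the positions and the “Other line” family are invented.  Only the Vannoy match on my Dad’s side of chromosome 15 comes from my own example.

```python
from collections import namedtuple

# One row per matching segment, like one row of the manual spreadsheet.
Segment = namedtuple(
    "Segment", ["chromosome", "side", "start", "end", "ancestor", "cousin"]
)

# Positions are made-up base-pair coordinates.
SEGMENTS = [
    Segment(15, "paternal", 30_000_000, 45_000_000, "Vannoy", "Cousin 1"),
    Segment(15, "paternal", 35_000_000, 50_000_000, "Vannoy", "Cousin 2"),
    Segment(15, "paternal", 32_000_000, 41_000_000, "Vannoy", "Cousin 3"),
    Segment(3, "maternal", 10_000_000, 18_000_000, "Other line", "Cousin 4"),
]


def ancestors_at(chromosome, side, position, segments):
    """Which ancestral families are 'colored in' at this spot?"""
    return sorted({
        s.ancestor for s in segments
        if s.chromosome == chromosome and s.side == side
        and s.start <= position <= s.end
    })


def cousins_for(ancestor, segments):
    """Who is connected to this ancestral family anywhere on the map?"""
    return sorted({s.cousin for s in segments if s.ancestor == ancestor})


print(ancestors_at(15, "paternal", 40_000_000, SEGMENTS))
print(cousins_for("Vannoy", SEGMENTS))
```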

Ancestor Reconstruction Mapping Projects

The next logical step in this progression is the reconstruction of our ancestors (on paper, not literally) using genetic mapping.  If we can map our own genome, then we can take the parts of all of the descendants and map the ancestor.

For example, if I know that my common ancestor with all of these Vannoy cousins is John Francis Vannoy, born in 1719, through his various sons, then I can “create” a chromosome model of John Francis Vannoy and begin to reassemble him, sort of a genetic reconstitution.  Over time, as more cousins match and prove their genesis to John, then we can color in more parts of John or his ancestors that I don’t carry, but others do.
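One way to picture “coloring in” an ancestor is to pool the segments that different descendants have attributed to him and merge the overlaps.  This is only a sketch under that assumption.  The cousins and the positions (in megabases on a single chromosome) are invented.

```python
def merge_intervals(intervals):
    """Combine overlapping (start, end) segments into continuous stretches."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous stretch, so extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical segments each cousin attributes to the shared ancestor.
segments_by_cousin = {
    "Cousin 1": [(10, 25)],
    "Cousin 2": [(20, 40)],
    "Cousin 3": [(60, 70)],
}

pooled = [seg for segs in segments_by_cousin.values() for seg in segs]
reconstructed = merge_intervals(pooled)
covered = sum(end - start for start, end in reconstructed)

print(reconstructed)
print(covered, "megabases of the ancestor colored in")
```

Each new cousin who proves descent adds segments to the pool, so the colored-in portion can only grow.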

Maybe someday we can also further divide John into his ancestors.  His father was Francis Vannoy and his mother was an Anderson.  John Francis Vannoy carries parts of those and other ancestors as well.  His grandmother was an Opdyke and his other grandmother was possibly a Cornwall.

I’d love to have a chromosomal GIS map in the future.  For those who don’t know what a GIS map is, GIS stands for Geographic Information Systems and these maps can be peeled away in layers.  For example, we could start with ourselves and then “assemble” the Vannoy parts of us and also the Vannoy parts of other cousins into a “Vannoy” ancestor whose various parts, like Anderson, Cornwall, Opdyke and of course earlier Vannoys could then be layered onto their own maps so that we could virtually “see” what our ancestors looked like genetically.  Other layers of ourselves, like a Miller layer, an Estes layer, etc. could also be peeled away to become part of Johann Michael Miller and Abraham Estes, the progenitors of those lines as well.  Of course, this requires collaboration.  We could call these our Wiki-Ancestor maps.

Ancestor Matching

If we can map ancestors then we can also match those ancestors.  Let’s say I’m brick walled for example on my Moore line.  I have the Y-line, but I’m stumped beyond that with no matches that can take me beyond my brick wall in Halifax Co., Va.  My William Moore born about 1750 was the son of James, born about 1720 and wife Mary Rice, but William’s wife only has a first name, Lucy.  We have always suspected that she might be a Henderson.

Let’s say we can genetically map some of William and James.  In this process, we discover that parts of William’s children in that Moore line also match a Henderson ancestor who is being reconstructed by the Henderson project administrator.  If Henderson matches are only present for the children of William, not his siblings’ descendants, this would strongly suggest that his wife was a Henderson or at least closely related to them.
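The reasoning in that last sentence boils down to a simple test, sketched below.  The descendants, their matches and the function are all hypothetical.  The only assumption borrowed from my own tree is that a line coming through James and Mary Rice would show up on both William’s branch and his siblings’ branches.

```python
def suggests_spouse_line(surname, matched_lines, branch_of, target_branch):
    """True if the surname matches only within the target branch."""
    def hit(descendant):
        return surname in matched_lines.get(descendant, set())

    in_target = [d for d, b in branch_of.items() if b == target_branch]
    elsewhere = [d for d, b in branch_of.items() if b != target_branch]
    return any(hit(d) for d in in_target) and not any(hit(d) for d in elsewhere)


# Which branch each tested descendant comes through (invented).
branch_of = {
    "Descendant A": "William",
    "Descendant B": "William",
    "Descendant C": "sibling of William",
}

# Reconstructed ancestral lines each descendant matches (invented).
matched_lines = {
    "Descendant A": {"Henderson"},
    "Descendant B": {"Henderson", "Rice"},
    "Descendant C": {"Rice"},
}

print(suggests_spouse_line("Henderson", matched_lines, branch_of, "William"))
print(suggests_spouse_line("Rice", matched_lines, branch_of, "William"))
```

A result of True is only a hint worth chasing with traditional genealogy, not proof.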

Taking this a step further, we have very few matches with Moores on the Y-line and all that we do match are brick walled as well, often later in time than we are.  If we can genetically map some of our Moore line, we can then potentially match another Moore line that is also being mapped, but that doesn’t have any people who have tested the Y-line.  In some cases, one could still be related to the Moore line, not through the Y-line, but through a son born illegitimately to a Moore daughter, hence carrying the Moore surname, but not the ancestral Moore Y chromosome.  That would explain why the Y-line doesn’t match, but would connect to the correct Moore family in spite of that little difficulty.

Ancestor matching would increase our opportunities of knocking down those pesky long-standing brick walls that have failed to fall with Y-line testing and genealogy alone.

Full Genome Testing

All of what I’ve described above is just the tip of the iceberg.  When full genome testing becomes available, it will be the power of the matching tools that make a difference.  Full genome testing without associated tools will be worthless.  I hope that we as a community take the opportunity now to lay the foundation for the wonderful future that lies in front of us, beckoning and begging us to pave the road to get there.  Our ancestors are waiting to be discovered.  I can see them just beyond the horizon, waiting to be plucked from obscurity.  Can you?