Last week, scientists at the European Molecular Biology Laboratory reported that they had sequenced the genome of the Henrietta Lacks, or “HeLa”, cell line. This report was met with considerable consternation by those who (justifiably, in my opinion) wondered why scientists are still experimenting on a cell line obtained without consent in the 1950s. In response to a bit of a backlash, the researchers removed the HeLa sequence from the public internet, and even the paper itself might disappear from the formal scientific literature.
However, it is unfair to treat the authors of this paper as scapegoats for the systematic failure of scientists to deal with issues surrounding genomic “privacy”. Consider this important piece of information: the genome sequence of the HeLa cell line has been publicly available for years (and remains so).
This is a simple consequence of the fact that nearly every large-scale molecular biology technique relies on DNA sequencing as a readout. Every time anyone does a genome-scale experiment (for example, RNA-seq or ChIP-seq) on HeLa cells and archives the data, they are explicitly making public the genome sequence of HeLa cells, and thus of Henrietta Lacks.
It’s quite trivial to demonstrate this point. HeLa cells are one of the cell lines used in the ENCODE project, an ambitious effort to functionally characterize the entirety of the human genome. As part of this effort, they (among many others) have been subjected to an array of genomic experiments, almost all of which of course generated sequencing data. I downloaded a few sequencing files from these experiments, and combined the HeLa sequence with other publicly available genome sequences from the 1000 Genomes Project (for full details, see below).
Does the data on HeLa cells contain enough information to say anything about Henrietta Lacks? Plotted above is the output of principal components analysis computed on genetic data from Nigerians, northern Europeans, Chinese, African-Americans, and HeLa. Each point represents an individual, and individuals that fall closer together are more similar genetically. Based on this plot, we can see that the HeLa cells are quite clearly from an African-American woman (or at least someone who is admixed between European and African populations).
These ancestry results are just a proof-of-principle, but any genetic analysis of disease risk or other phenotypic traits is of course just as trivial. This is not true just for HeLa cells, but across the board: the genomes of the donors of every cell line studied by ENCODE are publicly available, and can be analysed for ancestry or disease risk. Though the identities of the donors are not known in most cases besides HeLa, using techniques like those used by Gymrek et al., it may be possible to link cell lines to last names, and thus genetic information to individual people.
See the spectacular “The Immortal Life of Henrietta Lacks” for the complete back story on this cell line.
I downloaded sequencing data from the following GEO accessions: SRR227441, SRR227442, SRR227445, SRR227446, SRR227472, SRR227473, SRR227505, SRR227506, SRR227556, SRR227557, SRR350914, SRR350915, SRR568260, SRR568261, SRR577378, SRR577379, SRR577392, SRR577393, SRR577429, SRR577430.
These are all ChIP-seq experiments on HeLa cells from the ENCODE project.
I then mapped reads to human hg19 using bwa. To compare to the 1000 Genomes data, I used the genotypes from the Illumina OMNI array. To merge HeLa into these data, I randomly sampled a single sequencing read covering each site on the array (at all sites that had at least a single read covering them) and reported the base from that read as the HeLa “genotype” (calling heterozygotes is moderately annoying, so I didn’t try it). In total I genotyped 2,095,422 of 2,177,885 sites (96%) successfully with this approach. I then ran PCA (using smartpca) on the genotypes from the YRI, ASW, CEU, CHB, and HeLa samples.
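For readers who want to see the read-sampling step spelled out, here is a minimal sketch in Python. It is not the script I actually ran: the function name, the site IDs, and the toy pileup are all made up for illustration, and it assumes the bases observed at each array site have already been pulled out of the bwa alignments.

```python
import random


def sample_pseudo_haploid(reads_by_site, seed=0):
    """Pick one randomly chosen read base per covered site.

    reads_by_site is a hypothetical mapping from an array site ID to the
    list of bases seen in reads overlapping that site. Sites with no
    coverage are skipped, so they end up as missing genotypes.
    """
    rng = random.Random(seed)
    calls = {}
    for site, bases in reads_by_site.items():
        if bases:  # at least a single read covers this site
            calls[site] = rng.choice(bases)
    return calls


# Toy pileup (invented data): four array sites, one of them uncovered.
reads_by_site = {
    "rs1": ["A", "A", "G"],
    "rs2": ["C"],
    "rs3": [],
    "rs4": ["T", "T"],
}

calls = sample_pseudo_haploid(reads_by_site)
call_rate = len(calls) / len(reads_by_site)
print(calls, call_rate)

# The call rate reported in the text, computed the same way.
reported_rate = 2095422 / 2177885
print(f"{reported_rate:.1%}")  # 96.2%
```

Sampling a single read sidesteps heterozygote calling entirely: each site gets one observed allele, which is crude but enough for a PCA on ancestry.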