I agreed to make my 23andMe genotyping results publicly available as part of GNZ without a moment’s hesitation. This is in part because I knew the results were actually a bit dull (in a good way, I suppose) – I’m not at vastly increased or decreased risk for any diseases (based on research so far), and I was unsurprised to find out that I have blue eyes. I was also unsurprised that 23andMe identified me as most likely of north European ancestry.
Several hours after we released our data, however, I was pointed to a post where Dienekes Pontikos wrote about the results of running all our data through his ancestry prediction program. While just about everyone was quite confidently predicted to be almost entirely of northwestern European descent, this analysis gave me a point estimate of 20% Ashkenazi Jewish ancestry. Within hours, several people had asked me about this, and I had no real response. So I decided to take a look at the data myself; some basic analyses are below.
What does Dienekes’s program do?
First, a quick bit of background. The program used above is based on a paper exploring differences in allele frequencies (population structure) in European-Americans [1]. In that study, the authors identified two major components of variation in ancestry, roughly corresponding to three clusters of individuals: those of predominantly northwest European descent, southeast European descent, and Ashkenazi Jewish descent. They assembled a list of ~300 genetic markers which were highly informative about ancestry in their sample, and made publicly available the allele frequencies of those markers in the three groups.
What Dienekes’s program does is use those allele frequencies (at the ~150 markers that are included in the 23andMe genotyping platform) to infer the proportional membership of an individual in each group. For example, if an individual has the genotype CC at a SNP, and the C allele has 20% frequency in northwestern Europe and 60% frequency in the Ashkenazi, that provides some evidence that the individual in question is more likely to be Ashkenazi. Summed across all loci, one can estimate the overall fraction of the genome of the individual from each group.
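To make the per-SNP logic concrete, here is a minimal sketch. It is not Dienekes’s code, and the details are assumptions: genotypes are treated as two independent allele draws (Hardy-Weinberg), SNPs are treated as independent, and the proportions are fitted with a simple EM update. The function names and the toy data are hypothetical.

```python
import math

def genotype_likelihood(n_copies, freq):
    """P(genotype | population), treating the genotype as Binomial(2, freq)."""
    return math.comb(2, n_copies) * freq ** n_copies * (1 - freq) ** (2 - n_copies)

def estimate_proportions(genotypes, freqs, n_iter=200):
    """EM estimate of the fraction of alleles drawn from each population.

    genotypes: allele counts (0, 1 or 2) at each SNP
    freqs:     per-SNP tuples of allele frequencies, one per population
               (assumed to lie strictly between 0 and 1)
    """
    k = len(freqs[0])
    q = [1.0 / k] * k  # start from equal proportions
    for _ in range(n_iter):
        expected = [0.0] * k
        for g, p in zip(genotypes, freqs):
            # Allele frequency implied by the current mixture.
            f = sum(qj * pj for qj, pj in zip(q, p))
            for j in range(k):
                # Expected number of the two allele copies assigned to population j.
                expected[j] += g * q[j] * p[j] / f
                expected[j] += (2 - g) * q[j] * (1 - p[j]) / (1 - f)
        q = [e / (2 * len(genotypes)) for e in expected]
    return q

# The example from the text: genotype CC, C allele at 20% versus 60%.
ratio = genotype_likelihood(2, 0.6) / genotype_likelihood(2, 0.2)
print(f"CC is about {ratio:.0f} times more likely under the 60% population")

# Hypothetical toy data: 50 SNPs, all CC, with the same two frequencies at each.
q = estimate_proportions([2] * 50, [(0.2, 0.6)] * 50)
print([round(x, 3) for x in q])
```

Under these assumptions a single CC genotype is roughly nine times more likely in the second population (0.36 versus 0.04), and an individual who is CC at every such SNP is assigned almost entirely to it. Real data are much noisier, which is why the confidence intervals matter.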
It’s this estimate that put me at 20% Ashkenazi ancestry. The confidence intervals for this estimate overlapped zero, indicating that there wasn’t enough data to make any confident claims about this, but it was certainly suggestive (and note that my GNZ colleague Vincent Plagnol was also predicted to have a sizeable amount of Ashkenazi ancestry).
But what does this estimate actually mean? It doesn’t really mean that I’m predicted to have 20% Ashkenazi ancestry. More precisely, it means that I carry a subset of alleles that are relatively rare in northwestern and southeastern Europe, but relatively common in the Ashkenazi Jewish population. The leap from this (undoubtedly true) statement to a statement about ancestry makes an extremely important modeling assumption: namely, that these three populations (northwest European, southeast European, and Ashkenazi Jewish) are the only three possibilities for my ancestry. This sort of assumption is implicit in every ancestry test available, and though this is not news (Dienekes gave this as a potential explanation for the results himself), it’s important to make explicit.
What other populations could I be partially descended from? Well, how about southwest Europe? (OK, this isn’t just speculation: I know one of my grandparents is of Italian descent, and it’s known that southern European populations look a bit like the Ashkenazi population in genetic terms [2].) To test this, let’s look at some data.
Visualizing population structure with Genomes Unzipped data
To explore how the GNZ participants relate to European populations, I combined a few data sources: the European samples from the Human Genome Diversity Panel [3], the 12 GNZ individuals, and a set of Ashkenazi Jewish individuals [4], all genotyped on Illumina arrays. (Combining data sets raises the possibility of batch effects, but I saw no major problems in these data.)
To visualize the relationships between these individuals, I used principal components analysis, as implemented in the program smartpca [5]. (When applied to genotype data [6], this method is a nice way to assess the average genetic relationships between individuals and populations [7,8].)
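For readers who want to see what the method does mechanically, here is a toy sketch. It is not smartpca and does not use the data described above. It simulates two hypothetical populations with made-up allele frequencies (0.2 and 0.8 at every SNP, far more differentiated than any pair of European populations), centers and scales each SNP in the spirit of Patterson et al., and finds the first principal component by power iteration in plain Python. All names and parameter values are assumptions chosen for illustration.

```python
import math
import random

def simulate_genotypes(n_per_pop, freqs_by_pop, rng):
    """Draw allele counts (0/1/2) for n_per_pop individuals from each population."""
    rows = []
    for freqs in freqs_by_pop:
        for _ in range(n_per_pop):
            rows.append([(rng.random() < p) + (rng.random() < p) for p in freqs])
    return rows

def normalize(genotypes):
    """Center each SNP at its mean and scale by sqrt(p(1-p)); drop monomorphic SNPs."""
    n = len(genotypes)
    columns = []
    for snp in zip(*genotypes):
        p = sum(snp) / (2 * n)  # sample allele frequency
        if 0 < p < 1:
            scale = math.sqrt(p * (1 - p))
            columns.append([(g - 2 * p) / scale for g in snp])
    return [list(row) for row in zip(*columns)]

def first_pc(matrix, rng, n_iter=100):
    """Leading eigenvector of the individual-by-individual matrix X X^T."""
    n = len(matrix)
    gram = [[sum(a * b for a, b in zip(matrix[i], matrix[j])) for j in range(n)]
            for i in range(n)]
    # Random start: a constant vector would be annihilated by the centering.
    v = [rng.gauss(0, 1) for _ in range(n)]
    for _ in range(n_iter):
        w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

rng = random.Random(0)
n_snps = 200
genos = simulate_genotypes(10, [[0.2] * n_snps, [0.8] * n_snps], rng)
pc1 = first_pc(normalize(genos), rng)
print([round(x, 2) for x in pc1])
```

In this toy setting the first ten individuals get coordinates of one sign on the first component and the last ten get the other sign, which is the same kind of separation the plots show on a much subtler scale. For real data one would use smartpca or a linear algebra library rather than hand-rolled power iteration.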
In the plot below, each point is an individual, positioned on the first and second most important axes of genetic variation in this sample. The Ashkenazi population is in blue, the HGDP populations are in the other colors, and the GNZ individuals are in black. I’ve labeled myself, Vincent, and Dan Vorhaus in red. As you can see, the majority of GNZ participants cluster together between the French and the Orcadians (from Scotland). Dan, Vincent, and I are all somewhat outside this cluster: Dan with the Ashkenazi population, Vincent with the French, and me with the French on component 1 and the Italians on component 2.
We can then look at additional components of variation. The next plots show the second versus third components, followed by the third versus the fourth. This latter one is potentially telling: the fourth axis of variation separates the Ashkenazi population (including Dan, who I’m using as a positive control) from the rest of Europe. Neither Vincent nor I appear to have any detectable weight on this component.
The above analysis suggests that I do not, in fact, have any Ashkenazi Jewish ancestry (and neither does Vincent), but rather that the initial results were due to “hidden” south European ancestry. (I tested this a bit more, and things seem to hold up [9]; a separate analysis performed over at the Eurogenes blog came to a similar conclusion.)
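The rs847851 frequencies quoted in the footnote give a feel for how this can happen. The sketch below is back-of-the-envelope arithmetic, not a formal analysis, and it rests on several assumptions: that one Italian grandparent contributes 25% of the genome, that the approximate frequencies in the footnote are exact, and that the southeast European group can be ignored. The function name is hypothetical.

```python
def apparent_fraction(true_fraction, p_background, p_modelled, p_unmodelled):
    """Share of the modelled population needed to reproduce the expected allele
    frequency of someone with true_fraction ancestry from an unmodelled population."""
    expected = (1 - true_fraction) * p_background + true_fraction * p_unmodelled
    return (expected - p_background) / (p_modelled - p_background)

# A allele frequencies at rs847851, as quoted in the footnote.
nw_european, ashkenazi, italian = 0.25, 0.50, 0.46

# Assumption: one Italian grandparent, so 25% Italian ancestry on average.
share = apparent_fraction(0.25, nw_european, ashkenazi, italian)
print(f"apparent Ashkenazi share at this SNP: {share:.2f}")

# Likelihood of the AA genotype under each frequency (two independent A alleles).
for name, p in [("NW European", nw_european), ("Ashkenazi", ashkenazi), ("Italian", italian)]:
    print(name, round(p * p, 4))
```

At this SNP, a model that only knows about the northwest European and Ashkenazi frequencies would need about 21% Ashkenazi ancestry to explain what 25% Italian ancestry produces. Likewise, the AA genotype is four times as likely under the Ashkenazi frequency as under the northwest European one, but only about 1.2 times as likely as under the Italian one. This SNP was chosen because it gave strong evidence, so it is not representative of the genome as a whole.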
This is far from a fully rigorous treatment, of course. The analysis above averages information across the genome; a comprehensive analysis would segment my genome into parts descended from different populations (as done, for example, in HAPMIX [10]). At the moment, it’s unclear how well this type of method applied to current data would perform in distinguishing segments from closely related populations.
That said, I’ve satisfied my curiosity: based on my knowledge of family history and the above PCA plots, I’m convinced I have a bit of south European ancestry in a genome of largely northwest European background.
[1] Price et al. (2008) Discerning the Ancestry of European Americans in Genetic Association Studies. PLoS Genetics. doi:10.1371/journal.pgen.0030236
[2] Seldin et al. (2006) European Population Substructure: Clustering of Northern and Southern Populations. PLoS Genetics. doi:10.1371/journal.pgen.0020143
[3] Li et al. (2008) Worldwide Human Relationships Inferred from Genome-Wide Patterns of Variation. Science. doi:10.1126/science.1153717
[4] Behar et al. (2010) The genome-wide structure of the Jewish people. Nature. doi:10.1038/nature09103
[5] Price et al. (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics. doi:10.1038/ng1847
[6] Patterson et al. (2006) Population Structure and Eigenanalysis. PLoS Genetics. doi:10.1371/journal.pgen.0020190
[7] McVean (2009) A genealogical interpretation of principal components analysis. PLoS Genetics. doi:10.1371/journal.pgen.1000686
[8] Novembre and Stephens (2008) Interpreting principal component analyses of spatial population genetic variation. Nature Genetics. doi:10.1038/ng.139
[9] I identified the SNPs providing the most evidence for Jewish ancestry in my genome in Dienekes’s analysis and found their allele frequencies in the Italian population. Though I didn’t perform any formal analysis, these SNPs tended to have similar frequencies in the Italian and Ashkenazi populations. For example, my genotype at rs847851 is AA, and the A allele is at ~25% in northwest Europe and ~50% in the Ashkenazi. It’s also at 46% in Italy.
[10] Price et al. (2009) Sensitive Detection of Chromosomal Segments of Distinct Ancestry in Admixed Populations. PLoS Genetics. doi:10.1371/journal.pgen.1000519