Over the course of the past year or so, I’ve been working (with Jonathan Pritchard) on a statistical method for learning about the history of a set of populations from genetic data. Much of this work is described in a paper we recently made available as a preprint (Pickrell and Pritchard 2012). However, as many readers will know, writing a paper involves deciding which results are important to the main point (and worth fleshing out in detail), and which aren’t. In this post, I’m going to describe some results and thoughts that didn’t quite make the cut, but which I think merit a small note. In particular, I’m going to discuss how having a demographic model for a large number of populations might be used to identify genes important in adaptation, and describe results from humans and dogs.
Imagine you have genome-wide genetic data (from SNP arrays, genome sequencing, or whatever) from a number of populations in a species. A common way to visualize the relationships among your populations is to use a tree. For example, below I’ve built a tree of the 53 human populations from the Human Genome Diversity Panel (using the data from Li et al. 2008).
Of course, populations within a species don’t just split; they also mix via gene flow. These types of events are not modeled when forcing populations into a tree. Below, I’m showing a heatmap that depicts how well each pair of human populations fits the above tree. The dark greens, blues, and blacks represent pairs of populations that are, in some sense, too far away from each other in the tree. These populations are potential candidates for admixture events (indeed, you can see known admixed populations like the Mozabite jump out from a plot like this). This is the sort of signal we focus on in our paper.
While populations that don’t fit a tree well are candidates for gene flow, what about individual SNPs that don’t fit the tree? These SNPs are ones that have changed frequency in ways that are surprising given the demographic history of the populations. A plausible hypothesis, then, is that they (or linked variation) have been the target of natural selection.
To explore this possibility, I used the human data from Li et al. (2008) and dog data (from 82 dog breeds) from vonHoldt et al. (2010). I first built trees of the populations in each species. The human tree is the one shown above, and the dog tree is the one from our paper. I then applied a simple metric that measures how well the allele frequencies at any given SNP match the tree (described in the methods note at the end of this post). The “interesting” SNPs are those with the worst fit to the tree. Below, I’m showing the 10 most “interesting” SNPs from the dog data; I report their chromosomal position, the nearest gene, and the phenotype influenced by variation in this region (if one is known). I made no attempt to group together SNPs that tag the same signal.
The massive selection pressures imposed on dogs by human breeders are apparent from this analysis. As in a similar analysis by Boyko et al. (2010), we observe that the most outlying SNPs are already known to influence things like body size and shape and coat color.
Now let’s look at the top 10 SNPs from the human data (links on each SNP go to maps showing their worldwide distribution):
In humans, it appears much less is known about the selective pressures (assuming these outlier SNPs have indeed experienced selection). We see two of the well-established selected genes (SLC24A5 and EDAR) at the top of the list, but the remainder have no known phenotype (though I assume many of these have shown up in other scans for selection). It is plausible that these genes play important roles in the phenotypic differences between human populations.
An approach like the one described above seems promising for quickly identifying SNPs that show extreme differences in allele frequency (and thus have potentially been the targets of natural selection) in a large set of populations. This approach is somewhat more model-based than Fst, and somewhat less model-based than Bayesenv (Coop et al. 2010), and thus may be useful in some settings.
Pickrell and Pritchard (2012) Inference of population splits and mixtures from genome-wide allele frequency data. hdl:10101/npre.2012.6956.1
Li et al. (2008) Worldwide Human Relationships Inferred from Genome-Wide Patterns of Variation. doi:10.1126/science.1153717
vonHoldt et al. (2010) Genome-wide SNP and haplotype analyses reveal a rich history underlying dog domestication. doi:10.1038/nature08837
Note on the fit metric: The tree predicts a variance/covariance matrix of allele frequencies (this is W in the notation of Pickrell and Pritchard 2012). For any given SNP, I compute the sample variance/covariance matrix (let’s call this V) and compare its entries to those of W. Because the overall scale of V differs from SNP to SNP, I first find the scaling factor that makes W match V as closely as possible; i.e., I find the scalar x that minimizes the sum of squared differences between the entries of V and xW. The remaining sum of squared differences is a measure of the “badness of fit” of the SNP to the tree. Obviously there are a number of complications to the interpretation of this number (e.g., it will be larger for SNPs with a larger x, and I make no attempt at accounting for the correlation between different entries of the matrix).
Boyko et al. (2010) A Simple Genetic Architecture Underlies Morphological Variation in Dogs. doi:10.1371/journal.pbio.1000451
Coop et al. (2010) Using Environmental Correlations to Identify Loci Underlying Local Adaptation. doi:10.1534/genetics.110.114819
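For readers who want to experiment with the fit metric described in the note above, here is a minimal Python sketch of it. It is not the code used for the results in this post. The function names (`snp_badness_of_fit`, `rank_snps`) are made up for illustration, and I am assuming that the per-SNP matrix V is the outer product of the mean-centered population allele frequencies, with no correction for sample size. The scaling factor has a closed form: setting the derivative of the sum of squares with respect to x to zero gives x = sum(V * W) / sum(W * W), where the products are taken entry by entry.

```python
import numpy as np


def snp_badness_of_fit(freqs, W):
    """Return (x, residual) for one SNP.

    freqs : allele frequencies of the SNP in each of m populations.
    W     : m x m covariance matrix predicted by the tree.

    Assumption (not from the post): V is the outer product of the
    mean-centered frequencies, with no sample-size correction.
    """
    freqs = np.asarray(freqs, dtype=float)
    W = np.asarray(W, dtype=float)

    # Per-SNP sample covariance matrix V.
    centered = freqs - freqs.mean()
    V = np.outer(centered, centered)

    # Scalar x that minimizes sum((V - x * W) ** 2).
    x = np.sum(V * W) / np.sum(W * W)

    # Remaining sum of squared differences: the "badness of fit".
    residual = np.sum((V - x * W) ** 2)
    return x, residual


def rank_snps(freq_matrix, W, top=10):
    """Return indices and scores of the `top` worst-fitting SNPs.

    freq_matrix : array with one row per SNP, one column per population.
    """
    scores = np.array([snp_badness_of_fit(row, W)[1] for row in freq_matrix])
    order = np.argsort(scores)[::-1][:top]  # largest residual first
    return order, scores[order]
```

Calling `rank_snps(freq_matrix, W, top=10)` on a SNP-by-population frequency matrix returns the row indices of the ten worst-fitting SNPs together with their scores. The caveats from the note still apply: the residual grows with x, and correlations between entries of the matrix are ignored.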