The genome scans currently offered by major personal genomics companies provide information about only one kind of genetic variation: single nucleotide polymorphisms, or SNPs. However, SNPs are just one end of a size spectrum of variation, reaching all the way up to large duplications or deletions of DNA known as copy number variants (CNVs). Over the last decade we have learned that CNVs are a surprisingly common form of variation in humans, and they span a formidable chunk of the genome. While there are about 3M-3.5M bases of variation due to SNPs within an individual genome (in, say, a typical person of European descent), there are at least 50-60M variable bases due to CNVs.
For the personal genome enthusiast with their SNP chip data from 23andMe or deCODEme in hand, there are two important practical questions: (1) can I learn about my CNVs using SNP chip data; and (2) will that information be useful?
In this post I will discuss ways to squeeze information about CNVs out of current SNP chip datasets. However, I will also argue that for the purpose of cataloging one’s own genetic variants, and especially for the purpose of understanding the complete functional consequences of one’s own genome sequence, it may well be worth waiting for whole genome sequencing.
What is a CNV?
CNVs are gains or losses of contiguous DNA sequence that can be identified by comparing multiple genomes. Strictly speaking, CNVs can range in size from 1bp to over 1Mb, although for historical reasons some people qualitatively divide this size spectrum into small “indels” (< 50 or 100bp) and large CNVs (everything else). The vast majority of CNVs are small.
Detecting CNVs directly from SNP chip intensity data
CNVs can be directly identified using SNP chip data by quantifying the amount of DNA hybridizing to each SNP probe: people with more copies of a particular region will have more DNA binding to that location on the chip than people with fewer copies. This can be measured by looking at the intensity of the signal at that location on the chip.
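To make the idea concrete, here is a minimal sketch of intensity-based calling. It assumes you have a per-probe intensity for your sample and a matching reference intensity (for example, an average over many other people), both ordered along one chromosome. The function names, the 0.3 threshold, the three-probe minimum and the toy numbers are all illustrative assumptions, not what any real CNV caller uses; real tools rely on more sophisticated statistical models to cope with noisy data.

```python
import math

def log2_ratios(sample, reference):
    """Per-probe log2(sample intensity / reference intensity)."""
    return [math.log2(s / r) for s, r in zip(sample, reference)]

def call_segments(ratios, threshold=0.3, min_probes=3):
    """Find runs of consecutive probes that all look like a loss or a gain.

    Returns (start_index, end_index_exclusive, kind) tuples.
    threshold and min_probes are illustrative assumptions.
    """
    def state(x):
        if x <= -threshold:
            return "loss"
        if x >= threshold:
            return "gain"
        return None

    segments = []
    start = 0
    for i in range(1, len(ratios) + 1):
        # Close the current run at the end of the data or when the state changes.
        if i == len(ratios) or state(ratios[i]) != state(ratios[start]):
            kind = state(ratios[start])
            if kind is not None and i - start >= min_probes:
                segments.append((start, i, kind))
            start = i
    return segments

# Toy data (made up): probes 2-5 show half the reference signal,
# as expected for a heterozygous deletion (log2 ratio of -1).
reference = [1000.0] * 10
sample = [1000.0, 1000.0, 500.0, 500.0, 500.0, 500.0,
          1000.0, 1000.0, 1500.0, 1000.0]

print(call_segments(log2_ratios(sample, reference)))
# [(2, 6, 'loss')]
```

The single high probe at index 8 is ignored because one probe on its own is more likely to be noise than a real gain, which is why the sketch requires several consecutive probes to agree.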
However, most CNVs are small, and the density of probes on SNP arrays is low enough that the majority of CNVs will not actually contain a probe. More importantly, personal genomics companies don’t currently provide customers with the raw data required to estimate intensity; unless this changes, customers won’t be able to use this approach on their own genome scan data.
Still, let’s say we were able to access intensity data for our genome scans – what could we find?
Our current best estimate is that there are 800,000 CNVs >= 1bp in a single genome. This number scales down to approximately 2700 when considering events >1 kb, which is the lower end of the size spectrum possible to detect with the SNP arrays used by DTC companies. The number of CNVs that are actually detected from SNP chip data will depend on the algorithm used and the quality of the experiment.
With a typical experiment and conservative analysis I would expect an average of 70-90 CNVs to be detected on Illumina 1M, and 20-40 CNVs with Illumina HumanHap-550, two of the platforms used for personal genetics (by deCODEme and 23andMe, respectively). The probes for both of these platforms were developed prior to the generation of many of the new high-resolution CNV maps, and newer Illumina SNP chips should have much better coverage of large, common CNVs.
Using your family
There is another, indirect way that CNVs can be detected from SNP chip data without any access to the raw intensity files: by tracing the patterns of inheritance of particular SNPs within your family, and looking for places where that pattern is inconsistent with normal expectations. Departures from “Mendelian” inheritance can provide clues about a CNV lurking in that region of your genome.
The concept behind the method is simple: SNP genotyping algorithms that are naïve to CNVs often mis-call a person who is heterozygous for a deletion as homozygous for the nucleotide that is present. What this means is that when a deletion is transmitted from parent to child, the SNP genotypes that are called at that position can give the impression that the deletion-bearing parent hasn’t transmitted any genetic material at all! This would be the case when the child inherits a base from the undeleted parent that is not present in the deleted parent. Of course, this could happen due to plain old genotyping error, so such incompatibilities need to be unusually clustered on a chromosome in order for us to be statistically confident that there is a deletion present.
Curiously, the power of this method depends on SNP density, so families from the populations with the greatest diversity will have the best chance of finding a deletion this way. This type of analysis was first done genome-wide in 2005, when two groups used the 1 million SNPs from HapMap I to identify about 11 deletions/trio in a population of European ancestry and 20 deletions/trio in a population from Nigeria. The false discovery rate was empirically estimated to be 14% (these numbers are from the Conrad et al. version). Based on these results I would expect the numbers to scale to around 5-10 deletions discovered per trio using a 550K SNP chip.
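The trio logic described above can be sketched in a few lines. This is a simplified illustration, not the method used in the published analyses: it assumes genotypes are two-letter strings such as "AG", that no-calls have already been filtered out, and that SNPs are sorted by position on a single chromosome. The function names, the 50 kb gap and the minimum of two inconsistencies are hypothetical choices. A heterozygous call in the child ends a candidate region, because a child carrying a deletion there should look homozygous.

```python
def mendel_consistent(child, father, mother):
    """True if the child's two alleles can be explained by one from each parent."""
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)

def deletion_candidates(snps, min_errors=2, max_gap=50_000):
    """Cluster Mendelian inconsistencies that sit close together.

    snps: list of (position, child, father, mother), sorted by position.
    Returns (first_position, last_position, n_inconsistent) per cluster.
    min_errors and max_gap are illustrative assumptions.
    """
    candidates = []
    run = []  # positions of inconsistent SNPs in the current cluster

    def close_run():
        if len(run) >= min_errors:
            candidates.append((run[0], run[-1], len(run)))
        run.clear()

    for pos, child, father, mother in snps:
        het_child = child[0] != child[1]
        # A heterozygous child, or a long gap, ends the current cluster.
        if het_child or (run and pos - run[-1] > max_gap):
            close_run()
        if not het_child and not mendel_consistent(child, father, mother):
            run.append(pos)
    close_run()
    return candidates

# Toy trio (made up). The father appears to pass on nothing at three nearby SNPs.
snps = [
    (1000,   "AG", "AG", "AA"),
    (2000,   "GG", "AA", "GG"),  # inconsistent
    (3000,   "CC", "CC", "CC"),
    (4000,   "TT", "CC", "TT"),  # inconsistent
    (5000,   "AA", "GG", "AA"),  # inconsistent
    (6000,   "AG", "AA", "GG"),
    (900000, "CC", "TT", "CC"),  # isolated: more likely a genotyping error
]

print(deletion_candidates(snps))
# [(2000, 5000, 3)]
```

Requiring several nearby inconsistencies is what separates a plausible deletion from the scattered genotyping errors that every chip produces.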
Indirect detection of CNVs via imputation
Many common CNVs can be assayed indirectly – or “imputed” – using your SNP genotypes. Publicly available resources make it possible to define a set of nearby SNPs that are strongly associated with a particular CNV. Then, using freely available software, one can impute (make a statistical best-guess estimate of) your CNV genotypes based on your own SNP data.
This is a statistical exercise, so a probability will be assigned to each genotype, but in many cases SNP data are informative enough to impute CNVs with high accuracy. In a recent study of Craig Venter’s genome, the authors concluded that as much as 75% of the SVs detected in his genome could have been imputed from public datasets. This is the first analysis of this nature, so the numbers may fluctuate, but I suspect that we will be able to impute common CNVs with broadly the same accuracy as common SNPs.
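As a toy illustration of what "a probability will be assigned to each genotype" means, the sketch below imputes a CNV from a single tag SNP by counting how often each copy number co-occurs with each SNP genotype in a reference panel. This is a deliberate simplification: real imputation software models many SNPs jointly along haplotypes rather than one SNP at a time. The function names and the panel counts are invented for the example.

```python
from collections import Counter, defaultdict

def fit_tag_model(panel):
    """Estimate P(copy number | SNP genotype) from reference individuals.

    panel: iterable of (snp_genotype, cnv_copy_number) pairs.
    """
    counts = defaultdict(Counter)
    for snp, copy_number in panel:
        counts[snp][copy_number] += 1
    return {
        snp: {cn: n / sum(c.values()) for cn, n in c.items()}
        for snp, c in counts.items()
    }

def impute(model, snp_genotype):
    """Return (most likely copy number, probability of each copy number)."""
    probs = model.get(snp_genotype)
    if probs is None:
        return None, {}  # genotype never seen in the panel
    best = max(probs, key=probs.get)
    return best, probs

# Made-up reference panel: the G allele tends to travel with a deletion.
panel = (
    [("AA", 2)] * 8 + [("AA", 1)] * 2 +
    [("AG", 1)] * 9 + [("AG", 2)] * 1 +
    [("GG", 0)] * 5
)

model = fit_tag_model(panel)
best, probs = impute(model, "AG")
print(best, probs)
```

In this invented panel, someone with genotype "AG" would be assigned one copy with probability 0.9. The closer those probabilities are to 0 or 1, the more informative the tag SNP is about the CNV.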
In the short term imputation is probably the best way to assay common CNVs in a single genome, and it is the preferred approach currently taken by most researchers performing genome-wide association studies of common traits and diseases using SNP chips: allow the direct CNV genotyping to be done by specialists in a shared resource like the HapMap samples, and then impute.
Importantly, this approach currently only works for CNVs that are present at a reasonable frequency in the population (i.e. >5%), and will not allow one to access rare CNVs or – even more interestingly – events that have occurred uniquely in your own genome rather than being inherited from your parents, so-called “de novo” CNVs. In individuals of European descent, we estimate that there are ~5,000 CNVs >1kb that are common enough to be potentially imputed with today’s resources. The soon-to-be-published 1000 Genomes pilot project has generated genotypes on over 500,000 smaller CNVs (i.e. indels), many of which will now be imputable.
Do CNVs matter?
Contrary to the intuition that large polymorphisms should have large effects on traits, the impact of common CNVs on common traits studied thus far appears to be surprisingly small. Fewer than 20 common CNVs have been directly associated with a common disease or non-disease trait in a standard GWAS. The number of common CNVs that are candidates to explain known trait-associated SNPs is not much more impressive: in an analysis of over 1500 trait-associated SNPs reported in the NHGRI GWAS database, fewer than 5% were found to be on the same genetic background as a large, common CNV.
I don’t interpret this result as evidence that large, common CNVs are not functional, but it is evidence that common CNVs are not involved in the traits that have been prioritized for genetic analysis with GWAS. Interestingly, the function of genes that tend to be included in CNVs is highly non-random, and thus one may be able to create good hypotheses regarding what traits are mediated by common CNVs!
Just as with rare SNPs, it seems likely that rare and unique CNVs will be more informative about disease risk.
Variation discovery without sequencing
Let’s summarize the current state of affairs. A CNV analysis with SNPs alone is likely to yield two things: (a) a set of genotypes for known, common CNVs (via imputation) that are mostly uninformative about one’s biology, and, if one has access to family data, (b) the identification of a very small number of deletions (<20) made without regard to their frequency or genomic location. If I had my parents’ data to hand, I would be excited to try (b), as it allows one to discover personal genomic variants, something otherwise impossible to do with SNP array data. It is possible to do a crude version of the parent-offspring trio approach described above just using Excel. For those who do find something interesting, be it a large and/or unreported deletion, it will give a taste of what it is like to make a scientific discovery. For those who come away empty-handed, don’t fret… that is also an authentic scientific experience!