In a previous post I discussed copy number variation, a form of genetic variation not broadly reported by DTC companies. In today’s post I provide a very simple program that allows one to identify potential deletions on the basis of high density SNP genotypes from a parent-offspring trio, and report on the results of running this program on data from my own family.
The program uses an approach that I applied as a graduate student to mine deletions from the very first release of data from the International HapMap Project in 2004. The idea, explained in my last post, is to look for stretches of homozygous genotypes interspersed with Mendelian errors, which might indicate the transmission of a large deletion. To be clear, this is a simple analysis that most programmers and computational biologists would find straightforward to implement. It is probably a good practice problem for graduate students and would-be DIY personal genomicists.
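To make the idea concrete, here is a minimal Python sketch of that kind of scan. It is not the program I posted, just an illustration under some assumptions: genotypes are two-letter strings such as "AA" or "AG" with "--" for a no-call, the SNPs passed in come from one chromosome and are sorted by position, and the names `Snp`, `find_candidate_deletions` and the `min_errors` threshold are all hypothetical. For each parent it collects runs of SNPs where both the child and that parent look homozygous, and counts the SNPs in the run where the two carry different alleles (the Mendelian errors).

```python
from typing import NamedTuple


class Snp(NamedTuple):
    rsid: str
    pos: int      # base-pair position on the chromosome
    child: str    # genotype such as "AA" or "AG"; "--" means no-call
    father: str
    mother: str


def is_no_call(gt: str) -> bool:
    return "-" in gt


def is_hom(gt: str) -> bool:
    """True for an apparently homozygous call such as "AA"."""
    return len(gt) == 2 and gt[0] == gt[1] and gt[0] in "ACGT"


def find_candidate_deletions(snps, min_errors=2):
    """Scan one chromosome (SNPs sorted by position) for candidate deletions.

    Returns tuples of (parent, first_pos, last_pos, n_snps, n_errors), where
    the positions are the first and last SNP inside the run.
    """
    calls = []
    for parent in ("father", "mother"):
        run, errors = [], 0
        for snp in list(snps) + [None]:  # trailing None flushes the last run
            if snp is not None:
                child_gt, parent_gt = snp.child, getattr(snp, parent)
                if is_no_call(child_gt) or is_no_call(parent_gt):
                    continue  # assumption: no-calls neither extend nor break a run
                if is_hom(child_gt) and is_hom(parent_gt):
                    run.append(snp)
                    if child_gt != parent_gt:
                        errors += 1  # e.g. parent "GG", child "AA"
                    continue
            # A heterozygous call (or the end of the data) closes the run.
            if errors >= min_errors:
                calls.append((parent, run[0].pos, run[-1].pos, len(run), errors))
            run, errors = [], 0
    return calls
```

You would call `find_candidate_deletions` once per chromosome. Note that the sketch reports the first and last SNP inside each run, so the true boundaries of a deletion lie somewhere between those SNPs and the nearest flanking SNPs that break the run.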
I obtained 23andMe data from both my mom and dad, and, with their consent, ran the three of us through the program. I was mildly surprised to find only two potential deletions; I had previously speculated that one would find 5-10 deletions per trio with the 550K platform used by 23andMe.
These two deletions were dramatically different in size and level of support. The first, in a non-coding region on chromosome 7, appears to be about 50kb in size (the first SNPs outside of the deletion on either end are rs10228390 at 8788633 and rs917038 at 8839252). It contains 18 SNPs compatible with a deletion transmission, and 9 of these SNPs (a full 50%) are “Mendelian errors” (meaning that they are not consistent with the pattern of transmission you would expect to see from parent to child). Furthermore, one can confirm that this deletion is common and has been previously identified by searching the Database of Genomic Variants.
The second deletion is on chromosome 15, between 31505941 bp and 31519908 bp, falling within an intron of the gene RYR3, a brain ryanodine receptor. This putative deletion is no more than 14kb in size and spans only 3 SNPs, 2 of which are Mendelian errors. Again looking at DGV, there are CNVs that have been reported to span this region; however, they are much, much larger (>150kb), reflecting the lower-resolution approaches used in those studies. No variant has been reported at this locus with a high-resolution technology (like next-generation sequencing or high-res oligo arrays), so it is difficult to say if this is a known variant or not.
There are two principal explanations for the low yield of deletions from these data. One is statistical sampling: the number of large deletions in my family is just on the low end of the spectrum of all the families out there. The second, and more appealing to my intuition, is that in the quest for extremely high quality SNP genotypes, 23andMe has (appropriately) replaced many of the SNP genotypes in regions of deletions with missing data.
To test this latter hypothesis, I asked whether there is evidence, in my own data, for an enrichment of missing data in regions of known, common deletions. I was impressed to see that, across the entire set of SNPs, I had only 1453 missing data points, a call rate of 99.75%. I made a list of 1,008 deletions that were recently reported to have a population frequency greater than 5% in Europeans. Out of 975 SNPs that fell within the boundaries of these common deletions, 84 (8.6%) were “no calls”, a massive enrichment over the proportion for the total dataset (0.25%).
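For anyone who wants to check the arithmetic on their own data, here is a small Python sketch of the comparison, using the numbers above. The function name `no_call_enrichment` is made up for illustration, and the binomial tail probability rests on an assumption that is certainly not quite true: that each SNP is an independent draw at the genome-wide no-call rate. SNPs inside the same deletion will tend to fail together, so treat the tail probability as a rough indication rather than a proper test.

```python
import math


def no_call_enrichment(n_snps, n_no_calls, background_rate):
    """Compare the no-call rate inside known deletions with a background rate.

    Returns (observed_rate, fold_enrichment, expected_no_calls, tail_prob),
    where tail_prob is P(X >= n_no_calls) for X ~ Binomial(n_snps, background_rate).
    Assumes 0 < background_rate < 1 and independent SNPs.
    """
    observed_rate = n_no_calls / n_snps
    fold = observed_rate / background_rate
    expected = n_snps * background_rate

    # Sum the upper binomial tail in log space to avoid overflow in the
    # binomial coefficients.
    log_p = math.log(background_rate)
    log_q = math.log1p(-background_rate)
    tail = 0.0
    for k in range(n_no_calls, n_snps + 1):
        log_pmf = (
            math.lgamma(n_snps + 1)
            - math.lgamma(k + 1)
            - math.lgamma(n_snps - k + 1)
            + k * log_p
            + (n_snps - k) * log_q
        )
        tail += math.exp(log_pmf)
    return observed_rate, fold, expected, tail


# Numbers from my data: 975 SNPs in common deletions, 84 no-calls,
# and a genome-wide no-call rate of 0.25%.
observed_rate, fold, expected, tail = no_call_enrichment(975, 84, 0.0025)
print(f"observed rate {observed_rate:.3f}, fold enrichment {fold:.1f}")
print(f"expected no-calls at the background rate: {expected:.1f}")
print(f"binomial tail probability: {tail:.3g}")
```

At the genome-wide rate one would expect only two or three no-calls among 975 SNPs, so 84 is an enrichment of more than thirty-fold.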
For those of you with data from trios (two parents and a child) who would like to look for deletions in your own genome, I’ve posted some code and directions here. In coming weeks I hope to make the code easier to run, and also add a test for the presence of known, common deletions by looking for excess homozygosity or missing data. Given some reference files that contain the location and frequency of known deletions, as well as allele frequencies for all SNPs in these regions, it should be possible to pull out large stretches of homozygosity that are statistically improbable for a copy-neutral (non-deleted) genome. This is something that can be applied to single genomes (no families required!). Finally, there are now other, more sophisticated approaches for finding deletions that work on SNP genotype data from unrelated individuals, which others may want to explore: