Those of us who live and breathe genomics get very excited about sequencing DNA. Genomes Unzipped will be sure to cover the constant battles between sequencing companies to produce complete and accurate genome sequences for low prices; from our point of view, ‘low prices’ means affordable for consumers, or less than £1000 or so for a full sequence of an individual.
But why do we care about sequencing? You can go to a company like 23andMe and get a genotyping chip done; this won’t give you your full DNA sequence, but it will give you information about half a million sites on your genome, at the much lower cost of around £300. The sites picked for these chips are ones that are most variable in the population, and those that are well-studied. Why do we care about the rest? What more does sequencing give you?
What Do We Miss?
There are lots of types of variation that can occur in the 3 billion bases of the human genome. The simplest type of change, and the one that genotyping chips tend to look at, is the Single Nucleotide Polymorphism (SNP); in these simple mutations a single base is changed (an A changes to a C, for instance). The database dbSNP knows of around 15 million such variations, and any two individuals will differ at around 3 million single sites (more, if you’re from the more variable African populations).
23andMe can probably identify around 250 thousand of these differences (fewer if you’re not of Western European descent). New genotyping technologies, combined with statistical techniques like genotype imputation, can look at several million SNPs; these techniques may allow (more expensive) genotyping to find around half of your 3 million variations. Even so, all-bells-and-whistles genotyping is going to miss most of the data out there.
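To put those proportions side by side, here is a rough back-of-the-envelope sketch in Python. It uses only the approximate figures quoted above; the variable and function names are made up for illustration.

```python
# Approximate figures quoted in the text above.
DIFFS_BETWEEN_TWO_PEOPLE = 3_000_000  # single-base differences between two individuals
CHIP_DETECTED = 250_000               # differences a standard chip can probably identify
IMPUTED_DETECTED = DIFFS_BETWEEN_TWO_PEOPLE // 2  # "around half" with newer chips plus imputation


def fraction_found(found, total=DIFFS_BETWEEN_TWO_PEOPLE):
    """Return the share of single-base differences that a method picks up."""
    return found / total


chip_fraction = fraction_found(CHIP_DETECTED)
imputed_fraction = fraction_found(IMPUTED_DETECTED)

print(f"standard chip:      {chip_fraction:.1%} of single-base differences")
print(f"chip + imputation:  {imputed_fraction:.1%} of single-base differences")
```

On these numbers a standard chip sees roughly one in twelve of your single-base differences, and that is before counting any of the other types of variation.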
Of course, single-base mutations are not the only source of variation; neither are they the most interesting. Other types of variation are even less likely to be covered by genotyping. Each individual will have around 800,000 small insertions or deletions of DNA (called indels), very few of which are well covered by genotyping chips. Then there are the larger, potentially very interesting structural variants: thousands of bases or more that have been deleted, inserted, moved around or inverted. Each individual will have a few thousand of these, and looking at them in the sort of detail required to figure out exactly what change has occurred is virtually impossible with chips.
When you send your DNA for sequencing, you get the chance to see a massive chunk of all of these variations. Craig Venter’s super-high quality genome sequence (costing a crazy £45 million on first generation technology) found basically all variation in his genome, including 3.2 million SNPs, 900,000 indels and a range of other things. When Life Technologies sequenced a single African individual using low-cost second-generation sequencing, they found 3.8 million single-base variations, 230 thousand small indels, 565 large insertions or deletions, 91 inversions and a couple of crazier things, like gene fusions and complex rearrangements. The cost of this sort of analysis is currently massive compared to genotyping, but when you are done, you have captured a big proportion of the variations in your own genome.
(Note the difference between Venter’s 900k indels and Life Tech’s 230k indels; this is because 70% of indels are in repetitive regions of the genome that are hard to sequence using second-generation sequencing. Our hope is that the next batch of technology, third-generation sequencing, will be able to plug this gap. If the whole First/Second/Third-gen stuff doesn’t mean anything to you yet, sit tight; we’ll cover all this in a later post.)
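As a sanity check on that explanation, here is a small Python sketch of the arithmetic. It makes two simplifying assumptions that are not claims from the studies themselves: that the two genomes carry a similar number of indels overall, and that second-generation sequencing misses every indel in a repetitive region. The names are illustrative.

```python
# Figures quoted in the text above.
VENTER_INDELS = 900_000     # indels found in Venter's first-generation sequence
LIFE_TECH_INDELS = 230_000  # small indels found in the second-generation genome
REPETITIVE_PERCENT = 70     # share of indels that sit in repetitive regions

# Assumption: every indel in a repetitive region is missed by second-gen
# sequencing, so only the remaining 30% should be visible.
expected_visible = VENTER_INDELS * (100 - REPETITIVE_PERCENT) // 100

# How far the observed count falls from that rough expectation.
shortfall = expected_visible - LIFE_TECH_INDELS

print(f"expected outside repeats: {expected_visible:,}")
print(f"actually reported:        {LIFE_TECH_INDELS:,}")
print(f"difference:               {shortfall:,}")
```

The rough expectation and the reported count land in the same ballpark, which is about as much as a comparison between two different individuals and two different technologies can show.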
Why Does This Matter?
So we miss a lot of data when we settle for genotyping. But why does this matter? What do we fail to learn from chips that we could learn from sequencing?
The thing to understand about genotyping is that it is ultimately reactive, rather than proactive. You can only look for variants that you have seen before. As a result, you miss variants that are rare, that are population-specific, or that come from understudied populations.
Partly, this just means you are missing a lot of data. If you want to do something like ancestry testing (placing yourself on a genetic map of Europe, pinning down exactly what Y chromosome haplotype you have, etc), the more data the better. But more than that, the variants that are missed are likely to be more population-specific. As more and more individuals are sequenced from many populations, the potential for higher-resolution ancestry tracking appears. However, a 23andMe chip just doesn’t have enough variants on it to make full use of this new data.
The reactive nature of genotyping also means that you may have to do it again whenever new discoveries are made. 23andMe will make sure that they cover all the known risk loci for diseases, but what happens when we find new regions of the genome associated with disease? In that case, you have to hope that 23andMe happens to have them well covered on their chip, and if not, you have to wait until they bring out a new chip that covers the new discoveries. This is especially a problem when the disease-associated variants are rare, as most new disease variants will probably be, or are caused by the sorts of variants not well covered by the chip. However, with sequencing, you have most of your genome there already, so any new discoveries can just be looked up on your genome sequence, without needing to go back to the lab to spend more time and money.
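The difference between the two situations can be sketched as a simple lookup in Python. Everything here is a toy: the positions and genotypes are invented for illustration, and real genotype data is stored in far richer formats than a dictionary.

```python
# Hypothetical toy data: (chromosome, position) -> your genotype at that site.
# A chip only records the sites that were chosen when it was designed.
chip_genotypes = {
    ("chr1", 1000): "AG",
    ("chr2", 5000): "CC",
}

# A sequence records (nearly) every site, including ones nobody cared about yet.
sequenced_genotypes = {
    ("chr1", 1000): "AG",
    ("chr1", 2500): "TT",
    ("chr2", 5000): "CC",
    ("chr3", 42): "GA",
}


def lookup(genotypes, site):
    """Return the genotype at a site, or None if the data never covered it."""
    return genotypes.get(site)


# A made-up site that a new study has just linked to a disease.
new_discovery = ("chr1", 2500)

chip_answer = lookup(chip_genotypes, new_discovery)
sequence_answer = lookup(sequenced_genotypes, new_discovery)

print("chip:", chip_answer or "not on the chip, wait for a new one")
print("sequence:", sequence_answer)
```

With the chip data the new site simply is not there to be looked up, while with the sequence the answer was sitting in your data all along.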
There is a particular type of variation that genotype chips can never get at, the type of variation that most people will find most interesting: variation that is unique to you, or to your family. If you get sequenced now, about 200,000 single-base variants in your genome will never have been seen before, ever. These are likely to include changes that modify proteins in a unique way, that may make them act differently in your cells. A big proportion of indels and structural variants will be novel, and these can include strange and exotic things: genes that have been swapped around, jumbled up, fused together, or deleted entirely. There may well be stretches of DNA, hundreds of base pairs long or longer, that have never been observed in another human. Regardless of how “useful” these personal oddities are, to be able to look directly at new genomic discoveries that live inside you makes them invaluable.
The Future of Personal Genomics
None of this is supposed to be an argument to fork out the (frankly ridiculous) $20k to get your own genome sequenced from something like Illumina’s personal service. Instead, I want to show that those of us who are interested in investigating our own genomes should be keeping close tabs on the sequencing wars. Illumina and Life Technologies are currently battling to bring the materials cost of a human genome into the low thousands of dollars; Complete Genomics is trying to sequence and analyse an entire genome as a service for $5,000; and Pacific Biosciences, Life Technologies and Oxford Nanopore are bringing out new tech that may change the entire genomics field. Rather than just being esoteric events in the research and business communities, these developments will fundamentally determine if and when personal genomics can transition from a simple chip-based industry to a richer sequencing-based one.
The first image is a colourised screenshot of aligned Illumina reads from the program MAQ, the second is an illustration of a large structural re-arrangement taken from the brilliant (and copyright-free) NHGRI Talking Glossary of Genetic Terms.