Readers who don’t have a Science subscription can access a pre-formatted version of the manuscript here. In this post I wanted to give a brief overview of the study and then highlight what I see as some of the interesting messages that emerged from it.
First, some background
This is a project some three years in the making – the idea behind it was first conceived by my Sanger colleague Bryndis Yngvadottir and me back in 2009, and it subsequently expanded into a very productive collaboration with several groups, most notably Mark Gerstein’s group at Yale University, and the HAVANA gene annotation team at the Sanger Institute.
The idea is very simple. We’re interested in loss-of-function (LoF) variants – genetic changes that are predicted to be seriously disruptive to the function of protein-coding genes. These come in many forms, ranging from a single base change that creates a premature stop codon in the middle of a gene, all the way up to massive deletions that remove one or more genes completely. These types of DNA changes have long been of interest to geneticists, because they’re known to play a major role in really serious diseases like cystic fibrosis and muscular dystrophy.
But there’s also another reason that they’re interesting, which is more surprising: every complete human genome sequenced to date, including celebrities like James Watson and Craig Venter, has appeared to carry hundreds of these LoF variants. If those variants were all real, that would indicate a surprising degree of redundancy in the human genome. But the problem is we don’t actually know how many of these variants are real – no-one has ever taken a really careful look at them on a genome-wide scale.
So that’s what we did. Working with the pilot data from the 1000 Genomes Project – whole-genome sequences from 185 individuals with ancestry from Europe, East Asia and West Africa – we made a list of nearly 3,000 predicted LoF variants found in these genomes, and put them through a punishing barrage of experimental validation and computational filtering. In addition, we were fortunate enough to have the HAVANA team on our side, and their annotators (coordinated by Adam Frankish) spent hundreds of man-hours manually checking the annotations underlying about 700 genes containing these predicted variants. At the end of that frankly pretty exhausting process we were left with a grand total of 1,285 variants that we think probably result in genuine loss of function.
So we were left with a core set of “real” LoF variants – now we could start doing some interesting biology.
Firstly, after applying every filter we could think of, we ended up with a fairly stable estimate of the true number of LoF variants per individual – we think each of us (at least those of us of non-African ancestry) carries about 100 of these variants, with 20 of those being present in a homozygous state (i.e. both copies of the affected gene are inactivated). These numbers are perhaps a little higher in Africans, consistent with their overall higher levels of genetic variation.
A relatively small number (253) of these LoF variants were quite common, with at least one individual in the study having both copies knocked out. Some of these are pretty interesting: we found known LoF variants associated with things like blood type, muscle performance and drug metabolism, for instance, along with a vast array of inactivated olfactory receptors (genes involved in smelling particular substances). However, we found little evidence for a role of these common LoF variants in risk for common, complex diseases like type 2 diabetes, despite a frankly heroic effort by GNZ comrade Luke Jostins to track these down (for those in the know: imputing the entire 1000 Genomes data-set into all 17,000 people in the WTCCC1 study – yeah, that isn’t easy). That fits with what we saw when we looked at the affected genes: those hit by common LoF variants tended to be less evolutionarily conserved, had fewer known protein-protein interactions, and were also more likely to have other similar genes in the genome. That suggests that in general these genes are more redundant and less functionally critical: no surprise, given they’re knocked out in a non-trivial fraction of the population.
But overall, we found that genuine LoF variants tend to be extremely rare: the majority of our high-confidence LoF variants were found in less than 2% of the population, with many probably being far rarer than that. This suggests that LoF variants are often mildly or severely deleterious, and have thus been stopped from increasing in frequency by natural selection. That in turn suggests that those rare LoF variants are where most of the action is in terms of effects on disease risk. Sure enough, we found 24 known severe disease-causing mutations in our LoF set, involved in horrible diseases like osteogenesis imperfecta and harlequin ichthyosis, as well as 21 LoF variants in known disease-causing genes – these were all found in only one copy in the affected individuals.
There’s more, including an analysis of the effects of these variants on RNA expression (courtesy in part of GNZ’s own Joe Pickrell), and an algorithm we think might be useful for predicting whether a novel mutation is likely to actually cause disease – but I’m going to skim over those, and instead do my best to hammer home a point about error rates that I think is absolutely critical to appreciate for those currently setting up their own sequencing projects.
The more interesting something is, the less likely it is to be real
We are rapidly entering a world in which sequencing genomes is becoming commonplace. It is not unusual now for a PhD student to contemplate a project in which the complete sequences of the protein-coding genes (the exomes) of dozens of disease patients will be examined. Over the next 12 months tens of thousands of disease patients and healthy controls will have their exomes or complete genomes sequenced.
As the data roll off those projects, researchers will naturally find their eyes drawn to the genetic variants with the largest predicted effects on function – in other words, the ones most likely to be involved in disease. And while the smart researchers will spend time validating these variants and confirming their effects on gene function, others will simply assume that in a genome with 99.5% overall accuracy, it’s highly unlikely that such interesting variants will prove to be false. And in the high-pressure world of human disease genomics, they will be tempted to push their findings into the journals as quickly as possible before someone else beats them to that brand new disease gene.
So here’s the thing: the greater the predicted functional impact of a sequence variant, the more likely it is to be a false positive.
The reason for this will be pretty clear to the Bayesians in the audience (large-effect variants have a very low prior), but can take a while to fully appreciate for those without a natural statistical intuition. This effect occurs because variants with large effects on function are more likely to be harmful, so in general they are weeded out of the population by natural selection. In other words, the genome is highly depleted for variants with large functional effects.
Error, on the other hand, is more or less an equal opportunity annoyance – false positives, due either to DNA sequencing problems or issues with interpretation (e.g. thinking a region is protein-coding, when in fact it isn’t), appear without much regard for their effects on gene function.
So sequence changes with large effects on function are depleted for real variation, but have roughly the same overall error rate as the rest of the genome. As a result, the ratio of false changes to total observed changes – the false positive rate – is far, far higher for these variants.
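The arithmetic behind this argument is easy to sketch. Here's a toy back-of-the-envelope illustration – the numbers are purely made up for the example, not from the study – showing how depletion of real variants in a class, combined with a class-independent error rate, inflates the false positive rate:

```python
# Toy illustration: purifying selection depletes real large-effect
# variants, but sequencing/annotation error hits all site classes
# at roughly the same rate. Numbers are illustrative only.

def false_positive_rate(real_variants, error_calls):
    """Fraction of observed variant calls that are errors."""
    return error_calls / (real_variants + error_calls)

# Typical site class: plenty of real variation, modest error.
typical = false_positive_rate(real_variants=1000, error_calls=10)

# Large-effect (e.g. LoF) class: real variants depleted ~100-fold
# by selection, while the error rate is unchanged.
large_effect = false_positive_rate(real_variants=10, error_calls=10)

print(round(typical, 3))       # about 0.01 - 1% of calls are false
print(round(large_effect, 3))  # 0.5 - half of the calls are false
```

The same absolute number of errors goes from a minor nuisance to half of everything you see, purely because the denominator of real variation has shrunk.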
This was a massive problem for our study, which looked at variants with about as large a predicted impact on function as it’s possible to get. As a result, we spent a very long time running experiments and designing filters, and ended up throwing away over half of the loss-of-function variants in our initial set: these discarded variants were either likely sequencing errors, apparent problems with gene annotation, or variants that simply didn’t look like they had much of an effect on gene function (for instance, gene-truncating variants that only remove the last 1 or 2 percent of a gene).
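To make the last kind of filter concrete, here's a hypothetical sketch of a truncation check – this is not the actual pipeline from the paper, just an illustration of the idea that a premature stop removing only the tail end of a protein probably isn't a genuine loss of function:

```python
def passes_truncation_filter(stop_codon_pos, coding_length, tail_fraction=0.02):
    """Hypothetical filter: keep a premature stop codon as a candidate
    LoF variant only if it removes more than the final `tail_fraction`
    of the coding sequence. Positions are in coding-sequence bases."""
    cutoff = coding_length * (1 - tail_fraction)
    return stop_codon_pos < cutoff

# A stop halfway through a 1,000 bp coding sequence: plausible LoF.
print(passes_truncation_filter(500, 1000))   # True

# A stop 5 bp from the end: removes <2% of the protein, filtered out.
print(passes_truncation_filter(995, 1000))   # False
```

In practice, of course, the real filtering combined many signals – sequencing quality, annotation support, splice-site context – not just position along the transcript.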
I should emphasise that a lot of these errors were the result of working with very early-stage sequencing data – some of the sequencing done for the pilot project is now over four years old. Modern sequencing data and variant-calling algorithms are far more accurate. In addition, as a result of this project and various other efforts gene annotation is now greatly improved – the Gencode annotation we use in the 1000 Genomes Project has been tweaked extensively over the last couple of years through the efforts of the HAVANA team and others. In fact, one of the few consolations of having to wade through a sea of errors was that we were able to fix many of them, so they won’t come up over and over again in future sequencing projects.
Still, while impressive, these are quantitative improvements, and the lesson stays the same: if you’re a PhD student working on large-scale sequencing data, and you find a fascinating mutation in your disease patient, be sure you validate the absolute hell out of that thing before you start drafting your paper to Science. The more fascinating it looks, the more you should disbelieve it – that’s as true in human genomics as it is in any other field of science.
So now we have a catalogue of LoF variants that we’ve stared at for a long time, and we’re pretty sure that most of them are real. What can we actually do with this list? This is a question I’ll be pursuing pretty hard in my new position in Boston.
The first step is to make it bigger and better. The 1000 Genomes Project is rolling on – there are over 1,100 people now completely sequenced, and there will be some 2,500 by the end of the year – and we’re continuing our analysis of LoF variants as the project grows. But we’ve also set our sights on larger cohorts: for instance, the tens of thousands of exomes sequenced over the last couple of years in disease-focused projects. As we improve our filters and start applying them to human sequence data on this scale we expect to pretty rapidly build up a definitive catalogue of LoF variants present at any appreciable frequency in the population.
Then comes the exciting bit: figuring out what effect they might have on human variation and disease risk. The last paragraph of our paper hints at where this project might go:
Finally, we note that our catalogue of validated LoF variants comprises a list of naturally occurring “knock-out” alleles for over 1,000 human protein-coding genes, many of which currently have little or no functional annotation attached to them. Identification and systematic phenotyping of individuals homozygous for these variants could provide valuable insight into the function of many poorly characterized human genes.
Think of this as the Human Knockout Project – an effort to take each gene in the genome and look for someone who has that gene completely knocked out. In some cases those people won’t exist – they will have died in utero. In other cases they will suffer from awful diseases – we already know of many such examples. But there will also be genes where eliminating function has more subtle or even beneficial effects. Surprisingly, we already know of several examples where rare LoF variants actually protect against disease, such as PCSK9 and heart disease, IFIH1 and type 1 diabetes, and CARD9 and Crohn’s disease – and these are very interesting to pharmaceutical companies seeking potential drug targets for these diseases.
We actually still know very little about the normal function of most genes in the human genome. If we can find individuals who are missing those genes, and take a close look at how they differ from the rest of the population, we’ll get clues that can then be followed up in downstream functional experiments.
This is a massively ambitious project, of course. We’ve made a start in this paper, by defining an initial list of LoF variants and coming up with filters that will help us track down more – but to really nail the effects of these variants will require a global, collaborative effort, (at least) hundreds of thousands of participants, and lots of dedicated researchers. Still, it can be done.
A personal note: the benefits of collaboration
I was very fortunate in this project to be able to work closely with some of the sharpest minds in human genomics – and without these collaborations this study would have been impossible. I owe a particularly huge debt of gratitude to Suganthi Balasubramanian from Yale and Adam Frankish from Sanger, who were my comrades in arms on this project from the beginning. The two senior authors on the paper, Mark Gerstein and my supervisor Chris Tyler-Smith, were also instrumental in shaping the project and driving it forward.
However, the author list for this paper is long for a good reason. When we wanted to look at the effects of LoF variants on common disease, I talked to Luke Jostins and Jeff Barrett, who know the area better than anyone else I know; to get a sense of their effects on gene expression, I talked to Joe Pickrell and Stephen Montgomery, who published back-to-back Nature papers on interpreting RNA sequencing data in 2010; to look at indels, I spoke to Zhengdong Zhang, Kees Albers and Gerton Lunter, while for structural variants I went to Klaudia Walter and Bob Handsaker. Others, like Ni Huang and James Morris, contributed heavily to multiple parts of the paper. Each of them brought enthusiasm and tremendous expertise to the project.
Research is never easy, but it’s certainly a lot more pleasant when you get to do it alongside people you like and respect.