Archive for the 'Journal Club' Category

Phantom Heritability: What it does and doesn’t mean

Just out in prepublication at PNAS is a paper from Eric Lander’s lab, entitled, somewhat provocatively, The mystery of missing heritability: Genetic interactions create phantom heritability. The authors suggest that certain types of gene-gene interactions could be causing us to underestimate how much of the heritability of complex traits has been uncovered by our genetic studies to date.

There has been an awful lot of talk about this research since Eric Lander talked about it at ASHG a few years ago, and the paper itself has generated quite a bit of discussion on- and off-line. Razib Khan reported on the paper last week, giving a good summary. He mentioned a press release about the paper issued by the advocacy organisation GeneWatch, which confuses the additive heritability discussed in this paper with the total heritability of diseases (a distinction explained below), and uses this to draw conclusions about how this result alters the promise of personal genomics. This just goes to show how much confusion there already is out there about this subject.

I have a more detailed post up on Genetic Inference about this paper, the strength of the argument, and what it means for the field. Here I am just going to pull out what I think are some important take-home points about this paper:

1) Broad sense heritabilities (the kind that are clinically important for e.g. risk prediction) have NOT been significantly overestimated. The type of heritability we ultimately care about, the broad or total heritability, is how much total phenotypic variation is captured by genetics, or equivalently the correlation between identical twins in uncorrelated environments. The figure at the top of this post shows a plot that I made using Zuk et al’s equations, comparing true broad sense heritabilities against what would be estimated based on twin studies (I have matched the colouring etc to Figure 1 of the paper). The twin study estimator of heritability is a robust estimator of total heritability for heritabilities less than 0.5. Above that, LP (limiting pathway) epistasis causes growing overestimation – it can make a 50% heritable trait look 65% heritable, and a 70% heritable trait look 95% heritable. It does not make weakly heritable traits look strongly heritable, just strongly heritable traits look very strongly heritable.

2) This paper is discussing additive heritability. This is a specific form of heritability that acts “simply” – half of it is passed on to offspring, siblings share an amount proportional to how related they are, and the genes that underlie it do not interact with each other. We do not know how much heritability acts like this, but various lines of evidence have made us think that it is a relatively good model, and most competing models have been incompatible with this evidence, or look contrived. What Zuk et al have done is produce a set of plausible, simple and non-contrived models (Limiting Pathway or LP models) that look pretty much indistinguishable from additivity using many of the tests we have run, but can act very differently in twin studies. Under these models, twin studies will overestimate the additive heritability (i.e. make us think that a larger proportion of heritability acts “simply”). The equivalent plot to the top of the page for estimating additive heritability, which you can see here, shows massive overestimation of additive heritability across the spectrum.

3) There is no real evidence that these LP models apply (and in fact there are still a few reasons to believe additivity could still broadly apply, see my other post for details). The issue is that we cannot conclusively rule these models (or models like these) out, and therefore the heritability explained by the genetic variants we have found so far is very uncertain.

4) This is important because our measures of “heritability explained” by the genetic variants we have found look at how much additive heritability is explained. These measures have in general told us that we have only explained a small proportion (generally < 25%) of additive heritability – but if in fact the heritability is largely not additive, but we are treating it like it is, we could in fact have explained a higher proportion of heritability than we believe. This would mean that the “missing heritability” is missing not because we have not found the right genetic risk factors, but because we have not found the right model to use. This could be good news: the genetic variants we have discovered could in fact be used to predict disease a lot better than we can at the moment, if only we can find the right model to use them with.
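
To make the difference between points 1 and 2 concrete, here is a minimal sketch of the classical twin-study (Falconer) estimator, 2 × (r_MZ − r_DZ). It does not implement Zuk et al’s LP models. It uses the textbook variance-component expectations for twin correlations with no shared environment, and the function names and example variance components are illustrative assumptions of mine, not values from the paper.

```python
def twin_correlations(v_a, v_d=0.0, v_aa=0.0):
    """Expected MZ and DZ twin correlations for a trait with total variance 1.

    v_a  : additive genetic variance
    v_d  : dominance variance
    v_aa : additive-by-additive epistatic variance
    Assumes no shared environment (a simplification made for this sketch).
    """
    r_mz = v_a + v_d + v_aa                        # identical twins share all genetic effects
    r_dz = 0.5 * v_a + 0.25 * v_d + 0.25 * v_aa    # fraternal twins share only part of each
    return r_mz, r_dz


def falconer_estimate(r_mz, r_dz):
    """Classical twin-study heritability estimate: twice the MZ-DZ difference."""
    return 2.0 * (r_mz - r_dz)


# Purely additive trait: the estimator recovers the heritability.
additive = falconer_estimate(*twin_correlations(v_a=0.5))

# Hypothetical trait with the same broad heritability (0.5) but only
# 0.3 of it additive; the rest is epistatic.
mixed = falconer_estimate(*twin_correlations(v_a=0.3, v_aa=0.2))
```

In the purely additive case the estimate equals the true heritability of 0.5. In the mixed case it comes out at about 0.6, which is a modest overestimate of the broad heritability (0.5) but double the additive heritability (0.3). That is the same qualitative pattern as in the points above, though the LP models in the paper generate their non-additivity in a different way.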

Guest post from Alex Kogan: Size and populations matter–let’s understand why

[This is a guest post by Alex Kogan. Last week, Ed Yong at Not Exactly Rocket Science covered a paper positing an association between a genetic variant and an aspect of social behavior called prosociality. On Twitter, Daniel and Joe dismissed this study out of hand due to its small sample size (n = 23), leading Ed to update his post. Daniel and Joe were then contacted by Alex Kogan, the first author of the study in question. He kindly shared his data with us, and agreed to an exchange here on Genomes Unzipped. Our comments on the study are here; this is Alex’s reply.]

It’s a truism that resonates across science: Size matters when doing and interpreting the statistical (and practical) meaning of a study. But the size of what? Well, it’s quite a few things—all of which are very important in understanding what a study is ultimately telling us. One of the first numbers researchers focus on is the p-value. The p-value relies on a bit of counterintuitive logic: It represents the percentage of times you would get an effect as big as you got (or bigger) if there is really no effect in the general population. So we first assume that there is really no difference in some outcome between two groups across the general population (we call this the null hypothesis), and then we ask what are the chances of us finding the difference that we found (or bigger) given this assumption. If this percentage is low (many fields adopt a p = .05 standard, or a 5% chance that we’d get the effect we got or bigger if there is really no effect in the general population), then we can reject the initial idea that there is no difference in the general population. So what have we learned if the p-value is .05 or lower? That there is likely a difference in the general population—how big this difference is remains a mystery, however; the p-value never answers that question.
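
The logic of the p-value can be sketched with a permutation test, which follows the description above quite literally: assume group labels make no difference, shuffle them many times, and count how often the shuffled difference is at least as big as the observed one. This is only an illustration. The function name, the number of shuffles and the toy scores are assumptions, and it is not the analysis used in the study under discussion.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)          # fixed seed so the sketch is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)            # pretend group labels are arbitrary (the null hypothesis)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:           # "an effect as big as you got (or bigger)"
            hits += 1

    # Adding 1 to both counts includes the observed labelling itself
    # and keeps the estimate from ever being exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical scores for two small groups.
p_separated = permutation_p_value([1, 2, 3, 4, 5], [101, 102, 103, 104, 105])
p_identical = permutation_p_value([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
```

With the clearly separated groups the p-value is small, and with identical groups it is 1. In neither case does the number say how big the difference is, only how surprising it would be if there were no difference at all.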

Size matters, and other lessons from medical genetics

Size really matters: prior to the era of large genome-wide association studies, the large effect sizes reported in small initial genetic studies often dwindled towards zero (that is, an odds ratio of one) as more samples were studied. Adapted from Ioannidis et al., Nat Genet 29:306-309.

[Last week, Ed Yong at Not Exactly Rocket Science covered a paper positing an association between a genetic variant and an aspect of social behavior called prosociality. On Twitter, Daniel and Joe dismissed this study out of hand due to its small sample size (n = 23), leading Ed to update his post. Daniel and Joe were then contacted by Alex Kogan, the first author of the study in question. He kindly shared his data with us, and agreed to an exchange here on Genomes Unzipped. In this post, we expand on our point about the importance of sample size; Alex’s reply is here.

Edit 01/12/11 (DM): The original version of this post included language that could have been interpreted as an overly broad attack on more serious, well-powered studies in psychiatric disease genetics. I've edited the post to reduce the possibility of collateral damage. To be clear: we're against over-interpretation of results from small studies, not behavioral genetics as a whole, and I apologise for any unintended conflation of the two.]

In October of 1992, genetics researchers published a potentially groundbreaking finding in Nature: a genetic variant in the angiotensin-converting enzyme ACE appeared to modify an individual’s risk of having a heart attack. This finding was notable at the time for the size of the study, which involved a total of over 500 individuals from four cohorts, and the effect size of the identified variant–in a population initially identified as low-risk for heart attack, the variant had an odds ratio of over 3 (with a corresponding p-value less than 0.0001).

Readers familiar with the history of medical association studies will be unsurprised by what happened over the next few years: initial excitement (this same polymorphism was associated with diabetes! And longevity!) was followed by inconclusive replication studies and, ultimately, disappointment. In 2000, 8 years after the initial report, a large study involving over 5,000 cases and controls found absolutely no detectable effect of the ACE polymorphism on heart attack risk. In the meantime, the same polymorphism had turned up in dozens of other association studies for a wide range of traits ranging from obstetric cholestasis to meningococcal disease in children, virtually none of which have ever been convincingly replicated.
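
A toy simulation shows why small studies can throw up large apparent effects even when nothing is going on. It is a sketch under stated assumptions, not a reanalysis of the ACE data or of Ioannidis et al.: the variant is assumed to have no effect at all (a true odds ratio of one), and the carrier frequency, study sizes and function names are made up for illustration.

```python
import math
import random


def simulated_log_odds_ratios(n_per_group, n_studies, carrier_freq=0.3, seed=1):
    """Log odds ratios from case-control studies of a variant with NO true effect."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_studies):
        # Cases and controls carry the variant at the same frequency.
        case_carriers = sum(rng.random() < carrier_freq for _ in range(n_per_group))
        control_carriers = sum(rng.random() < carrier_freq for _ in range(n_per_group))
        # Add 0.5 to every cell so empty cells do not cause division by zero.
        a = case_carriers + 0.5
        b = n_per_group - case_carriers + 0.5
        c = control_carriers + 0.5
        d = n_per_group - control_carriers + 0.5
        results.append(math.log((a * d) / (b * c)))
    return results


def largest_apparent_odds_ratio(log_ors):
    """Most extreme odds ratio, in either direction, expressed as a value >= 1."""
    return math.exp(max(abs(x) for x in log_ors))


small = largest_apparent_odds_ratio(simulated_log_odds_ratios(25, 200))
large = largest_apparent_odds_ratio(simulated_log_odds_ratios(2000, 200))
```

Among 200 simulated studies with 25 cases and 25 controls each, the most extreme odds ratio is far from one purely by chance, while among 200 studies with 2,000 of each it stays close to one. If only the most striking small studies get published, the initial reports will look much like the early ACE result.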

Going green: lessons from plant genomics for human sequencing studies

This is a guest post by Jeffrey Rosenfeld. Jeff is a next-generation sequencing advisor in the High Performance and Research Computing group at the University of Medicine and Dentistry of New Jersey, working on a variety of human and microbial genetics projects. He is also a Visiting Scientist at the American Museum of Natural History where he focuses on whole-genome phylogenetics. He was trained at the University of Pennsylvania, New York University and Cold Spring Harbor Laboratory.

As human geneticists, it is all too easy to ignore papers published about non-human organisms – especially when those organisms are plants. After all, how much can the analysis of (say) Arabidopsis genome diversity possibly assist in my quest to better understand the human genome and determine which genes cause disease? Quite a bit, as it happens: a fascinating recent paper in Nature demonstrates a number of lessons that we can learn from our distant green relatives.

By exploiting the small genome size of Arabidopsis (~120 million bases, compared to the relatively gargantuan 3 billion bases of Homo sapiens), researchers were able to perform complete genome sequencing and transcriptome profiling in 18 different ecotypes of the plant (similar to what we would call strains of an animal).

In a normal genome re-sequencing experiment, the procedure is to obtain DNA from an individual, sequence the DNA, align it to a reference sequence and then to call variants (i.e. differences from the reference). This approach is used by the 1000 Genomes Project and basically all of the hundreds of disease-focused human sequencing projects currently underway around the world. This approach allows researchers to relatively easily identify single-base substitution (SNP) and small insertion/deletion (indel) differences between genomes. However, the amount of variability that can be identified is restricted by the use of a reference: regions where there is extreme divergence between the reference and sample genomes are often badly called, and more complex variants (e.g. large, recurrent rearrangements of DNA) can be missed. Additionally, and crucially, sequences that are not present in the reference genome will be completely missed by this approach.
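
The limitation described above can be sketched in a few lines. This is a deliberately naive illustration and not how real aligners or variant callers work: reads are placed by brute-force comparison against every position in a made-up reference, and the sequences, the mismatch threshold and the function names are all assumptions.

```python
def place_read(reference, read, max_mismatches=1):
    """Return the best start position for a read, or None if it fits nowhere."""
    best_start, best_mismatches = None, max_mismatches + 1
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(window, read))
        if mismatches < best_mismatches:
            best_start, best_mismatches = start, mismatches
    return best_start


def call_variants(reference, reads):
    """Call single-base differences from placed reads; collect reads that cannot be placed."""
    variants, unplaced = set(), []
    for read in reads:
        start = place_read(reference, read)
        if start is None:
            unplaced.append(read)      # sequence absent from the reference
            continue
        for offset, base in enumerate(read):
            pos = start + offset
            if base != reference[pos]:
                variants.add((pos, reference[pos], base))
    return sorted(variants), unplaced


reference = "GATTACAGGCTTAACCGT"
reads = [
    "TTACAGG",   # matches the reference exactly
    "GCTGAAC",   # one substitution relative to the reference
    "CCCCCCC",   # novel sequence with no counterpart in the reference
]
variants, unplaced = call_variants(reference, reads)
# variants -> [(11, 'T', 'G')]
# unplaced -> ['CCCCCCC']
```

The substitution is reported, but the novel read never makes it into the variant list at all. In a typical pipeline such reads are set aside as unaligned, which is why sequence missing from the reference goes unseen unless it is looked for separately.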

Report on clinical genome sequencing

The PHG Foundation, an independent genomics think-tank, has launched a new report on next generation sequencing and its impact on health and health systems. The Report, Next steps in the sequence: the implications of whole genome sequencing for health in the UK can be freely downloaded and aims to provide a comprehensive overview of the many and varied issues relating to clinical genome sequencing.

When planning the work, we were motivated by the astonishingly rapid development of fast, affordable whole genome sequencing (WGS) technologies, which are set to change many aspects of health care. The sheer quantity and complexity of the information generated by genome sequencing, along with ever-changing understanding of the function of genomes in health and disease, presents new challenges for health systems.

The Report reviews the technologies, informatics pipeline and key clinical applications of WGS, as well as the economic, ethical, legal and social implications and organisational challenges of offering WGS within the UK NHS. The final two policy chapters outline different scenarios for testing, storing and returning results, and contain 10 key recommendations reached with the help of several expert stakeholder workshops.


Revisiting RNA-DNA sequence differences

A few months ago, I discussed a paper by Li and colleagues reporting a large number of sequence differences between mRNA and DNA from the same individual [1]. While some such differences are expected due to known mechanisms of RNA editing (e.g. A->I editing, see [2]), Li et al. reported an astonishingly high number of them, including thousands of events inconsistent with any known regulatory mechanism. These results implied at least one, and probably many, new mechanisms of gene regulation, and called into question some basic assumptions in molecular biology.

An alternative explanation for the observations of Li et al. is less exciting–imagine two genes with similar (but not identical) sequences, which produce similar (but not identical) mRNAs. If you accidentally attributed both mRNA sequences to the same gene, you could erroneously conclude that one of the two sequences arose via RNA editing of the other. According to a new paper in PLoS One by Schrider and colleagues [3], this banal artifact accounts for the majority of the reported RNA-DNA sequence differences in Li et al.

Schrider et al. show that RNA-DNA mismatches are enriched in genes with close paralogs or copy number variants, both of which are consistent with the technical artifact mentioned above. However, their most striking result is that, at many of the putative RNA editing sites, the “edited” base from the mRNA is actually present in genomic DNA. To show this, Schrider et al. took advantage of the fact that low-coverage DNA sequencing data is available for the individuals used in the Li et al. study. They searched through these data to find genomic sequences matching the “edited” mRNA form. If these sites were truly due to RNA editing, they shouldn’t find any. Instead, at ~75% of the tested sites, they could find a genomic match to the “edit” in at least one individual. There are some potential complications with the interpretation of this number (as they note, the genomic data could include sequencing errors that happen to be the same base as the “edit”), but this observation strongly suggests that a majority of the sites identified by Li et al. are false positives due to this single technical issue.
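
The artifact, and the kind of check described above, can be illustrated with a toy example. The two sequences below are invented, the function names are hypothetical, and exact substring matching stands in for what in practice is a search through noisy low-coverage sequencing reads. This is a sketch of the logic, not of Schrider et al.'s pipeline.

```python
def apparent_edit_sites(mrna, gene_dna):
    """Positions where an mRNA read differs from the gene it was assigned to."""
    return [i for i, (r, d) in enumerate(zip(mrna, gene_dna)) if r != d]


def has_genomic_match(mrna, genomic_sequences):
    """True if the 'edited' mRNA sequence is found as-is somewhere in genomic DNA."""
    return any(mrna in seq for seq in genomic_sequences)


gene_a = "ATGGCCAAGTTC"
paralog_b = "ATGGCCGAGTTC"    # close paralog, differs from gene_a at one base

# An mRNA transcribed faithfully from paralog B (written with T rather than U)
# but mistakenly attributed to gene A.
read_from_b = paralog_b

sites = apparent_edit_sites(read_from_b, gene_a)                  # [6]
explained = has_genomic_match(read_from_b, [gene_a, paralog_b])   # True
```

Compared against gene A alone, the read appears to carry an A-to-G "edit" at position 6. Once the rest of the genome is searched, the same sequence turns up unedited in the paralog, so no editing is needed to explain it.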


[1] Li et al. (2011) Widespread RNA and DNA Sequence Differences in the Human Transcriptome. Science. doi:10.1126/science.1207018

[2] Levanon et al. (2004) Systematic identification of abundant A-to-I editing sites in the human transcriptome. Nature Biotechnology. doi:10.1038/nbt996

[3] Schrider et al. (2011) Very Few RNA and DNA Sequence Differences in the Human Transcriptome. PLoS One. doi:10.1371/journal.pone.0025842

Genetic risk prediction in complex disease

I thought I’d point out a review article in Human Molecular Genetics that just came out in (open access) preprint form by Luke and myself on genetic risk prediction in complex disease. In it we discuss some of the strengths and weaknesses of genetic risk prediction compared to classical epidemiological predictors, different statistical modelling considerations, and the effect of GWAS on prediction. Readers of this space might find the conclusion of some interest, where we consider some of the societal aspects of trying to bring the interpretation of genomes into mainstream medical practice.

Notes on the evidence for extensive RNA editing in humans

The “central dogma” of molecular biology holds that the information present in DNA is transferred to RNA and then to protein. In a paper published online at Science yesterday, Li and colleagues report a potentially extraordinary observation: they show evidence that, within any given individual, there are tens of thousands of places where transcribed RNA does not match the template DNA from which it is derived [1]. This phenomenon, called RNA editing, is generally thought to be limited (in humans) to conversions of the base adenosine to the base inosine (which is read as guanine by DNA sequencers), and occasionally from cytosine to uracil. In contrast, these authors report that any type of base can be converted to any other type of base.

If these observations are correct, they represent a fundamental change in how we view the process of gene regulation. However, in this post I am going to point out a couple of technical issues that, if not properly taken into account, have the potential to cause a large number of false positives in this type of data. The main point can be summarized like this: RNA editing involves the production of two different RNA and/or protein sequences from a single DNA sequence. To infer RNA editing from the presence of two different RNA and/or protein sequences, then, one must be very sure that they derive from the same DNA sequence, rather than from two different copies of the DNA (due to, for example, paralogs or copy number variants). This issue has the potential to be a large source of false positives in a study like this, and I will also discuss an additional technical problem that could result in false positives.


How do variants outside genes influence disease risk?

Over the last several years, the number of genetic variants unambiguously associated with disease risk has grown dramatically. However, interpreting these signals has been extremely difficult—most of the identified variants do not disrupt genes, and indeed many don’t fall anywhere near genes (this observation has even led some to discount these signals entirely). To an investigator interested in following up on these signals, this is somewhat depressing: how can we hope to explore how polymorphisms affect disease risk if they don’t seem to fall in any sort of genome annotation that we understand?

In this context, I thought I’d point to an important paper that, among many other things, gives the first systematic evidence that variants which influence disease are not just randomly scattered across the genome, but instead tend to fall in particular regions—in particular, enhancer elements (regions where DNA-binding proteins interact with DNA to influence gene expression).

The authors rely on the fact that, in the cell, DNA is wrapped around proteins called histones, which control how accessible the DNA is to things like transcription factors (see above figure). These proteins can be chemically modified, and it is now clear that particular patterns of modifications are predictive of the function of the DNA in the region—some modifications indicate transcribed genes, others regions of enhancer activity, others repressed regions, etc.

What the authors did in this study was generate genome-wide maps of several histone modifications in nine different cell types, and use this data to predict the function of each 200 base pair segment of the human genome in each cell type. There are a number of interesting analyses of these “maps” of genome function in the paper, but for our purposes here there’s one of particular interest: the authors took sets of SNPs associated with various diseases and simply asked, are these variants enriched in regions with any particular functional prediction? And indeed, for several phenotypes, there is a striking enrichment of association signals in enhancer elements in a relevant cell type. For example, SNPs which influence lipid levels are enriched in enhancers in a liver cancer cell line, and SNPs which influence the autoimmune disease lupus are enriched in enhancers in a lymphoblastoid cell line.
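
One simple way to ask the enrichment question is a hypergeometric test: if disease SNPs landed in genome segments at random, how often would at least this many fall in segments annotated as enhancers? The sketch below is an assumption about how such a test could be set up, not the method used in the paper, and all of the counts are invented. It also treats SNPs as independent, which ignores linkage disequilibrium.

```python
from math import comb


def enrichment_p_value(total_segments, enhancer_segments, n_snps, snps_in_enhancers):
    """P(at least `snps_in_enhancers` of `n_snps` fall in enhancers by chance).

    Assumes each SNP sits in a distinct segment and that, under the null,
    every set of `n_snps` segments is equally likely (hypergeometric model).
    """
    all_draws = comb(total_segments, n_snps)
    other_segments = total_segments - enhancer_segments
    upper = min(enhancer_segments, n_snps)
    tail = sum(
        comb(enhancer_segments, k) * comb(other_segments, n_snps - k)
        for k in range(snps_in_enhancers, upper + 1)
    )
    return tail / all_draws


# Hypothetical numbers: 1,000 segments, 50 of them enhancers in the relevant
# cell type, and 20 trait-associated SNPs.
p_enriched = enrichment_p_value(1000, 50, 20, 10)   # half the SNPs in enhancers
p_expected = enrichment_p_value(1000, 50, 20, 1)    # roughly what chance predicts
```

With 5% of segments annotated as enhancers, about one of the 20 SNPs is expected in an enhancer by chance, so seeing at least one is unremarkable while seeing ten gives a very small p-value. Repeating the test per cell type is what lets a relevant cell type stand out.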

As these types of functional maps are generated in more cell types, I imagine there will be more stories like this. The problem with interpreting disease association studies, it seems likely, is largely due to our lack of understanding of genome function.

—-
Citation: Ernst et al. (2011) Mapping and analysis of chromatin state dynamics in nine human cell types. Nature. doi:10.1038/nature09906

Are synthetic associations a man-made phenomenon?

Early last year David Goldstein and colleagues published a provocative paper claiming that many GWAS associations are driven not by common variants of modest effect (the canonical common disease – common variant hypothesis underpinning GWAS) but instead by a local cluster of lower frequency variants that have much bigger effects on disease risk. They dubbed this hypothesized phenomenon “synthetic association” and the term quickly became a genetics buzzword. The paper was widely discussed in both the specialist and mainstream media, and caused quite a stir among academic statistical geneticists.

That debate has been re-opened today by a set of Perspectives in PLoS Biology: a rebuttal by us (Carl & Jeff) and our colleagues at Sanger, a rebuttal by Naomi Wray, Shaun Purcell and Peter Visscher, a rebuttal to the rebuttals by David Goldstein and an editorial by Robert Shields to tie it all together.


