Though this site is largely dedicated to discussions of personal genomics, I’d like to use this post to discuss some of my recent work (done with Athma Pai, Yoav Gilad, and Jonathan Pritchard) on mRNA splicing. Our paper, in which we argue that splicing is a relatively error-prone and noisy process, has just been published in PLoS Genetics.
As many readers of this site know, most human genes are encoded in the genome in a bizarre fashion: the protein-coding parts of the gene are split into small chunks (called exons) separated by large swathes of non-coding, largely useless DNA (called introns; see the figure above). In order to fashion a functional mRNA (and thus a functional protein) from this type of organization, the cell first transcribes a long pre-mRNA, then decides which parts of the pre-mRNA are exons and removes the remainder via a process called splicing.
Though the process of splicing is somewhat convoluted (and was likely slightly deleterious when it initially evolved in the ancestor of all eukaryotes), it can be regulated by the cell in clever ways such that the same gene can produce different proteins in different conditions via alternative splicing. Importantly, genetic variation between individuals can also influence splicing.
Earlier this year, we published a paper in which we used high-throughput sequencing of mRNA in about 70 individuals to, among other things, try to identify the precise genetic variants influencing variation in splicing between individuals. In the course of doing this, we developed methods for identifying previously unobserved splice forms. Using these methods, we saw something that was then (to us) somewhat perplexing: an abundance of never-before-seen splice junctions and splice forms in nearly every gene we examined. The new paper presents the follow-up work on that observation.
What do we show?
After polishing our methods a bit more, we ultimately identified about 300,000 splice junctions in our data, about half of which had never before been observed. These splice forms are generally at low abundance in the cell and show no evidence of evolutionary conservation. Our conclusion, then, was that we are measuring the error rate of splicing reactions on a genome-wide scale.
Doing a back-of-the-envelope calculation with these data, we estimate that the error rate of the average splicing reaction in the human genome is about 0.7%. This works out to a few percent of transcripts from the average gene being mis-spliced (since most genes undergo multiple splicing reactions).
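To make the arithmetic behind that last step concrete, here is a minimal sketch. It assumes that splicing errors at different introns are independent, and it uses an illustrative figure of 8 introns per gene; both the independence assumption and the intron count are mine for illustration, not numbers taken from the paper. The function name is likewise made up for this post.

```python
# Rough sketch: fraction of transcripts with at least one splicing error,
# assuming each splicing reaction fails independently at the same rate.

def fraction_mis_spliced(per_reaction_error: float, n_introns: int) -> float:
    """Probability that at least one of n_introns splicing reactions goes wrong."""
    return 1.0 - (1.0 - per_reaction_error) ** n_introns


PER_REACTION_ERROR = 0.007  # the ~0.7% per-reaction estimate quoted above
ASSUMED_INTRONS = 8         # illustrative assumption, not a figure from the paper

if __name__ == "__main__":
    for n in (1, 4, ASSUMED_INTRONS, 16):
        frac = fraction_mis_spliced(PER_REACTION_ERROR, n)
        print(f"{n:2d} introns: {100 * frac:.1f}% of transcripts mis-spliced")
```

Under these assumptions, a gene with a handful of introns ends up with a few percent of its transcripts mis-spliced, and the fraction grows roughly linearly with the number of introns while the per-reaction error rate stays small.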
This might seem like a rather high error rate. However, consider that tens or hundreds of bases are necessary for the fully efficient removal of an intron, and that a mutation that disrupts any of these bases can cause a reduction in the efficiency of the reaction. Every generation, these mutations occur, and if they’re not sufficiently deleterious, it’s inevitable that some will reach fixation and be carried by all humans. This idea is not new, of course; Michael Lynch has referred to this fact as part of the “intrinsic cost of introns”. But what we’ve shown is that these new sequencing technologies allow us to measure these sorts of things on a much larger scale than was previously possible.
Pickrell et al. (2010) Noisy splicing drives mRNA isoform diversity in human cells. PLoS Genetics.
Pickrell et al. (2010) Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature. doi:10.1038/nature08872.
Lynch (2007) The origins of genome architecture. Sunderland, Mass.: Sinauer Associates.