In May of last year, Li and colleagues reported that they had observed over 10,000 sequence mismatches between messenger RNA (mRNA) and DNA from the same individuals (RDD sites, for RNA-DNA differences). This week, Science has published three technical comments on this article (one that I wrote with Yoav Gilad and Jonathan Pritchard; one by Wei Lin, Robert Piskol, Meng How Tan, and Billy Li; and one by Claudia Kleinman and Jacek Majewski). We conclude that at least ~90% of the Li et al. RDD sites are technical artifacts [2,3,4]. A copy of the comment I was involved in is available here, and Li et al. have responded to these critiques.
In this post, I’m going to describe how we came to the conclusion that nearly all of the RDD sites are technical artifacts. For a full discussion, please read the comments themselves.
It’s worth remembering why the Li et al. study has received so much attention. It is known that there are many thousands of bases in human transcripts that are sometimes modified from adenine to inosine (A->I) after transcription via RNA editing. However, these sites are generally found outside of protein-coding regions of mRNAs (i.e., in introns and untranslated regions), often in repeats. There are perhaps a few dozen known RNA editing sites that affect protein sequence, though more presumably exist (incidentally, many of these were found by Billy Li, one of the authors of the Lin/Piskol et al. technical comment).
In light of what we know about RNA editing, Li et al. was a bombshell. They found over 10,000 exonic RDD sites, most of which were not A->I changes (or C->U changes, another known type of RNA editing). These included many thousands of RDD sites that were predicted to change protein sequence. These results implied the existence of at least one, and probably more, novel mechanisms of gene regulation, and indeed called into question some basic assumptions used regularly in genetics (for example, that if one knows the sequence of a gene, one can predict with near certainty the sequence of the relevant protein).
So it’s not the existence of RDD sites, per se, that was so surprising about Li et al., but rather the major biological impact of the sites and the implied existence of previously unknown regulatory pathways.
Why I think nearly all of the RDD sites in Li et al. are false positives
Since the publication of Li et al., two groups have raised serious issues about the reported RDD sites [7,8]. Both concluded that the majority of these sites were false positives. (If these authors are wondering why they’re not cited in our comment, it’s because Science didn’t let me add the citations during the editing process; sorry!)
The observation that I personally found most convincing is displayed in the plot at the beginning of this post. What I’m showing is that mismatches to the genome at RDD sites occur almost exclusively at the ends of sequencing reads. All three technical comments include this observation. Importantly, Lin/Piskol et al. take this analysis one step further. They show (in their Figure 2) that this effect is driven by the fact that mismatches to the genome at RDD sites tend to occur at the beginning of sequencing reads that go in the opposite direction of transcription (this effect is masked in my plot).
To argue that this pattern is not due to major technical problems, then, one needs to come up with a biological mechanism that accounts for the following observations:
- Mismatches to the genome at RDD sites are almost exclusively at the ends of sequencing reads (all three comments show this)
- In particular, mismatches at RDD sites are massively enriched at the beginning of sequencing reads in the opposite direction of mRNA transcription (Lin/Piskol et al. show this)
- Known A->I RNA editing sites do not have the above two properties (21/23 RDD sites that were previously observed as A->I edits pass the filters used by Pickrell et al.)
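The first two observations come down to asking, for every read that carries a mismatch at an RDD site, how far the site sits from the start of that read and whether the read runs with or against the direction of transcription. Here is a minimal sketch of that bookkeeping. It is not the code used in any of the technical comments; the tuple layout, the function names and the toy reads are all made up for illustration, and it assumes ungapped alignments.

```python
from collections import Counter

def offset_from_read_start(read_start, read_length, read_strand, site_pos):
    """Distance (0-based) of a genomic site from the 5' end of a read.

    read_start is the leftmost genomic coordinate of the alignment and
    read_strand is '+' or '-'. Assumes an ungapped alignment.
    """
    if not (read_start <= site_pos < read_start + read_length):
        raise ValueError("site is not covered by this read")
    if read_strand == "+":
        return site_pos - read_start
    # For a reverse-strand read, the 5' end is the rightmost aligned base.
    return read_start + read_length - 1 - site_pos

def tabulate_mismatch_offsets(reads, site_pos, gene_strand):
    """Count mismatch offsets, split by read direction relative to the gene."""
    counts = {"sense": Counter(), "antisense": Counter()}
    for read_start, read_length, read_strand, is_mismatch in reads:
        if not is_mismatch:
            continue  # read agrees with the genome at this site
        direction = "sense" if read_strand == gene_strand else "antisense"
        offset = offset_from_read_start(read_start, read_length, read_strand, site_pos)
        counts[direction][offset] += 1
    return counts

# Hypothetical reads: (leftmost position, length, strand, mismatch at site?)
reads = [
    (100, 50, "-", True),
    (105, 45, "-", True),
    (120, 50, "+", False),
    (148, 50, "+", True),
]
counts = tabulate_mismatch_offsets(reads, site_pos=149, gene_strand="+")
print(counts)
```

In this toy example, both antisense reads carry the mismatch at offset 0, i.e., at their very first base. A genuine editing site should instead show mismatches spread across read positions in both directions.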
The response by Li et al. proposes (but does not show) that these observations are due to A) clustering of RDD sites and B) co-occurrence of RDD sites with insertion/deletion RDD sites. They argue that these two effects could lead to mapping biases, such that sequencing reads carrying an edited base will only map to the genome if the mismatch is at the end of the read. There are two important points to make about this potential explanation. First, this proposed mechanism cannot account for the second observation above, nor is it immediately clear how it would accommodate the third. Second, it may not be obvious to everyone that widespread insertional RNA editing has not been observed in humans. Li et al. propose a new regulatory mechanism (widespread insertional RNA editing) that interacts with the new regulatory mechanism proposed in their original paper (novel RNA editing types) to create patterns in the data that look indistinguishable from technical artifacts. I think it’s fair to say that the burden of proof is on Li et al. to show that this explanation is more than adding an epicycle on an epicycle.
Lin/Piskol et al. instead propose a plausible artifactual explanation for all three observations. To understand this explanation, it’s important to note that Li et al. have not sequenced RNA itself, but rather cDNA generated from mRNA. To generate the cDNA, they added random short DNA sequences to each sample to act as primers for a DNA synthesis reaction. The argument is as follows: at some sites, the random primers were imperfect matches to the mRNA, but were still able to bind. During synthesis, the mismatches from the primers were incorporated into the cDNA, leading to a false signal of RNA editing (specifically at the positions where the primer initially bound; i.e., the beginning of reads in the opposite direction of transcription). In effect, at a small fraction of sites (but a large absolute number), Li et al. inadvertently performed site-directed mutagenesis on their cDNA library.
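To make the priming argument concrete, here is a toy simulation of first-strand synthesis from an imperfectly matched primer. It is only a sketch of the logic, not a model of the actual library protocol: it writes the mRNA in the DNA alphabet, ignores fragmentation and second-strand synthesis, and the hexamer length, priming position and function names are all assumptions chosen for illustration.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def cdna_mismatch_offsets(mrna, prime_at, primer):
    """Offsets (from the 5' end of the cDNA) that disagree with the template.

    The primer anneals to mrna[prime_at : prime_at + len(primer)] and is
    extended toward the 5' end of the mRNA, so the cDNA is the primer itself
    followed by a faithful copy of the upstream template.
    """
    cdna = primer + reverse_complement(mrna[:prime_at])
    error_free = reverse_complement(mrna[:prime_at + len(primer)])
    return [i for i, (a, b) in enumerate(zip(cdna, error_free)) if a != b]

rng = random.Random(0)
mrna = "".join(rng.choice("ACGT") for _ in range(60))  # toy transcript
k, prime_at = 6, 40                                    # assumed hexamer primer

perfect_primer = reverse_complement(mrna[prime_at:prime_at + k])
# Build a primer that differs from the perfect match at its third base.
wrong_base = next(b for b in "ACGT" if b != perfect_primer[2])
imperfect_primer = perfect_primer[:2] + wrong_base + perfect_primer[3:]

print(cdna_mismatch_offsets(mrna, prime_at, perfect_primer))    # no mismatches
print(cdna_mismatch_offsets(mrna, prime_at, imperfect_primer))  # one, inside the primer
```

The only disagreement with the template falls within the first six bases of the cDNA, which is antisense to the transcript. That is the position and orientation where the RDD mismatches pile up.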
Addressing the validation experiments in Li et al.
If Lin/Piskol et al. are right that the majority of RDD sites are artifacts due to errors introduced during cDNA library generation, how can we explain the fact that Li et al. were able to validate the presence of both “wild-type” and “edited” RNA and proteins at some sites? The technical comments include additional analyses showing that some RDD sites are due to mis-mapped reads from paralogous genes, and some due to previously unidentified genetic variation. At these sites, we argue that the two mRNA and protein forms are in fact present in the data, but that they derive from two different DNA forms, rather than resulting from RNA editing.
In their response, Li et al. present no new validation experiments involving RNA or protein sequences (the closest thing is a single, indirect protein assay). Instead, they present new DNA sequence validation. It’s thus worth revisiting the validation experiments from the original paper.
The first type of validation performed in the original paper involved Sanger sequencing of RNA and DNA from 11 RDD sites. Both Kleinman et al. and Pickrell et al. specifically single out four of these sites (in the genes HLA-DQB2 and DPP7) as particularly likely to be false positives due to genetic variation. In the original paper, the validation data at these sites were not shown. In their response, Li et al. do not present DNA sequence validation at these four sites; it’s unclear whether this is a tacit acknowledgement that these were false positives. Of the remaining 7 sites, 6 are of the A->I type, and indeed 4 of these were already known A->I editing sites. This validation, then, actually had a false positive rate of 80% for non-A->I sites (4/5); there is perhaps one site worth exploring further.
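For anyone who wants to check the 4/5 figure, the tally works out as follows. This sketch assumes, as the 80% figure does, that the four HLA-DQB2 and DPP7 sites are all non-A->I sites and all false positives; the variable names are mine.

```python
sanger_sites = 11            # RDD sites tested by Sanger sequencing
flagged_sites = 4            # HLA-DQB2 and DPP7 sites, assumed to be false positives
remaining_sites = sanger_sites - flagged_sites        # 7
remaining_a_to_i = 6                                  # A->I sites among the remaining 7
remaining_other = remaining_sites - remaining_a_to_i  # 1 non-A->I site left

# Assumption: all four flagged sites are of a non-A->I type.
non_a_to_i_sites = flagged_sites + remaining_other    # 5
false_positive_rate = flagged_sites / non_a_to_i_sites

print(f"{flagged_sites}/{non_a_to_i_sites} = {false_positive_rate:.0%}")
```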
The other validation exercise performed by Li et al. involved identifying peptide sequences that correspond to “edited” RDD sites. Pickrell et al. point out that many of the peptide sequences are in fact equally good matches to multiple genes. We propose, then, that these RDD sites are false positives due to mis-mapped reads from paralogous genes. In their response, Li et al. show DNA sequencing data from several of these sites. However, to show that a paralog of a gene does not have a genetic variant would require sequencing the paralog as well; this was not done. The paralog issue remains, to me, the most plausible explanation for the sites Li et al. claim to have validated.
Are there any examples of new types of RNA editing in Li et al.?
The conclusion that at least 90% of the RDD sites in Li et al. are false positives is in some sense unsatisfying. After all, if the remaining 10% are all true positives then they’ve still identified hundreds of examples of new types of RNA editing! This is the spirit of the argument made by Li et al. in their response, when they say that they “view the discovery of RDDs as the important point and find the exact number to be less salient”.
However, I am skeptical of the remaining sites as well. It is likely that other types of errors besides those described in the technical comments exist, but are hard to detect by the methods we’ve used. Indeed, two separate analyses of RNA editing by Peng et al. and Bahn et al. filtered out false positive sites based on criteria similar to those used in the technical comments. They then tried to validate non-A->I sites by Sanger sequencing. Even after performing rigorous filtering, at least 50% of the remaining non-A->I sites were false positives. Given that this assay is also not a perfect filter, the true fraction of false positives must be even higher, and I am not convinced that it’s less than 100%.
In sum, by selecting for the most “odd-looking” regions of the genome in an analysis, one enriches for strange and unexpected technical artifacts. Even if a given systematic error affects only 0.001% of the bases in the genome, you’d still expect to run across it 30,000 times if you look at the whole genome! (Or maybe half that if you look only at bases expressed in pre-mRNA.) As Daniel wrote regarding his own work on nonsense SNPs (which we know exist, but still were quite difficult to identify reliably), the more interesting something is, the less likely it is to be real.
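The back-of-the-envelope calculation behind that 30,000 is simple enough to write down. The genome size of roughly 3 billion bases is my assumption here (it is what the 30,000 figure implies), and the function name is just for illustration.

```python
def expected_affected_bases(n_bases, percent_affected):
    """Expected number of bases hit by an error affecting a given percentage of bases."""
    return n_bases * percent_affected / 100

GENOME_SIZE = 3_000_000_000  # assumed: roughly 3 billion bases in the human genome

whole_genome = expected_affected_bases(GENOME_SIZE, 0.001)
pre_mrna_only = whole_genome / 2  # "maybe half that" for bases expressed in pre-mRNA

print(round(whole_genome), round(pre_mrna_only))  # prints: 30000 15000
```

Set against numbers like these, a list of about 10,000 unusual sites is well within what a rare systematic error could produce on its own.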
Of course, it remains plausible that previously unidentified forms of RNA editing are active in humans, and RNA sequencing technology will certainly be important for determining whether such new forms exist. The comments published today, however, indicate that the analyses done by Li et al. are based on technical artifacts, and do not provide evidence for interesting biology. My opinion is that the Li et al. study should have been outright retracted. However, there is a small, but non-zero, probability that a handful of the reported non-A->I sites are real; readers can draw their own conclusion as to whether this justifies keeping the paper as part of the scientific record.
M Li et al. (2011) Widespread RNA and DNA Sequence Differences in the Human Transcriptome. doi:10.1126/science.1207018.
Pickrell et al. (2012) Technical Comment on “Widespread RNA and DNA Sequence Differences in the Human Transcriptome”. doi:10.1126/science.1210484.
Lin et al. (2012) Technical Comment on “Widespread RNA and DNA Sequence Differences in the Human Transcriptome”. doi:10.1126/science.1210624.
Kleinman and Majewski (2012) Technical Comment on “Widespread RNA and DNA Sequence Differences in the Human Transcriptome”. doi:10.1126/science.1209658.
M Li et al. (2012) Response to Comments on “Widespread RNA and DNA Sequence Differences in the Human Transcriptome”. doi:10.1126/science.1210419.
Levanon et al. (2004) Systematic identification of abundant A-to-I editing sites in the human transcriptome. doi:10.1038/nbt996.
Schrider et al. (2011) Very Few RNA and DNA Sequence Differences in the Human Transcriptome. doi:10.1371/journal.pone.0025842.
Peng et al. (2012) Comprehensive analysis of RNA-Seq data reveals extensive RNA editing in a human transcriptome. doi:10.1038/nbt.2122.
Bahn et al. (2012) Accurate identification of A-to-I RNA editing in human by transcriptome sequencing. doi:10.1101/gr.124107.111.