Revisiting RNA-DNA sequence differences

A few months ago, I discussed a paper by Li and colleagues reporting a large number of sequence differences between mRNA and DNA from the same individual [1]. While some such differences are expected due to known mechanisms of RNA editing (e.g. A->I editing, see [2]), Li et al. reported an astonishingly high number of them, including thousands of events inconsistent with any known regulatory mechanism. These results implied at least one, and probably many, new mechanisms of gene regulation, and called into question some basic assumptions in molecular biology.

An alternative explanation for the observations of Li et al. is less exciting: imagine two genes with similar (but not identical) sequences, which produce similar (but not identical) mRNAs. If you accidentally attributed both mRNA sequences to the same gene, you could erroneously conclude that one of the two sequences arose via RNA editing of the other. According to a new paper in PLoS One by Schrider and colleagues [3], this banal artifact accounts for the majority of the reported RNA-DNA sequence differences in Li et al.

Schrider et al. show that RNA-DNA mismatches are enriched in genes with close paralogs or copy number variants, both of which are consistent with the technical artifact mentioned above. However, their most striking result is that, at many of the putative RNA editing sites, the “edited” base from the mRNA is actually present in genomic DNA. To show this, Schrider et al. took advantage of the fact that low-coverage DNA sequencing data is available for the individuals used in the Li et al. study. They searched through these data to find genomic sequences matching the “edited” mRNA form. If these sites were truly due to RNA editing, they shouldn’t find any. Instead, at ~75% of the tested sites, they could find a genomic match to the “edit” in at least one individual. There are some potential complications with the interpretation of this number (as they note, the genomic data could include sequencing errors that happen to be the same base as the “edit”), but this observation strongly suggests that a majority of the sites identified by Li et al. are false positives due to this single technical issue.
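
To make the logic of that check concrete, here is a minimal sketch in Python. It is not the pipeline used by Schrider et al.; the `AlignedRead` structure, the function name, and the toy reads are all hypothetical, and it assumes ungapped alignments with 0-based coordinates. The idea is simply to count, among genomic reads covering a putative editing site, how many carry the "edited" base.

```python
from typing import Iterable, NamedTuple, Tuple


class AlignedRead(NamedTuple):
    """Hypothetical, simplified genomic read: ungapped alignment only."""
    start: int  # 0-based reference position of the read's first base
    seq: str    # read bases, in reference orientation


def count_genomic_support(
    reads: Iterable[AlignedRead], site: int, edited_base: str
) -> Tuple[int, int]:
    """Return (reads covering the site, reads carrying the 'edited' base)."""
    covering = 0
    supporting = 0
    for read in reads:
        offset = site - read.start
        if 0 <= offset < len(read.seq):
            covering += 1
            if read.seq[offset].upper() == edited_base.upper():
                supporting += 1
    return covering, supporting


# Toy data: two reads cover position 105, one of them with a G there.
reads = [
    AlignedRead(100, "ACGTACGTAC"),
    AlignedRead(103, "TAGGTACGTA"),
    AlignedRead(200, "ACGTACGTAC"),
]
covering, supporting = count_genomic_support(reads, site=105, edited_base="G")
print(covering, supporting)
```

Under true RNA editing, `supporting` should stay at zero apart from sequencing errors; a nonzero count in genomic DNA points instead to a paralog, copy number variant, or SNP.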


[1] Li et al. (2011) Widespread RNA and DNA Sequence Differences in the Human Transcriptome. Science. doi:10.1126/science.1207018

[2] Levanon et al. (2004) Systematic identification of abundant A-to-I editing sites in the human transcriptome. Nature Biotechnology. doi:10.1038/nbt996

[3] Schrider et al. (2011) Very Few RNA and DNA Sequence Differences in the Human Transcriptome. PLoS One. doi:10.1371/journal.pone.0025842

8 Responses to “Revisiting RNA-DNA sequence differences”


  • Some may know that I have done similar things using Li et al.’s data. I mapped their NA06994 reads with bwa-sw (for RNA-seq, bwa-sw is more appropriate as it does not require full-length matches), called SNPs with samtools and compared the RNA-seq SNP calls to the consensus produced by Complete Genomics for the same individual. I only get ~100 potential RNA-editing sites, in contrast to the 1244 called by Li et al. The vast majority of these 100 sites are potentially A=>I modifications. More stringent filtering brings the number down to ~50, with an increasing fraction of A=>I. And I bet all the remaining non-A-to-I changes are artifacts due to my simplistic procedure or false negatives (FN) in the Complete Genomics (CG) calls. One may argue that for short 50bp reads, bwa-sw may have a high FN rate. But this FN rate should be below 30%, which can be measured by comparing genomic SNP calls.

    I have also had a look at the Sanger-validated sites. Several non-A=>I changes are actually common SNPs (they even include the highly polymorphic HLA-DQB2). One change is likely to be a sequencing artifact, judging from the trace file. In the end, all the validated sites are A=>I.

    Mass spectrometry (MS) sequences very short peptides. As they have seen the sequences in reads, they are likely to find peptides translated from them. But the question is not whether the sequence is translated somewhere, but whether it is translated from the gene the sequence is mapped to (and not from a homologous gene). We do not know the answer to the latter question given short peptides. To me, MS seems to prove little.

    Together with the new paper, I am fairly convinced that there are few RNA editings and that all RNA editings are A-to-I (perhaps plus a small fraction of C-to-U deamination).

  • Hi Heng,

    Thanks for the comment; that sounds pretty damning. I have also come to the conclusion that all (or nearly all) the editing in these cells is A->I. It’s impossible to identify the exact problem with every single site reported by Li et al., but my conservative estimate is that at least 90% are clearly false positives, and the remainder are suspect.

    It’s reassuring to me that everyone who looked at this closely (well, except perhaps for the original authors) seems to have come to the same conclusion using different approaches.

  • Hello Joe and Heng,

    Wow! Given your comments it seems that by focusing solely on the problem of paralogous sequence we grossly overstated the robustness of the original result!

    Heng’s comment in particular made me curious about the 15 MS-validated RDD sites, so I took a closer look at our results and found that 12 of them match genomic sequence according to the stringent criteria described in our paper, and all 15 match either a paralog in the reference genome or at least one genomic read from the 27 individuals examined in the original paper. In light of this I would agree with Heng’s assessment that MS provides no additional support for RDDs.

  • Oops, the numbers from my comment above are from looking at Table 4 from the original paper, not Table 3. When looking at the 17 MS-validated sites in Table 3, 15 meet the criteria described in our paper for matching genomic sequence in a sequenced individual or the reference genome, and the other two matched at least 2 genomic reads each. The conclusion that these peptide sequences could all be explained by paralogs stands.

  • Thanks Dan, that’s very convincing.

  • @Dan: In your paper, you use bwa to map their RNA-seq reads and find a similar number of editing sites. I wonder what you would get if you also tried bwa-sw, as in my experiment. Bwa/bowtie forces a read to match in full length. This may lead to spurious calls towards the end of a read when it, for example, bridges a splice junction. Bwa-sw is naturally immune to such an artifact.

  • For those confused, Heng Li != Mingyao Li.

    So he’s not criticizing his own paper.

  • Right, when mapping the genomic reads we did not use Bwa-sw, but we only counted alignments where the RDD site was at least 5bp away from either end of the read. So I do not think we are overestimating the number of RDDs that match a paralog because of this problem.
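
Two of the checks raised in the comments above lend themselves to a short sketch: ignoring sites that fall too close to a read end, and asking whether an RNA-DNA mismatch is even of a type that a known editing mechanism could produce. The Python below is illustrative only and is not code from any of the commenters; the function names are hypothetical, coordinates are assumed 0-based, and "at least 5bp away" is interpreted as at least 5 bases between the site and the first or last base of the read. Inosine is read as G by sequencers, so A-to-I editing shows up as A>G on the transcribed strand.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def far_from_read_ends(read_start: int, read_length: int, site: int,
                       min_dist: int = 5) -> bool:
    """True if the site is at least min_dist bases from both read ends."""
    offset = site - read_start
    return min_dist <= offset < read_length - min_dist


def classify_mismatch(dna_base: str, rna_base: str, strand: str = "+") -> str:
    """Classify a mismatch given reference-strand bases and the gene's strand."""
    dna, rna = dna_base.upper(), rna_base.upper()
    if strand == "-":
        # Convert to the transcribed strand before comparing.
        dna, rna = COMPLEMENT[dna], COMPLEMENT[rna]
    if (dna, rna) == ("A", "G"):
        return "A-to-I"
    if (dna, rna) == ("C", "T"):
        return "C-to-U"
    return "other"


print(far_from_read_ends(read_start=100, read_length=50, site=104))
print(classify_mismatch("T", "C", strand="-"))
```

Mismatches that land in the "other" class have no known editing mechanism behind them, which is why they deserve the most suspicion.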
