UPDATE 3/17/12: A more extensive analysis of the paper discussed in this post is here. Several groups have concluded that at least 90% of the sites identified are technical artifacts
The “central dogma” of molecular biology holds that the information present in DNA is transferred to RNA and then to protein. In a paper published online at Science yesterday, Li and colleagues report a potentially extraordinary observation: they show evidence that, within any given individual, there are tens of thousands of places where transcribed RNA does not match the template DNA from which it is derived . This phenomenon, called RNA editing, is generally thought to be limited (in humans) to conversions of the base adenosine to the base inosine (which is read as guanine by DNA sequencers), and occasionally from cytosine to uracil. In contrast, these authors report that any type of base can be converted to any other type of base.
If these observations are correct, they represent a fundamental change in how we view the process of gene regulation. However, in this post I am going to point out a couple of technical issues that, if not properly taken into account, have the potential to cause a large number of false positives in this type of data. The main point can be summarized like this: RNA editing involves the production of two different RNA and/or protein sequences from a single DNA sequence. To infer RNA editing from the presence of two different RNA and/or protein sequences, then, one must be very sure that they derive from the same DNA sequence, rather than from two different copies of the DNA (due to, for example, paralogs or copy number variants). Although this issue has the potential to be a large source of false positives in a study like this, I will discuss an additional technical problem that could also result in false positives.
How to identify sites of RNA editing?
The experimental approach of the authors is easy to explain: they sequenced the mRNA expressed by an individual (or rather, cDNA from a cell line derived from the individual), obtained DNA sequences from the same individual, and compared the two. Any difference between the RNA and DNA should be due to RNA editing.
The complication comes from the fact that, with current technology, it’s impossible to sequence an entire mRNA or genome in a single pass. Instead, what the authors have are short reads (of 50 bases) from the mRNA of the individual, and reads of various length from the DNA of the individual (from the 1000 Genomes Project). What they have to do, then, is match these sequencing reads to the genome (or transcriptome) to see where they came from.
It is this step that can cause problems in an analysis like this. The authors limit themselves to considering positions in the genome where reads map uniquely; that is, reads that match a single position in the genome better than any other position (Actually, Li et al. (2011) map reads to the transcriptome—the regions of the genome annotated as expressed. This adds additional complications in this step, but for most of what follows these complications are not relevant). However, having a read map uniquely is far from a guarantee that it’s in the right place—choosing the best spot involves a number of assumptions, including how you weight insertions and deletions and possible sequencing errors. How could this influence the results in this paper? Let’s take a couple examples:
Paralogs can confuse read mapping
One example of a potential RNA editing event is in the gene HMGN2 (from their Table 1). Below, I’m showing a screen shot from the UCSC Genome Browser in the region of the putative site (you may have to open the image in a new window to get it to a readable size). What I’m showing is the DNA sequence, the position of the RNA (the putative editing site is in position 26,674,349, where the authors claim the G is sometimes converted to a T), and, crucially, a track showing other places in the genome that are similar to this region. By my count, there are 62 other regions of the genome with high percent identity to this region (in the screen shot below, some of these are cut off). To be sure that the T allele at this position is not from any of these other 62 regions, it would be important to show evidence that 1) these 62 regions are not expressed (inspecting some of them shows that some are indeed genes), or 2) if some of them are expressed, that the sequence in the corresponding position in that region is not a T.
These sorts of mapping issues are well appreciated in the literature on SNP calling from sequencing data (which is another situation in which one is looking for a difference between a sequencing read and the genome); a naïve SNP caller that just looks for differences between aligned reads and a genome will output tens (or probably hundreds) of thousands of false positive SNPs which must be filtered out by various criteria . For a list of criteria that are used for this filtering step, I suggest the paper published by the 1000 Genomes Project  (see the Supplementary Information of that paper). As far as I can tell, Li et al. (2011) did not remove false positive calls by any of the criteria generally used in this closely related problem.
Does this matter? I examined some of the reads supporting “editing” at the site mentioned above from the data in Li et al. (2011) (GSE25840). For example, take a read with the sequence TTTTTTTTTTTAAAAGCTATTTTGTTAGCACACAGAACACTTCATTGTTG. In the alignment file I downloaded, this read is mapped over the edited position with extremely high confidence (an alignment quality score of 255). However, you can take that sequence and match it to the human genome using BLAT (just copy and paste into the box after opening that link). There are at least 20 or 30 places in the genome that match that read nearly as well as the “best” match covering the edited site. It’s unclear why this is labeled as a confident alignment, but clearly it should not be. Any errors like this in mapping to paralogous regions can generate false signals of RNA editing. [NOTE ADDED 5/26/2011 by JP: See comment regarding read mapping here. The read mapping software used by the authors gives a mapping quality score of 255 to all reads which map uniquely to the genome, regardless of how unique the read actually is.] For this particular example, it seems possible that in this individual either 1) the relevant genomic base is actually a T at one of the 20-30 regions which closely match the sequencing read (due to either an unannotated SNP or error in the reference genome), or 2) there is an additional copy of the region, absent from the reference human genome, which carries a T at the relevant position .
Strikingly, many of the examples the authors give in their Table 1 fall in regions with high percent identity to other parts of the genome, where these sorts of false positive calls are known to be rampant in SNP calling. An interesting example in which this is not the case is below.
Splice junctions cause read mapping problems
The authors point out that a number of the places where RNA and DNA don’t match fall near splice sites, and they speculate that RNA splicing induces editing. Let’s look at one such example, in the gene CNBP (from their Table 1). This position does not appear to have high percent identify to anywhere else in the genome. Below, I’m showing a couple of screen shots from the UCSC Genome Browser surrounding this splice site (the putative editing site is a T->A change at position 130,372,812). You can see that there are a number of different annotated splice sites at the 3′ end of the intron (the top picture), which are all spliced to a single 5′ splice site (the bottom picture). Read off the sequence of the shortest of the isoforms (the 3rd and 6th from the top)–the sequence of the gene goes CAGG|CATC, where the bar represents the position of the intron. Now consider reading through that splice site (for example, on the top isoform)–the sequence is CAGGCTTC. Imagine you had a read with the sequence CAGGcatc (a perfect match to the short isoform) and tried to map it. The mapper has a choice: either use a gap in the alignment to map across the intron without any mismatches in the read (this is the correct alignment), or map without a gap, but with a single T->A mismatch (this is an incorrect alignment). Many mappers will choose the latter, and this misalignment will cause a false signal of RNA editing.
Again, I found one of the reads in the data from Li et al. (2011) supporting the “edited” site discussed above. This read has the sequence TGCAGTCCTTGGCAATGTGGCCACCTCTACCGCAGTTATAGCAGGCATCC (again, you can map this read to the genome using BLAT), and again has an extremely high mapping quality of 255. I’ve italicized the sequence around the putative editing site. This read is an exact match to the short isoform (with no mismatches), but is mapped with two mismatches and no gaps to the longer isoform by Li et al. (2011), causing a spurious signal of RNA editing. The evidence for RNA editing from this sequencing read is a consequence of the “preference” of the read mapper for mismatches over gaps in alignment. Because of this effect, I’d be particularly suspicious of putative RNA editing events that fall near splice sites.
In this post, I’ve discussed a couple of technical issues that could lead to false inference of RNA editing from data like that used in Li et al. (2011). First, mismapping of reads in paralogous regions could lead to false signals of RNA editing. These false signals would even be replicated in follow-up experiements like those done by Li et al. (2011), because the two forms of RNA and protein are indeed present in the cell. However, the two forms of RNA and protein do not come from the same DNA sequence, and thus are not evidence of RNA editing. Second, mapping biases around splice sites (and other sorts of insertions/deletions in the genome) will cause mismapping and false inference of RNA editing.
So are all the results in Li et al. (2011) due to effects like those above (apart from the A->G editing, which is known to be common)? I can think of a number of analyses that might shed some light on this. First, do most putative RNA editing sites fall into regions with high homology to elsewhere in the genome or near splice sites, like I described for HMGN2 and CNBP above? Looking at Table 1 in Li et al. (2011), I see only one potential example that doesn’t, and it’s an A->G site in AZIN1, but it would be worth examing this on a global scale. Second, what about the biases described by people working in SNP calling—do putative edited nucleotides show a strand bias (do only sequencing reads mapping to one strand or the other show the editing event?) or position bias (do the putative edited nucleotides tend to fall near the start or end of the sequencing reads)? If biases like these are absent, perhaps these technical issues do not have a strong effect.
Finally, I should reassert that RNA editing is indeed an interesting and potentially important phenomenon in humans; that said, skepticism of some of the claims in this paper seems warranted.
 Li et al. (2011) Widespread RNA and DNA Sequence Differences in the Human Transcriptome. Science. doi: 10.1126/science.1207018
 The 1000 Genomes Project Consortium (2010) A map of human genome variation from population-scale sequencing. Nature. doi:10.1038/nature09534.
 Shouldn’t the genomic DNA sequences from this individual control for this? Not necessarily. The sequencing reads generated from genomic DNA by the 1000 Genomes Project have different lengths than those generated from the RNA by Li et al. (2011). Even subtle differences in the mapping properties between the RNA reads and DNA reads will lead to a subset of sites in the genome where RNA reads map well and DNA reads do not (or vice versa), particularly in multi-copy regions.