This is a guest post by Jeffrey Rosenfeld. Jeff is a next-generation sequencing advisor in the High Performance and Research Computing group at the University of Medicine and Dentistry of New Jersey, working on a variety of human and microbial genetics projects. He is also a Visiting Scientist at the American Museum of Natural History where he focuses on whole-genome phylogenetics. He was trained at the University of Pennsylvania, New York University and Cold Spring Harbor Laboratory.
As human geneticists, we find it all too easy to ignore papers published about non-human organisms – especially when those organisms are plants. After all, how much can the analysis of (say) Arabidopsis genome diversity possibly assist in my quest to better understand the human genome and determine which genes cause disease? Quite a bit, as it happens: a fascinating recent paper in Nature demonstrates a number of lessons that we can learn from our distant green relatives.
By exploiting the small genome size of Arabidopsis (~120 million bases, compared to the relatively gargantuan 3 billion bases of Homo sapiens), researchers were able to perform complete genome sequencing and transcriptome profiling in 18 different ecotypes of the plant (similar to what we would call strains of an animal).
In a normal genome re-sequencing experiment, the procedure is to obtain DNA from an individual, sequence the DNA, align it to a reference sequence and then call variants (i.e. differences from the reference). This approach is used by the 1000 Genomes Project and basically all of the hundreds of disease-focused human sequencing projects currently underway around the world. It makes it relatively easy to identify single-base substitution (SNP) and small insertion/deletion (indel) differences between genomes. However, the amount of variability that can be identified is restricted by the use of a reference: regions where there is extreme divergence between the reference and sample genomes are often badly called, and more complex variants (e.g. large, recurrent rearrangements of DNA) can be missed. Additionally, and crucially, sequences that are not present in the reference genome will be completely missed by this approach.
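To make the limitation concrete, here is a toy sketch of reference-based calling. It is not how production aligners or variant callers work: the function names, the brute-force ungapped alignment and the one-mismatch threshold are all assumptions chosen for illustration. It shows a read carrying a SNP being aligned and called, while a read drawn from sequence absent from the reference simply fails to align and contributes nothing.

```python
def align_read(read, reference, max_mismatches=1):
    """Return the best ungapped alignment position of `read`, or None.

    Toy stand-in for a real aligner: tries every position and keeps the
    one with the fewest mismatches, provided it is within the threshold.
    """
    best = None  # (position, mismatches)
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return None if best is None else best[0]


def call_snps(reads, reference, max_mismatches=1):
    """Call SNPs as (position, reference_base, sample_base), 0-based.

    Reads that cannot be aligned are returned separately; in a
    reference-based workflow their information is effectively lost.
    """
    snps = set()
    unaligned = []
    for read in reads:
        pos = align_read(read, reference, max_mismatches)
        if pos is None:
            unaligned.append(read)
            continue
        for offset, base in enumerate(read):
            ref_base = reference[pos + offset]
            if ref_base != base:
                snps.add((pos + offset, ref_base, base))
    return sorted(snps), unaligned


# Hypothetical toy data, invented for this example.
reference = "ACGTTGCAAGGCTTAACCGGATCG"
reads = [
    "TGCATGGC",  # matches reference[4:12] except for one substitution
    "AACCGGAT",  # exact copy of reference[14:22]
    "TTTTTTTT",  # stands in for sequence absent from the reference
]

snps, unaligned = call_snps(reads, reference)
print("SNPs:", snps)
print("Unaligned reads:", unaligned)
```

The unaligned read is the toy equivalent of a novel gene: no amount of careful variant calling can recover it, because there is nothing in the reference for it to be compared against.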
In this paper, Gan et al. were able to completely sequence the 18 Arabidopsis strains and – due to their relatively small size – assemble their genomes de novo (that is, reconstruct their genome sequences from scratch) without requiring comparison to a reference sequence. This allowed them to directly compare the 18 strains with the standard reference strain, Col-0, without the biases that arise from reference-based sequencing. One result in particular illustrates the value of de novo assembly: the analysis revealed that each ecotype contained, on average, 319 novel genes (or gene fragments) that were not part of the reference, which would have been completely missed using a standard resequencing approach.
In concert with the genome sequencing, the authors performed complete transcriptome profiling of the ecotypes using sequencing of RNA (the messenger molecule that allows DNA information to be converted into functional proteins). This transcriptome data was directly aligned to the genomic sequence of the source ecotype. As anybody who has worked with RNA-seq (or ChIP-seq) knows, the variability between the reference genome and the sampled genome is a major factor in determining whether the reads for the experiment will align properly. If there is strong divergence between the genomes, corresponding reads will not align and the information contained in those reads is lost. By using the ecotype-specific transcriptomes, the Arabidopsis researchers were able to directly align the reads and annotate the gene sequences without needing to worry about divergence from the reference: in other words, this provided a completely self-contained genome assembly and annotation.
All of this work benefited substantially from the unique characteristics of Arabidopsis, but it also provides a taste of what is to come for us humans as sequencing technology improves. Here are two key lessons:
The benefits of de novo assembly
By assembling the genomes, the researchers were able to find hundreds of novel genes not present in the standard reference genome (indeed, on the order of 1% of the predicted coding genes in each ecotype were absent in the reference). There’s no question that humans also differ from one another in this way – a 2009 study based on de novo assembly of two individuals suggested that each human genome contains as much as 5 million bases of novel sequence not present in the standard reference. It’s likely that at least some of the novel genes found in the Arabidopsis ecotypes are responsible for the physical differences between them, and we should also expect some fraction of the variation in human traits and disease risk to be similarly determined by variably present genes.
The power of combining genome and transcriptome data
There are a number of advantages to combining data from multiple levels in the flow of genetic information through a cell: for instance, by looking at both genome and RNA sequencing data it’s possible to quickly check whether a variant found in an individual’s genome is actually expressed at all. However, as discussed above, aligning RNA reads to a generic reference genome raises substantial challenges. In this paper the authors were able to directly compare the transcriptome and the genome of the same ecotype.
To bring this into a human context, consider a researcher performing ChIP-seq experiments to find histone modifications in highly variable regions of the genome. A large percentage of this sequence varies between genomes, so many reads from a sample may fail to align accurately to the normal reference. If a reference built from the genome of the actual sample being analysed were available, then these reads would align, and it would be possible to accurately assess histone binding in that region.
Unfortunately, this type of experiment remains extremely challenging in humans: our vastly larger genomes (roughly 25x the size of Arabidopsis) require correspondingly greater amounts of sequencing and extreme computational capabilities to perform true de novo assembly. Great progress is being made on de novo assembly of human-sized genomes, but this currently requires very high sequence coverage and advanced library construction techniques rather than the simple shotgun sequencing technique that is the current workhorse for human whole genome analysis. An interim solution is the production of a personalized reference genome, as performed by Rozowsky et al. or Dewey et al., where a personalized genome is created by inserting the variants detected by sequencing into the standard reference; it can then be used as a target for aligning ChIP-seq or RNA-seq data. However, reference-based personalized genomes are still poor cousins to genuine de novo assembled personal genomes: for one thing, they lack any large chunks of DNA present in an individual’s genome but missing from the reference.
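The idea behind a reference-based personalized genome can be sketched in a few lines. This is a minimal illustration, not the method used in either of the studies mentioned: the `personalize` function, the (position, reference allele, alternate allele) tuple format with 0-based positions, and the toy sequence are all assumptions made for the example. Variants are applied from the end of the sequence backwards so that an indel never shifts the coordinates of variants still waiting to be applied.

```python
def personalize(reference, variants):
    """Build a personalized sequence by substituting called variants.

    `variants` is an iterable of (position, ref_allele, alt_allele) with
    0-based positions relative to `reference`. Overlapping variants are
    not handled in this sketch.
    """
    genome = reference
    # Apply right-to-left so earlier coordinates stay valid after indels.
    for pos, ref_allele, alt_allele in sorted(variants, reverse=True):
        observed = genome[pos:pos + len(ref_allele)]
        if observed != ref_allele:
            raise ValueError(
                f"reference allele {ref_allele!r} does not match "
                f"{observed!r} at position {pos}"
            )
        genome = genome[:pos] + alt_allele + genome[pos + len(ref_allele):]
    return genome


# Hypothetical toy data, invented for this example.
reference = "ACGTTGCAAGGC"
variants = [
    (2, "G", "A"),    # SNP
    (4, "TG", "T"),   # 1-base deletion
    (7, "A", "ATT"),  # 2-base insertion
]

personal = personalize(reference, variants)
print(personal)  # ACATTCATTAGGC
```

Note what the sketch cannot do: every variant is expressed relative to a reference position, so a large stretch of DNA with no counterpart in the reference never enters the variant list in the first place, and therefore never appears in the personalized sequence.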
As in many scientific situations, this is a case where researchers studying human genetics have good reason to be jealous of the techniques that their colleagues can apply in “simpler” organisms. However, since all of the individuals performing the research are humans, and not yeast, flies or even Arabidopsis, these are problems that should (and will) be solved. Given the rapid pace of advances in sequencing and computation, we can expect this type of experiment in personal genomics to become routine in the not-too-distant future.