The ENCODE project: lessons for scientific publication

The ENCODE Project has this week released the results of its massive foray into exploring the function of the non-protein-coding regions of the human genome. This is a tremendous scientific achievement, and is receiving plenty of well-deserved press coverage; for particularly thorough summaries see Ed Yong’s excellent post at Discover and Brendan Maher’s at Nature.

I’m not going to spend time here recounting the project’s scientific merit – suffice it to say that the project’s analyses have already improved the way researchers are approaching the analysis of potential disease-causing genetic variants in non-coding regions, and will have an even greater impact over time. Instead, I want to highlight what a tremendous feat of scientific publication the project has achieved.

I don’t mean simply by coordinating the simultaneous release of over 30 peer-reviewed publications across three separate journals (Nature, Genome Research and Genome Biology), although that alone is an astonishing and unprecedented performance that has no doubt taken years off the life of the project coordinator, Ewan Birney. I mean this: an interactive website that allows readers to step outside the forced linear narrative of the traditional scientific paper and instead follow “threads” through many different publications across multiple journals. And these: figures in the main manuscript that allow readers to interactively explore the data being provided. And this: a virtual machine that allows external researchers to independently examine and reproduce many of the computational analyses performed by the project.

All of these are genuine innovations for consortium genomics, of which ENCODE should be proud. The virtual machine provides an incomplete but still extremely helpful picture of the computational approaches taken by the project – and certainly far exceeds the standard level of software transparency of scientific consortia. In addition, the project’s commitment to open-access publications is admirable, and its cross-journal aggregation via the project website is frankly astonishing – a clear case where benefits to the scientific community have outweighed journal politics and publication economics.

Today’s announcements serve as a model for future large-scale science: a model that transcends the traditional publication approach where a paper is the endpoint, and that emphasizes reproducibility, transparency and accessibility over impact factors alone as metrics for success.

At the same time, it is worth noting the constraints that the standard embargo model of scientific publication has still imposed on the project. Much of the ENCODE data was mature and ready for use 12 months ago, and for those in the know has been a valuable component of functional annotation pipelines. Many of us in the genomics community were aware of the progress the project had been making via conference presentations and hallway conversations with participants. However, many other researchers who might have benefited from early access to the ENCODE data simply weren’t aware of its existence until today’s dramatic announcement – and as a result, these people are 6-12 months behind in their analyses.

In an accompanying commentary in Nature, Ewan Birney makes similar points:

Funders have considerable influence in how raw and analysed data are released, and should design policies that maximize reuse. Early data-release policies focused on how data should be shared before publication, with clumsy etiquette-based restrictions on the first publications of global analysis, such as waiting for the authors who generated the data to publish their analyses before others can publish on the entire data set. These agreements are starting to show their age and a lack of clarity.

The new era of analysis calls for a rethink, with more focus on the release of intermediate analysis throughout the project, so that the community can use the resource more fully during the project; the 1000 Genomes consortium has done well in this regard.

As always, Birney’s frankness here is refreshing. (You can read his other thoughts on the ENCODE voyage in an excellent post on his own blog, here.) I hope that this message is taken on board by both funders and other scientific consortia.

The ENCODE data published today, as well as those that will be released over the coming years, provide powerful insights into the functional significance of a large and poorly understood fraction of the human genome, insights whose impact will be felt for many years. Hopefully the project’s lessons for the future of scientific publishing, both the things it did spectacularly well and the areas where it might have improved, will be felt just as deeply.
